The general linear model in statistics has the form Y = b0 + b1X1 + b2X2 + ... + bkXk + e, where e is a random error term that is assumed to be normally distributed with mean 0 and constant variance, and to be independent of the error terms of all other observations. A linear regression model is one example of a GLM. Analysis of variance (ANOVA) and analysis of covariance (ANOCOVA) models are also examples of GLMs. These use indicator variables to represent the different categorical levels of a factor. An example of an indicator variable is X1=1 if the subject is male, 0 if female. Then, b1 represents the mean difference between males and females when all other predictors are held fixed.
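To make the interpretation of b1 concrete, here is a short derivation for the simplest case. It assumes a model with the single indicator X1 and no other predictors, which is a simplification chosen only for illustration.

```latex
\[
E[Y \mid X_1 = 1] - E[Y \mid X_1 = 0] = (b_0 + b_1) - b_0 = b_1
\]
```

Under that assumption, b0 is the mean response for females and b0 + b1 is the mean response for males.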
The most commonly used procedure for fitting these models in SAS is PROC GLM. The Little SAS Book describes PROC ANOVA rather than PROC GLM. However, PROC GLM follows the same general syntax as PROC ANOVA and is more versatile. Also, PROC ANOVA was developed specifically for balanced data, or data in which all combinations of factors are observed an equal number of times. This situation does not always occur. For these reasons, we will learn to use PROC GLM rather than PROC ANOVA.
When fitting a general linear model, the first statement is PROC GLM. Next, a CLASS statement is used. Here, list any variables whose values should be regarded as categories. For example, suppose that temperature is an effect in your model, with values of 1, 2, 3, and 4. If you list temperature in the CLASS statement, then SAS will fit a model which investigates differences among the four levels individually. For example, it will be possible for SAS to find that temperatures 1 and 3 have higher responses than temperatures 2 and 4. However, if you leave temperature out of the CLASS statement, then SAS will assume that 1, 2, 3, and 4 are numerical measurements of temperature, such as degrees Celsius, to be fitted with a regression coefficient. In other words, you will be telling SAS that you expect a constant gradient across 1, 2, 3, and 4, and you want to estimate the change in the response for each one-unit increase in temperature. A CLASS statement must list all character variables to be used in the model. If you do not use a CLASS statement in PROC GLM, then SAS will fit a linear regression model.
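The two treatments of temperature described above can be sketched as follows. The dataset name trial and the variable name response are hypothetical and used only for illustration; temperature is assumed to be a numeric variable with values 1 through 4.

```sas
/* Hypothetical dataset "trial" with numeric variables        */
/* response and temperature (values 1, 2, 3, 4).              */

/* Temperature as four categories: a separate mean per level. */
PROC GLM DATA=trial;
  CLASS temperature;
  MODEL response=temperature;
RUN;

/* Temperature as a numeric predictor: a single slope.        */
PROC GLM DATA=trial;
  MODEL response=temperature;
RUN;
QUIT;
```

With four levels, the first fit uses three model degrees of freedom for temperature and the second uses one, which is a quick way to confirm from the output which form SAS fitted.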
Next, the MODEL statement is used to specify the response and its predictors. Models involving categorical data, such as ANOVA and ANOCOVA models, can include interaction and nesting terms, and these must be carefully specified in the MODEL statement. The interaction of two factors A and B is used when we think that the differences among levels of B may depend on the level of A; this term is specified as A*B. If B is nested within A, then each level of A is measured with several levels of B, but the levels of B differ from one level of A to the next. This term is written as B(A). Do not write terms such as A*B and B(A) in the CLASS statement; they may appear only in the MODEL statement.
Suppose that we want to examine the effects of two types of fertilizer (FERTILIZ) on strawberry yields (YIELD). To do this, we use two varieties (VARIETY) of strawberry and use three different rates (RATE) of each fertilizer. Two replicates (REPLICAT) are measured for each set of conditions. The dataset may look like this:
DATA berry;
  INPUT fertiliz $ variety $ rate replicat yield @@;
  DATALINES;
K Red   .3 1 9.1   K Red   .3 2 9.0
K Red   .6 1 8.7   K Red   .6 2 8.4
K Red   .9 1 8.0   K Red   .9 2 8.4
K Sweet .3 1 9.3   K Sweet .3 2 9.2
K Sweet .6 1 9.0   K Sweet .6 2 8.7
K Sweet .9 1 8.3   K Sweet .9 2 8.5
N Red   .3 1 8.4   N Red   .3 2 8.8
N Red   .6 1 8.8   N Red   .6 2 8.9
N Red   .9 1 9.0   N Red   .9 2 8.9
N Sweet .3 1 8.7   N Sweet .3 2 9.0
N Sweet .6 1 9.2   N Sweet .6 2 9.3
N Sweet .9 1 9.1   N Sweet .9 2 9.5
;
A number of different models may be used to predict yields. Some are shown below.
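/* One-way ANOVA: variety is the only factor */
PROC GLM DATA=berry; CLASS variety; MODEL yield=variety;
/* Interaction term only: one mean for each variety-by-fertilizer combination */
PROC GLM DATA=berry; CLASS variety fertiliz; MODEL yield=variety*fertiliz;
/* One-way ANOVA: the three rates are treated as categories */
PROC GLM DATA=berry; CLASS rate; MODEL yield=rate;
/* Simple linear regression: rate is treated as a numeric predictor */
PROC GLM DATA=berry; MODEL yield=rate;
/* Quadratic regression in rate */
PROC GLM DATA=berry; MODEL yield=rate rate*rate;
/* Analysis of covariance: variety as a factor, rate as a numeric covariate */
PROC GLM DATA=berry; CLASS variety; MODEL yield=variety rate;
/* Two-way ANOVA with main effects only */
PROC GLM DATA=berry; CLASS variety fertiliz; MODEL yield=variety fertiliz;
/* Two-way ANOVA with interaction */
PROC GLM DATA=berry; CLASS variety fertiliz; MODEL yield=variety fertiliz variety*fertiliz; /* Equivalently: MODEL yield=variety fertiliz fertiliz*variety; */
/* Three-way factorial ANOVA with all two-way and three-way interactions */
PROC GLM DATA=berry; CLASS variety fertiliz rate; MODEL yield=variety fertiliz rate variety*fertiliz variety*rate fertiliz*rate variety*fertiliz*rate;
/* Rate treated as categorical and nested within fertilizer */
PROC GLM DATA=berry; CLASS fertiliz rate; MODEL yield=fertiliz rate(fertiliz);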
/* Rate numeric and nested within fertilizer: a separate slope in rate for each fertilizer */
PROC GLM DATA=berry; CLASS fertiliz; MODEL yield=fertiliz rate(fertiliz);
By default, SAS will include a constant term, or intercept, in the model.
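If a model without the constant term is wanted, the NOINT option of the MODEL statement suppresses it. The following is a minimal sketch using the berry data and the one-way variety model; whether dropping the intercept is appropriate depends on the analysis.

```sas
/* One-way model for the berry data with the intercept suppressed. */
PROC GLM DATA=berry;
  CLASS variety;
  MODEL yield=variety / NOINT;
RUN;
QUIT;
```

Note that with NOINT the total sum of squares is no longer corrected for the mean, so R-Square and the overall F test are not directly comparable with those from the default fit.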
Which model should be used? This depends on ...
SAS can provide the numbers you will need to conduct your own data analyses for any approach which involves linear models, but you should consult an appropriate reference when deciding which model to use. The more recently released PROC MIXED is particularly powerful for models that combine random and fixed effects.
To examine the output of PROC GLM, consider the example in which yield is modeled as a function of strawberry variety, type of fertilizer, and their interaction.
PROC GLM DATA=berry; CLASS fertiliz variety; MODEL yield=fertiliz variety fertiliz*variety/SOLUTION; run;
The SOLUTION option of the MODEL statement prints the parameter estimates, which is useful for showing the relative effect sizes. This produces the following output.
General Linear Models Procedure
Class Level Information

Class       Levels    Values
FERTILIZ         2    K N
VARIETY          2    Red Sweet

Number of observations in data set = 24
This section lets us verify that we have two fertilizers and two varieties of interest, and that there are 24 observations in the data. Information about missing observations is also printed here, if applicable.
Dependent Variable: YIELD

                                   Sum of           Mean
Source                  DF        Squares         Square    F Value    Pr > F
Model                    3     0.87166667     0.29055556       2.59    0.0816
Error                   20     2.24666667     0.11233333
Corrected Total         23     3.11833333

R-Square        C.V.     Root MSE    YIELD Mean
0.279530    3.790707    0.3351617     8.8416667
This section shows the ANOVA table, with degrees of freedom (DF), sums of squares, and an F value which tests whether any of the terms in the model are significant. The C. V. (coefficient of variation) is (root MSE/mean yield)(100%). R-Square is the model sum of squares divided by total sum of squares. This is commonly used to evaluate how well the model fits the data, but it should not be the only criterion of fit that you examine.
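As a check on these definitions, the summary statistics can be recomputed by hand from the sums of squares and mean squares printed in the ANOVA table. The subscripted symbols are notation introduced here for convenience.

```latex
\[
R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{0.87166667}{3.11833333} \approx 0.2795
\]
\[
\text{Root MSE} = \sqrt{MS_{\text{Error}}} = \sqrt{0.11233333} \approx 0.3352
\]
\[
\text{C.V.} = 100 \times \frac{\text{Root MSE}}{\text{mean yield}} = 100 \times \frac{0.3351617}{8.8416667} \approx 3.79
\]
\[
F = \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{0.29055556}{0.11233333} \approx 2.59
\]
```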
Source                  DF      Type I SS    Mean Square    F Value    Pr > F
FERTILIZ                 1     0.37500000     0.37500000       3.34    0.0826
VARIETY                  1     0.48166667     0.48166667       4.29    0.0515
FERTILIZ*VARIETY         1     0.01500000     0.01500000       0.13    0.7186

Source                  DF    Type III SS    Mean Square    F Value    Pr > F
FERTILIZ                 1     0.37500000     0.37500000       3.34    0.0826
VARIETY                  1     0.48166667     0.48166667       4.29    0.0515
FERTILIZ*VARIETY         1     0.01500000     0.01500000       0.13    0.7186
In this section, SAS presents Type I and Type III sums of squares and F statistics for their significance under a particular set of assumptions; namely, that fertilizer and variety should be modeled with fixed effects, and that the random error terms satisfy their requirements. The F test statistics shown here are not always the proper results to interpret! This depends on the design of the experiment.
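The Type I sums of squares are also called sequential sums of squares. Each term is adjusted only for the terms listed before it in the MODEL statement. Here, they test: the effect of FERTILIZ alone; the effect of VARIETY after accounting for FERTILIZ; and the FERTILIZ*VARIETY interaction after accounting for both main effects.
The Type III sums of squares are also called partial sums of squares. Each term is adjusted for all other terms in the model. Here, they test: the effect of FERTILIZ after accounting for VARIETY and the interaction; the effect of VARIETY after accounting for FERTILIZ and the interaction; and the FERTILIZ*VARIETY interaction after accounting for both main effects.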
Because the experiment is balanced, the Type I and Type III sums of squares are identical. Usually, the Type III sums of squares are used for inference, although the Type I sums of squares are used in specific situations. SAS can calculate Type II and Type IV sums of squares as well.
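The types of sums of squares that are printed can be requested with the SS1, SS2, SS3, and SS4 options of the MODEL statement. The following sketch asks for all four for the two-way model with interaction. For these balanced data they would be expected to agree, so the comparison is mainly instructive with unbalanced data.

```sas
/* Request Type I through Type IV sums of squares for the same model. */
PROC GLM DATA=berry;
  CLASS fertiliz variety;
  MODEL yield=fertiliz variety fertiliz*variety / SS1 SS2 SS3 SS4;
RUN;
QUIT;
```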
                                         T for H0:    Pr > |T|    Std Error of
Parameter                    Estimate  Parameter=0                  Estimate

INTERCEPT                      9.13 B        66.75       0.001         0.137
FERTILIZ          K           -0.30 B        -1.55       0.137         0.194
                  N            0.00 B          .           .             .
VARIETY           Red         -0.33 B        -1.72       0.100         0.194
                  Sweet        0.00 B          .           .             .
FERTILIZ*VARIETY  K Red        0.10 B         0.37       0.719         0.274
                  K Sweet      0.00 B          .           .             .
                  N Red        0.00 B          .           .             .
                  N Sweet      0.00 B          .           .             .
This output was requested by the SOLUTION option. There are many ways to estimate effects in a linear model with categorical predictors. SAS chooses to do so by alphabetizing the levels of each factor, then assigning an effect size of zero to the last alphabetically-ordered level of each factor and its interactions. To predict the response for, say, Fertilizer K for the Red variety, use the equation (Intercept) + (K effect) + (Red effect) + (K*Red interaction effect), or 9.13 - 0.30 - 0.33 + 0.10 = 8.60. The t-test values listed on the right can be used to test whether certain parameters are significantly different from zero; in this case, they compare the levels of each factor to the last alphabetically-ordered level (which is forced to be zero). The SOLUTION option is useful for determining how treatment effects can be contrasted or estimated within PROC GLM.
NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations. Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters.
This message reminds you that SAS chose one particular way of deriving its estimates. We could have obtained equivalent estimates by forcing the first level of each variable to have an effect of size 3, or by forcing the average of all levels of a factor to be zero, for example.
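Although the individual parameter estimates depend on the constraint that SAS chose, quantities such as the cell means do not. One way to obtain them directly is sketched below, using the LSMEANS and ESTIMATE statements of PROC GLM. The label 'K Red mean' is arbitrary, and the ESTIMATE coefficients assume the level ordering shown in the output above (K before N, Red before Sweet).

```sas
PROC GLM DATA=berry;
  CLASS fertiliz variety;
  MODEL yield=fertiliz variety fertiliz*variety / SOLUTION;
  /* Least-squares means for each fertilizer-by-variety combination. */
  LSMEANS fertiliz*variety / STDERR;
  /* K, Red cell mean written as Intercept + K + Red + K*Red.        */
  /* Interaction coefficients follow the order                       */
  /* K Red, K Sweet, N Red, N Sweet.                                 */
  ESTIMATE 'K Red mean' INTERCEPT 1 fertiliz 1 0 variety 1 0
                        fertiliz*variety 1 0 0 0;
RUN;
QUIT;
```

Both statements would be expected to report a value of about 8.60 for Fertilizer K with the Red variety, matching the hand calculation from the SOLUTION output.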
An analysis of a general linear model should include a check of the assumptions about the random error terms. To do this in PROC GLM, you must use an OUTPUT statement. The following statements show how to produce a residual plot for the model above.
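PROC GLM DATA=berry;
  CLASS fertiliz variety;
  MODEL yield=fertiliz variety fertiliz*variety/SOLUTION;
  /* Save predicted values and residuals in the dataset results */
  OUTPUT OUT=results P=pred R=resid;
/* Plot residuals against predicted values */
PROC PLOT DATA=results;
  PLOT resid*pred;
RUN;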
The following plot is produced.
[Plot of RESID*PRED. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: RESID, scaled from -1.0 to 0.5. Horizontal axis: PRED, scaled from 8.6 to 9.2.]
There are only four unique predicted values, corresponding to the sample means for K & Red, K & Sweet, N & Red, and N & Sweet. Based on this plot, there is no compelling evidence that the variability of the residuals depends on the treatment. If the residuals form a pattern, such as a funnel or football shape, then the inferences from the F tests may not be valid.
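The residual plot addresses constant variance. The normality assumption can be examined from the same saved residuals. The following is a minimal sketch using PROC UNIVARIATE, assuming the dataset results and the variable resid created by the OUTPUT statement above.

```sas
/* Normality tests and descriptive plots for the saved residuals. */
PROC UNIVARIATE DATA=results NORMAL PLOT;
  VAR resid;
RUN;
```

The NORMAL option requests tests of normality and the PLOT option requests descriptive plots, including a normal probability plot. With only 24 observations, these are best read as rough guides rather than as decisive tests.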
PROC GLM is a very diverse tool. It also provides the following capabilities:
Consult SAS documentation for details.
SAS provides several other procedures for analyzing linear models. Most of them follow the same general syntax as PROC GLM.