Read Sections: From The Little SAS Book
Review Sections 7.1-7.2 of The Little SAS Book.
Many familiar statistical procedures begin with the assumption that the data within each sample are normally distributed, or that the averages from random samples are normally distributed. If your data are not normally distributed, or if the sample sizes are too small to verify normality, then you may want to analyze the data with different procedures. Nonparametric statistical procedures use signs (indicators of whether a number is positive, negative, or zero), counts, and ranks instead of means and standard deviations, so they do not require normally-distributed data.
Standardization is not necessarily a nonparametric procedure, but it is related to the idea that we may be more concerned about rankings than actual numerical measurements. There may be situations in which you need to adjust and rescale observations to have a different mean and standard deviation. For example, suppose that a professor wants the scores on his midterm to have a mean of 75 and a standard deviation of 10. After giving the test, though, he observed the following distribution of grades:
data midterm;
   input grade @@;
   datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
proc univariate data=midterm plot;
   var grade;
run;
PROC UNIVARIATE calculates the following:
Moments
N 60
Mean 69.06667
Std Dev 11.60489
 Stem Leaf             #  Boxplot
    9 5                1     |
    9 00               2     |
    8 9                1     |
    8 000124444        9     |
    7 7779             4  +-----+
    7 00011123334     11  |     |
    6 6666888899      10  *--+--*
    6 0012222333444   13  +-----+
    5 5999             4     |
    5 2                1     |
    4 9                1     |
    4 033              3     |
      ----+----+----+----+
 Multiply Stem.Leaf by 10**+1
To adjust the grades to have a mean of 75 and a standard deviation of 10, as desired, the professor could multiply the actual scores by 10/11.605 = 0.86, which rescales the mean to 69.1 x 0.86 = 59.5, and then add 75 - 59.5 = 15.5 points. SAS can do this automatically with PROC STANDARD, as shown below.
proc standard data=midterm out=adjusted mean=75 std=10;
   var grade;
run;
The new dataset ADJUSTED has one variable, also called GRADE, which has a mean of 75 and a standard deviation of 10. For example, the grade of 95 in the MIDTERM dataset becomes a grade of about 97.3 in the ADJUSTED dataset ((95 - 69.1) x 0.86 + 75 = 97.3).
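The rescaling can also be checked by hand. The following is a minimal sketch, assuming the MIDTERM dataset created above and the mean and standard deviation printed by PROC UNIVARIATE; the names CHECK, ZSCORE, and NEWGRADE are made up for this illustration.

```sas
/* Sketch only: rescale GRADE by hand.                     */
/* 69.06667 and 11.60489 are the mean and standard         */
/* deviation reported by PROC UNIVARIATE for MIDTERM.      */
/* CHECK, ZSCORE and NEWGRADE are hypothetical names.      */
data check;
   set midterm;
   zscore   = (grade - 69.06667) / 11.60489;  /* mean 0, SD 1   */
   newgrade = 75 + 10*zscore;                 /* mean 75, SD 10 */
run;

/* NEWGRADE should have mean 75 and standard deviation 10, */
/* up to rounding of the two constants.                    */
proc means data=check n mean std;
   var grade newgrade;
run;
```

NEWGRADE should agree with the GRADE values in ADJUSTED to within rounding error.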
Recall the example with grades on a midterm exam. We may want to perform a formal test of whether the "typical grade" was different from 75, assuming that these students were a random sample from some larger population. PROC UNIVARIATE in SAS automatically performs three tests of location, one parametric and two nonparametric, but it does so by testing if the "typical" value is zero. We can coerce SAS to do the tests at a null value of 75 as follows:
data midterm;
   input grade @@;
   /* If GRADE is typically 75, then GRADE-75 should typically be zero. */
   diff=grade-75;
   label diff='Points above 75';
   datalines;
64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
proc univariate data=midterm;
   var diff;
run;
This produces the following:
Univariate Procedure
Variable=DIFF Points above 75
Moments
N 60 Sum Wgts 60
Mean -5.93333 Sum -356
Std Dev 11.60489 Variance 134.6734
Skewness -0.17772 Kurtosis 0.250711
USS 10058 CSS 7945.733
CV -195.588 Std Mean 1.498185
T:Mean=0 -3.96035 Pr>|T| 0.0002
Num ^= 0 60 Num > 0 17
M(Sign) -13 Pr>=|M| 0.0011
Sgn Rank -481.5 Pr>=|S| 0.0002
First, SAS performs the two-tailed normal-theory t test of whether the population mean of DIFF is different from zero (which is equivalent to Ho: mean grade = 75). Here, the large negative value of t (-3.96) and the small p-value (.0002) provide evidence that the population mean grade is less than 75. Next, M(Sign) = -13 is calculated from the 43 negative and 17 positive observations: M = (17 - 43)/2 = -13. If 75 had been the true median score, we would have expected to see about 30 negative and 30 positive observations, so M is the observed number of positive values minus the expected number (17 - 30 = -13). Fisher's sign test based on M has a p-value of .0011, which indicates that the median of the distribution of grades is significantly lower than 75. Finally, the number -481.5 is related to the Wilcoxon signed rank test, which also tests whether the median difference is zero while assuming a symmetric distribution of the differences. The signed rank test uses the signs (positive or negative) of DIFF and the ranks of the absolute values of DIFF. This test also leads to rejection of the hypothesis that the median grade is 75 (p=.0002).
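The sign test p-value can be reproduced from the binomial distribution, since under the null hypothesis each nonzero difference is positive with probability 0.5. The sketch below assumes the counts shown in the output (17 positive values out of 60 nonzero); the variable names are hypothetical. The second step assumes a SAS release in which PROC UNIVARIATE accepts the MU0= option, which requests the same three tests against 75 without creating DIFF.

```sas
/* Sketch only: two-sided sign test computed by hand.      */
/* Under H0 the number of positive differences is          */
/* Binomial(60, 0.5); 17 positive values were observed.    */
data _null_;
   npos = 17;                /* Num > 0 in the output      */
   n    = 60;                /* Num ^= 0 in the output     */
   m    = npos - n/2;        /* sign statistic M           */
   /* PROBBNML(p, n, k) returns P(X <= k) for a binomial   */
   pval = min(1, 2*probbnml(0.5, n, min(npos, n - npos)));
   put m= pval=;             /* written to the SAS log     */
run;

/* Assumption: this release supports MU0= in the PROC      */
/* UNIVARIATE statement. It tests GRADE against 75.        */
proc univariate data=midterm mu0=75;
   var grade;
run;
```

The value of PVAL written to the log should be close to the Pr>=|M| entry in the output above.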
Many nonparametric procedures rely on the relative ordering, or ranking, of the observations. For example, suppose that a new Florida resident wants to see if the prices of houses in Gainesville are higher than the prices of homes in smaller towns near Gainesville. She collects a random sample of 10 prices for houses on the market in Gainesville and 10 prices of homes in other cities in Alachua County. Results are shown below.
data homes;
   input location $ price @@;
   datalines;
Gville 74500 Gville 269000 Gville 94500 Gville 86900 Gville 99900
Gville 91500 Gville 70000 Gville 78000 Gville 289000 Gville 114000
County 32000 County 125000 County 105900 County 120000 County 139900
County 72000 County 85000 County 74500 County 199500 County 2200000
;
The $2.2 million home in Alachua County causes problems with familiar analysis of variance techniques. We should not assume that home prices in either location are normally distributed, since they are prone to large outlying values. Instead, we can base our conclusions about the differences in price on the ranks of the prices. We want to know if one location tends to have higher-ranked prices than the other.
One way to obtain the ranks would be to sort the observations by price; then, the rank would be the observation number in the sorted dataset. We can also use PROC RANK in SAS to calculate the rankings, as shown below:
proc rank data=homes out=rankdata ties=mean;
   var price;
   ranks rankcost;
proc print data=rankdata;
   format price dollar10.;
run;
SAS produces the following output:
OBS  LOCATION       PRICE  RANKCOST
  1  Gville       $74,500       4.5
  2  Gville      $269,000      18.0
  3  Gville       $94,500      10.0
  4  Gville       $86,900       8.0
  5  Gville       $99,900      11.0
  6  Gville       $91,500       9.0
  7  Gville       $70,000       2.0
  8  Gville       $78,000       6.0
  9  Gville      $289,000      19.0
 10  Gville      $114,000      13.0
 11  County       $32,000       1.0
 12  County      $125,000      15.0
 13  County      $105,900      12.0
 14  County      $120,000      14.0
 15  County      $139,900      16.0
 16  County       $72,000       3.0
 17  County       $85,000       7.0
 18  County       $74,500       4.5
 19  County      $199,500      17.0
 20  County    $2,200,000      20.0
Notice that the price of $74,500 occurred twice. With the TIES=MEAN option, SAS assigns these two values the averages of ranks 4 and 5, or 4.5 each. This practice is common in nonparametrics. With TIES=HIGH, SAS assigns all of the tied values the highest rank that any one of them would have received, while TIES=LOW assigns the lowest rank to each tied observation. Other options available in the PROC RANK statement include DESCENDING, which assigns low ranks to high values and vice versa, and FRACTION, which divides the ranks by the total sample size.
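The sort-based approach mentioned earlier, and two of the PROC RANK options just described, can be sketched as follows. This assumes the HOMES dataset from above; SORTED, SORTRANK, and RANKDESC are hypothetical names chosen for the illustration.

```sas
/* Sketch only: ranks taken from the sorted order.         */
/* SORTED and SORTRANK are hypothetical names.             */
proc sort data=homes out=sorted;
   by price;
run;

data sorted;
   set sorted;
   sortrank = _n_;   /* observation number after sorting   */
run;

/* DESCENDING gives rank 1 to the most expensive home;     */
/* TIES=LOW gives tied prices the lowest of their ranks.   */
proc rank data=homes out=rankdesc descending ties=low;
   var price;
   ranks rankcost;
run;
```

Note that the sort-based ranks give the two $74,500 homes different ranks (4 and 5), which is one reason to prefer PROC RANK when ties are present.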
Another way to rank the data would be to create groups of least expensive, inexpensive, moderate, expensive, and forget-it-because-I'm-not-a-football-coach price ranges. PROC RANK can do this with the GROUPS option, as shown below.
proc rank data=homes out=rankdata groups=5;
   var price;
   ranks pricegrp;
proc print data=rankdata;
   format price dollar10.;
run;
This creates five ordinal categories (0=least expensive, 4=most expensive) of price, based on the percentiles of the data. This divides the data into 5 groups with roughly equal numbers of observations in each group. The following output is produced.
OBS  LOCATION       PRICE  PRICEGRP
  1  Gville       $74,500         1
  2  Gville      $269,000         4
  3  Gville       $94,500         2
  4  Gville       $86,900         1
  5  Gville       $99,900         2
  6  Gville       $91,500         2
  7  Gville       $70,000         0
  8  Gville       $78,000         1
  9  Gville      $289,000         4
 10  Gville      $114,000         3
 11  County       $32,000         0
 12  County      $125,000         3
 13  County      $105,900         2
 14  County      $120,000         3
 15  County      $139,900         3
 16  County       $72,000         0
 17  County       $85,000         1
 18  County       $74,500         1
 19  County      $199,500         4
 20  County    $2,200,000         4
As expected, there are 4 twos, threes, and fours in the PRICEGRP rankings. There are 3 zeroes and 5 ones in PRICEGRP, since the duplicated value $74,500 fell on the border between groups.
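To see why the tied value lands in group 1, it may help to compute the groups by hand. SAS documentation describes the GROUPS= value as FLOOR(rank*k/(n+1)), where k is the number of groups and n is the number of nonmissing observations. The sketch below assumes that formula and the HOMES dataset; GRPCHECK and BYHAND are hypothetical names.

```sas
/* Sketch only: reproduce GROUPS=5 from the mean ranks.    */
/* GRPCHECK and BYHAND are hypothetical names.             */
proc rank data=homes out=grpcheck ties=mean;
   var price;
   ranks rankcost;
run;

data grpcheck;
   set grpcheck;
   /* k = 5 groups, n = 20 nonmissing prices */
   byhand = floor(rankcost*5/(20 + 1));
run;

proc print data=grpcheck;
   var location price rankcost byhand;
   format price dollar10.;
run;
```

For the two $74,500 homes, the rank of 4.5 gives FLOOR(22.5/21) = 1, while an untied rank of 4 would have given FLOOR(20/21) = 0.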
Finally, PROC RANK can be used to produce a better normal probability plot than the one produced by PROC UNIVARIATE. The following example shows how to do this using PROC RANK to calculate the normal scores. BLOM refers to Blom's formula for approximating normal scores; if the data are indeed normally distributed, a plot of the data against these scores should fall close to a straight line. In the example below, we consider the 20 homes to be a random sample of all homes for sale in Alachua County, and we want to see if price or log(price) more closely follows a normal distribution.
data homes;
   set homes;
   logprice=log(price);
proc rank data=homes out=rankdata normal=blom;
   var price logprice;
   ranks norm1 norm2;
proc plot data=rankdata;
   plot price*norm1 logprice*norm2;
run;
SAS produces these plots. The log-transformed price has a distribution which is closer to normal, since the points on its plot fall more closely on a straight line.
[Plot of PRICE*NORM1: PRICE (vertical axis, 0 to 3,000,000) against RANK FOR VARIABLE PRICE, the normal score NORM1 (horizontal axis, -2 to 2). Legend: A = 1 obs, B = 2 obs, etc.]
[Plot of LOGPRICE*NORM2: LOGPRICE (vertical axis, 10 to 16) against RANK FOR VARIABLE LOGPRICE, the normal score NORM2 (horizontal axis, -2 to 2). Legend: A = 1 obs, B = 2 obs, etc.]
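The normal scores on the horizontal axes can be approximated by hand. Blom's formula converts a rank r out of n observations to the standard normal quantile of (r - 3/8)/(n + 1/4). The sketch below assumes that formula and the HOMES dataset; BLOMCHECK, R, and BLOMSCORE are hypothetical names, and the values may differ slightly from NORM1 for the tied prices, depending on how SAS handles ties when NORMAL= is used.

```sas
/* Sketch only: Blom normal scores computed from ranks.    */
/* BLOMCHECK, R and BLOMSCORE are hypothetical names.      */
proc rank data=homes out=blomcheck ties=mean;
   var price;
   ranks r;
run;

data blomcheck;
   set blomcheck;
   /* Blom's approximation with n = 20 observations;       */
   /* PROBIT is the inverse standard normal CDF.           */
   blomscore = probit((r - 0.375)/(20 + 0.25));
run;

proc print data=blomcheck;
   var price r blomscore;
run;
```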
The nonparametric version of analysis of variance is based on ranks. The Mann-Whitney test and the Wilcoxon rank sum test are equivalent nonparametric techniques to compare two groups, while the Kruskal-Wallis test is ordinarily used to compare three or more groups. All of these are available in PROC NPAR1WAY (nonparametric 1-way analysis of variance) in SAS.
Recall the prices of homes in Alachua County. To see if there is a difference in the median price between homes in and out of Gainesville, use the following:
proc npar1way wilcoxon data=homes;
   class location;
   var price;
run;
The WILCOXON option asks for the Mann-Whitney-Wilcoxon rank sum test if two groups are present in the CLASS variable; with three or more groups, the same command performs the Kruskal-Wallis test. This produces the following:
N P A R 1 W A Y P R O C E D U R E
Wilcoxon Scores (Rank Sums) for Variable PRICE
Classified by Variable LOCATION
                  Sum of    Expected       Std Dev         Mean
LOCATION   N      Scores    Under H0      Under H0        Score
Gville    10  100.500000       105.0    13.2237824   10.0500000
County    10  109.500000       105.0    13.2237824   10.9500000
Average Scores Were Used for Ties
Here, the "scores" are the ranks from 1 to 20. The sum of these ranks is 1+2+3+...+20 = 210. If the two groups are equivalent, then each group of 10 homes should have ranks which add to about half of this total, or 105.
Wilcoxon 2-Sample Test (Normal Approximation)
(with Continuity Correction of .5)
S = 100.500   Z = -.302485   Prob > |Z| = 0.7623
T-Test Approx. Significance = 0.7656
Kruskal-Wallis Test (Chi-Square Approximation)
CHISQ = 0.11580   DF = 1   Prob > CHISQ = 0.7336
This section prints three tests based on the ranks of the data. None of them declares the difference in home prices between locations to be significant. Here, the $2.2 million home is just regarded as "a large number which received the highest rank of 20," so its very high price does not have as much impact on the nonparametric test as it would on the two-sample t-test.
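The Z and chi-square statistics can be traced back to the rank sum table. The sketch below assumes the values printed by PROC NPAR1WAY (S = 100.5 and a standard deviation under H0 of 13.2237824); all variable names are hypothetical.

```sas
/* Sketch only: rebuild Z and CHISQ from the rank sums.    */
data _null_;
   n1 = 10;
   n2 = 10;
   s        = 100.5;               /* rank sum for Gville  */
   expected = n1*(n1 + n2 + 1)/2;  /* 10*21/2 = 105        */
   sd       = 13.2237824;          /* Std Dev Under H0     */
   /* the continuity correction moves S half a unit        */
   /* toward its expected value                            */
   z   = (s - expected + 0.5*sign(expected - s))/sd;
   pz  = 2*(1 - probnorm(abs(z)));
   /* with two groups, Kruskal-Wallis is the squared,      */
   /* uncorrected standardized rank sum                    */
   kw  = ((s - expected)/sd)**2;
   pkw = 1 - probchi(kw, 1);
   put expected= z= pz= kw= pkw=;  /* written to the log   */
run;
```

The values written to the log should be close to Z = -.302485 and CHISQ = 0.11580 in the output above.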
PROC NPAR1WAY can also perform tests to see if there are significant differences in the distribution functions of two or more samples. When there are two factors with no interaction, as in a randomized complete block design, Friedman's chi-square test is a nonparametric test that can be used to examine treatment differences. Friedman's test is available in SAS under PROC FREQ, but it is fairly complicated to perform. PROC FREQ also offers versions of rank sum tests for ordinal data. See SAS documentation for details.
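As a rough sketch of both ideas: the EDF option of PROC NPAR1WAY requests tests based on the empirical distribution functions, and Friedman's test is commonly obtained from PROC FREQ as a Cochran-Mantel-Haenszel statistic computed on rank scores within blocks. The RCBD dataset and its values below are invented purely to show the layout and are not taken from this lesson.

```sas
/* EDF requests tests based on the empirical distribution  */
/* functions, such as Kolmogorov-Smirnov.                  */
proc npar1way data=homes edf;
   class location;
   var price;
run;

/* Hypothetical randomized block data, invented only to    */
/* show the layout: 4 blocks, 3 treatments.                */
data rcbd;
   input block treatment $ response @@;
   datalines;
1 A 10  1 B 12  1 C 15
2 A  9  2 B 14  2 C 13
3 A 11  3 B 13  3 C 16
4 A  8  4 B 11  4 C 12
;

/* With rank scores and BLOCK as the stratum, the          */
/* "Row Mean Scores Differ" CMH statistic corresponds to   */
/* Friedman's chi-square.                                  */
proc freq data=rcbd;
   tables block*treatment*response / cmh2 scores=rank noprint;
run;
```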
Recall that a linear correlation coefficient measures the strength of the tendency for two variables X and Y to follow a straight line (Lesson 10). We may be interested in measuring the tendency for X to increase or decrease with Y, without necessarily following a linear pattern. In PROC CORR, SAS can calculate both Spearman and Kendall correlation coefficients. With the Spearman formula, X is ranked, Y is ranked separately from X, and the Pearson correlation coefficient of the ranks of X and Y is calculated. The Kendall formula is based on the probability of observing Y2>Y1 when X2>X1. For both Spearman and Kendall correlation coefficients, a value of -1 indicates a perfect trend for Y to decrease as X increases, while a value of 1 indicates a perfect trend for Y to increase as X increases.
The following example is taken from page 716 of Elementary Statistics by Robert Johnson (PWS-Kent, Boston, 1992). Suppose that twelve students take a test, and we want to see if the students who finished the test early had higher scores than those who finished later. We don't know the exact time that each student spent on the test, but we do know the order in which the tests were turned in to be graded. Thus, we only have rank data for the time variable, so the Pearson linear correlation coefficient would not be appropriate. (The last person to turn in the test could have taken 30 minutes, an hour, or two hours, and the guess of the exact time would greatly influence the resulting Pearson correlation.) However, both the Spearman and Kendall correlation coefficients could legitimately be used. In SAS, use SPEARMAN and/or KENDALL as options in PROC CORR.
data students;
   input order grade @@;
   datalines;
1 90 2 74 3 76 4 60 5 68 6 86
7 92 8 60 9 78 10 70 11 78 12 64
;
proc plot data=students;
   plot grade*order;
proc corr data=students spearman kendall;
   var grade;
   with order;
run;
The plot shows no obvious trends for grades to decrease or increase with order.
[Plot of GRADE*ORDER: GRADE (vertical axis, 60 to 100) against ORDER (horizontal axis, 1 to 12), one point per student. Legend: A = 1 obs, B = 2 obs, etc.]
The two correlation coefficients are shown below. Neither one indicates a significant trend for grades to increase or decrease with the order in which the tests were turned in.
Correlation Analysis
Spearman Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12
            GRADE
ORDER    -0.17544
           0.5855
Kendall Tau b Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12
            GRADE
ORDER    -0.12309
           0.5815
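The description of the Spearman coefficient as the Pearson correlation of the ranks can be checked directly. The sketch below assumes the STUDENTS dataset from above; SRANKS, RORDER, and RGRADE are hypothetical names.

```sas
/* Sketch only: Spearman as the Pearson correlation of     */
/* ranks. SRANKS, RORDER and RGRADE are hypothetical.      */
proc rank data=students out=sranks ties=mean;
   var order grade;
   ranks rorder rgrade;
run;

proc corr data=sranks pearson;
   var rgrade;
   with rorder;
run;
```

The Pearson coefficient of RORDER and RGRADE should match the Spearman coefficient of -0.17544 reported above. The p-values may not agree exactly if the two calculations use different approximations.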