Lesson 12: Nonparametrics

Read Sections: From The Little SAS Book

Review Sections 7.1-7.2 of The Little SAS Book.

Nonparametric statistics

Many familiar statistical procedures begin with the assumption that the data within each sample are normally distributed, or that the averages from random samples are normally distributed. If your data are not normally distributed, or if the sample sizes are too small to verify normality, then you may want to analyze the data with different procedures. Nonparametric statistical procedures use signs (indicators of whether a number is positive, negative, or zero), counts, and ranks instead of means and standard deviations, so they do not require normally-distributed data.

Standardization

Standardization is not necessarily a nonparametric procedure, but it is related to the idea that we may be more concerned about rankings than actual numerical measurements. There may be situations in which you need to adjust and rescale observations to have a different mean and standard deviation. For example, suppose that a professor wants the scores on his midterm to have a mean of 75 and a standard deviation of 10. After giving the test, though, he observed the following distribution of grades:

data midterm;
 input grade @@;
 datalines;
 64 71 80 69 55 84 77 63 68 90 66 61 84 43 80 
 66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
 80 73 40 74 63 68 95 66 59 70 73 62 64 62 77 
 81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
proc univariate data=midterm plot;
 var grade;
run;

PROC UNIVARIATE calculates the following:

 Moments                                                       

   
 N                60                          
 Mean       69.06667
 Std Dev    11.60489

          Stem Leaf                     #             Boxplot
             9 5                        1                |
             9 00                       2                |
             8 9                        1                |
             8 000124444                9                |
             7 7779                     4             +-----+
             7 00011123334             11             |     |
             6 6666888899              10             *--+--*
             6 0012222333444           13             +-----+
             5 5999                     4                |
             5 2                        1                |
             4 9                        1                |
             4 033                      3                |
               ----+----+----+----+
           Multiply Stem.Leaf by 10**+1

To adjust the grades to have a mean of 75 and a standard deviation of 10, as desired, the professor could multiply the actual scores by 10/11.605 = 0.8617. This gives a standard deviation of 10 but a mean of only 69.07 x 0.8617 = 59.5, so he would then add 75 - 59.5 = 15.5 points to each score. SAS can do this automatically with PROC STANDARD, as shown below.

proc standard data=midterm out=adjusted mean=75 std=10;
 var grade;
run;

The new dataset ADJUSTED has one variable, also called GRADE, which has a mean of 75 and a standard deviation of 10. For example, the grade of 95 in the MIDTERM dataset becomes a grade of 97.3 in the ADJUSTED dataset ((95-69.07)x0.8617+75=97.3).
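
The same rescaling can also be written out by hand, which makes the arithmetic explicit. The following is only a sketch: the dataset names STATS and BYHAND and the variable names GMEAN, GSTD, and NEWGRADE are made up for this illustration, and it assumes the MIDTERM dataset created above.

proc means data=midterm noprint;
 var grade;
 /* Save the sample mean and standard deviation of GRADE. */
 output out=stats mean=gmean std=gstd;
run;

data byhand;
 /* Read the one-row STATS dataset once; its values are retained. */
 if _n_=1 then set stats(keep=gmean gstd);
 set midterm;
 /* Center at the old mean, rescale to SD 10, re-center at 75. */
 newgrade=75+10*(grade-gmean)/gstd;
run;

proc means data=byhand mean std;
 var newgrade;
run;

If this works as intended, NEWGRADE should have a mean of 75 and a standard deviation of 10, matching the GRADE values that PROC STANDARD writes to its output dataset.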

One-sample tests of location

Recall the example with grades on a midterm exam. We may want to perform a formal test of whether the "typical grade" was different from 75, assuming that these students were a random sample from some larger population. PROC UNIVARIATE in SAS automatically performs three tests of location, one parametric and two nonparametric, but it does so by testing if the "typical" value is zero. We can coerce SAS to do the tests at a null value of 75 as follows:

data midterm;
 input grade @@;
 /* If GRADE is typically 75, then GRADE-75 should
    typically be zero. */
 diff=grade-75;
 label diff='Points above 75';
 datalines;
 64 71 80 69 55 84 77 63 68 90 66 61 84 43 80
 66 68 89 71 59 52 62 60 79 43 63 68 72 60 77
 80 73 40 74 63 68 95 66 59 70 73 62 64 62 77
 81 73 64 82 59 84 70 70 71 49 90 84 66 69 62
;
proc univariate data=midterm;
 var diff;
run;

This produces the following:

Univariate Procedure

Variable=DIFF          Points above 75

                 Moments

 N                60  Sum Wgts         60
 Mean       -5.93333  Sum            -356
 Std Dev    11.60489  Variance   134.6734
 Skewness   -0.17772  Kurtosis   0.250711
 USS           10058  CSS        7945.733
 CV         -195.588  Std Mean   1.498185
 T:Mean=0   -3.96035  Pr>|T|       0.0002
 Num ^= 0         60  Num > 0          17
 M(Sign)         -13  Pr>=|M|      0.0011
 Sgn Rank     -481.5  Pr>=|S|      0.0002

First, SAS performs the two-tailed normal-theory t-test of whether the population mean of DIFF is different from zero (which is equivalent to Ho: mean grade = 75). Here, the large negative value of t (-3.96) and the low p-value (.0002) provide evidence that the population mean grade is less than 75.

Next, M(Sign) = -13 is calculated from the 43 negative and 17 positive observations: M = (17-43)/2 = -13. If 75 had been the true median score, we would have expected to see about 30 negative and 30 positive observations, so M is also the observed number of positive values minus the expected number (17-30 = -13). Fisher's sign test based on M has a p-value of .0011, which indicates that the median of the distribution of grades is significantly lower than 75.

Finally, the number -481.5 is the statistic for the Wilcoxon signed rank test: the sum of the ranks of the positive differences minus the value expected for that sum under the null hypothesis. This test also asks whether the median difference is zero, but it assumes that the differences have a symmetric distribution. The signed rank test uses the signs (positive or negative) of DIFF and the ranks of the absolute values of DIFF. This test also leads to rejection of the hypothesis that the median grade is 75 (p=.0002).
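
Two small sketches may help here. The first assumes a more recent SAS release in which PROC UNIVARIATE accepts the MU0= option, so that the tests can be run on GRADE directly without creating DIFF. The second recomputes the sign test by hand from the counts printed above (60 nonzero differences, 17 of them positive), using the binomial distribution with probability 0.5. The dataset name SIGNCHECK and its variable names are made up for this illustration.

proc univariate data=midterm mu0=75;
 var grade;
run;

data signcheck;
 n=60;     /* number of nonzero differences (Num ^= 0) */
 npos=17;  /* number of positive differences (Num > 0) */
 /* Sign statistic: positives minus the expected n/2. */
 m=npos-n/2;
 /* Two-sided p-value: twice the lower binomial tail at the
    smaller of the two counts, capped at 1. */
 p=min(1, 2*probbnml(0.5, n, min(npos, n-npos)));
 put m= p=;
run;

The values of M and P written to the log should agree with the M(Sign) and Pr>=|M| entries in the PROC UNIVARIATE output.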

Ranking

Many nonparametric procedures rely on the relative ordering, or ranking, of the observations. For example, suppose that a new Florida resident wants to see if the prices of houses in Gainesville are higher than the prices of homes in smaller towns near Gainesville. She collects a random sample of 10 prices for houses on the market in Gainesville and 10 prices of homes in other cities in Alachua County. Results are shown below.

data homes;
 input location $ price @@;
 datalines;
Gville 74500  Gville 269000  Gville 94500 
Gville 86900  Gville 99900
Gville 91500  Gville 70000   Gville 78000 
Gville 289000 Gville 114000
County 32000  County 125000  County 105900 
County 120000 County 139900
County 72000  County 85000   County 74500 
County 199500 County 2200000
;

The $2.2 million home in Alachua County causes problems with familiar analysis of variance techniques. We shouldn't believe that the prices in either location are normally distributed, since home prices are prone to large outlying values. Instead, we can base our conclusions about the differences in price on the ranks of the prices. We want to know if one location tends to have higher-ranked prices than the other.

One way to obtain the ranks would be to sort the observations in order by price; then, apart from ties, the rank would be the observation number in the sorted dataset. We can also use PROC RANK in SAS to calculate the rankings, as shown below:

proc rank data=homes out=rankdata ties=mean;
 var price;
 ranks rankcost;

proc print data=rankdata;
 format price dollar10.;
run;

SAS produces the following output:

OBS    LOCATION         PRICE    RANKCOST

  1     Gville        $74,500       4.5
  2     Gville       $269,000      18.0
  3     Gville        $94,500      10.0
  4     Gville        $86,900       8.0
  5     Gville        $99,900      11.0
  6     Gville        $91,500       9.0
  7     Gville        $70,000       2.0
  8     Gville        $78,000       6.0
  9     Gville       $289,000      19.0
 10     Gville       $114,000      13.0
 11     County        $32,000       1.0
 12     County       $125,000      15.0
 13     County       $105,900      12.0
 14     County       $120,000      14.0
 15     County       $139,900      16.0
 16     County        $72,000       3.0
 17     County        $85,000       7.0
 18     County        $74,500       4.5
 19     County       $199,500      17.0
 20     County     $2,200,000      20.0

Notice that the price of $74,500 occurred twice. With the TIES=MEAN option, SAS assigns each of these two values the average of ranks 4 and 5, or 4.5. This practice is common in nonparametrics. With TIES=HIGH, SAS assigns all of the tied values the highest rank that any one of them would have received, while TIES=LOW assigns the lowest rank to each tied observation. Other options available in the PROC RANK statement include DESCENDING, which assigns low ranks to high values and vice versa, and FRACTION, which divides the ranks by the total sample size.
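
As a brief sketch of these options, the two steps below rank the same prices in other ways. The output dataset names RANKDESC and RANKFRAC and the rank variable names are made up for this illustration.

/* Rank 1 goes to the most expensive home; tied prices
   both receive the lower of their two ranks. */
proc rank data=homes out=rankdesc ties=low descending;
 var price;
 ranks rankhigh;
run;

/* Ranks divided by the number of nonmissing prices, so the
   values lie between 0 and 1. */
proc rank data=homes out=rankfrac fraction;
 var price;
 ranks fraccost;
run;

proc print data=rankdesc;
 format price dollar10.;
run;
proc print data=rankfrac;
 format price dollar10.;
run;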

Another way to rank the data would be to create groups of least expensive, inexpensive, moderate, expensive, and forget-it-because-I'm-not-a-football-coach price ranges. PROC RANK can do this with the GROUPS option, as shown below.

proc rank data=homes out=rankdata groups=5;
 var price;
 ranks pricegrp;

proc print data=rankdata;
 format price dollar10.;
run;

This creates five ordinal categories (0=least expensive, 4=most expensive) of price, based on the percentiles of the data. This divides the data into 5 groups with roughly equal numbers of observations in each group. The following output is produced.

OBS    LOCATION         PRICE    PRICEGRP

 1     Gville        $74,500        1   
 2     Gville       $269,000        4   
 3     Gville        $94,500        2   
 4     Gville        $86,900        1   
 5     Gville        $99,900        2   
 6     Gville        $91,500        2   
 7     Gville        $70,000        0   
 8     Gville        $78,000        1   
 9     Gville       $289,000        4   
10     Gville       $114,000        3   
11     County        $32,000        0   
12     County       $125,000        3   
13     County       $105,900        2   
14     County       $120,000        3   
15     County       $139,900        3   
16     County        $72,000        0   
17     County        $85,000        1   
18     County        $74,500        1   
19     County       $199,500        4   
20     County     $2,200,000        4   

As expected, there are four 2s, four 3s, and four 4s in the PRICEGRP rankings. There are only three 0s and five 1s, however, since the duplicated value $74,500 fell on the border between groups 0 and 1.
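
To see where the uneven split comes from, the group numbers can be recomputed from the ranks. This sketch assumes the grouping rule described in the SAS documentation for GROUPS=, namely FLOOR(rank*k/(n+1)) with k groups and n nonmissing values, applied here to the tie-averaged ranks. The names GRPCHECK and MYGRP are made up for this illustration.

proc rank data=homes out=grpcheck ties=mean;
 var price;
 ranks rankcost;
run;

data grpcheck;
 set grpcheck;
 /* k=5 groups and n=20 homes. */
 mygrp=floor(rankcost*5/(20+1));
run;

proc print data=grpcheck;
 format price dollar10.;
run;

Under this rule, ranks 1 through 3 give group 0, while the two tied ranks of 4.5 give FLOOR(22.5/21) = 1, so both $74,500 homes land in group 1. Without the tie, rank 4 would have fallen in group 0.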

Finally, PROC RANK can be used to produce a better normal probability plot than the one produced by PROC UNIVARIATE. The following example shows how to do this using PROC RANK to calculate the normal scores. BLOM refers to Blom's formula for approximating the expected normal scores from the ranks; if the data are indeed normally distributed, a plot of the data against the Blom scores should fall close to a straight line. In the example below, we consider the 20 homes to be a random sample of all homes for sale in Alachua County, and we want to see if price or log(price) more closely follows a normal distribution.

data homes;
 set homes;
 logprice=log(price);

proc rank data=homes out=rankdata normal=blom;
 var price logprice;
 ranks norm1 norm2;
proc plot data=rankdata;
 plot price*norm1 logprice*norm2;
run;

SAS produces these plots. The log-transformed price has a distribution which is closer to normal, since the points on its plot fall more closely on a straight line.

    Plot of PRICE*NORM1.  Legend: A = 1 obs, B = 2 obs, etc.

  PRICE|
3000000+
       |
       |
       |
       |
       |                                                   A
2000000+
       |
       |
       |
       |
       |
1000000+
       |
       |
       |
       |                                          A  A
       |                     A A AA AA A A A A A
      0|   A       B    B  A
       +-+------------+------------+------------+------------+-
        -2           -1            0            1            2

                         RANK FOR VARIABLE PRICE

  Plot of LOGPRICE*NORM2.  Legend: A = 1 obs, B = 2 obs, etc.

LOGPRICE|
      16+
        |
        |
        |
        |                                                  A
        |
      14+
        |
        |
        |
        |                                         A  A
        |                                      A
      12+                                    A
        |                           AA A A A
        |          B    B  A A A AA
        |
        |
        |  A
      10+                  
        ++------------+------------+------------+------------+-
        -2           -1            0            1            2

                        RANK FOR VARIABLE LOGPRICE
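
For readers who want to see what NORMAL=BLOM computes, Blom's score for an observation with rank r out of n is the standard normal quantile of (r - 3/8)/(n + 1/4). The sketch below applies that formula by hand with the PROBIT function. The names BLOMCHECK, RLOG, and BLOMLOG are made up for this illustration, and the hand-computed scores for the two tied $74,500 homes may differ slightly from those of PROC RANK, depending on how it treats ties.

proc rank data=homes out=blomcheck ties=mean;
 var logprice;
 ranks rlog;
run;

data blomcheck;
 set blomcheck;
 /* n=20 homes; PROBIT is the inverse standard normal CDF. */
 blomlog=probit((rlog-3/8)/(20+1/4));
run;

proc print data=blomcheck;
 var price logprice rlog blomlog;
run;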

Comparing typical values of two or more groups

The nonparametric version of analysis of variance is based on ranks. The Mann-Whitney test and the Wilcoxon rank sum test are equivalent nonparametric techniques to compare two groups, while the Kruskal-Wallis test is ordinarily used to compare three or more groups. All of these are available in PROC NPAR1WAY (nonparametric 1-way analysis of variance) in SAS.

Recall the prices of homes in Alachua County. To see if there is a difference in the median price between homes in and out of Gainesville, use the following:

proc npar1way wilcoxon data=homes;
 class location;
 var price;
run;

The WILCOXON option asks for the Mann-Whitney-Wilcoxon rank sum test if two groups are present in the CLASS variable; with three or more groups, the same command performs the Kruskal-Wallis test. This produces the following:

N P A R 1 W A Y  P R O C E D U R E
Wilcoxon Scores (Rank Sums) for Variable PRICE
Classified by Variable LOCATION
                     Sum of    Expected     Std Dev        Mean
  LOCATION    N      Scores    Under H0    Under H0       Score

  Gville     10  100.500000       105.0  13.2237824  10.0500000
  County     10  109.500000       105.0  13.2237824  10.9500000
Average Scores Were Used for Ties

Here, the "scores" are the ranks from 1-20. The sum of these ranks is 1+2+3+...+20 = 210. If the two groups are equivalent, then the ranks in each group should add to about 105.

Wilcoxon 2-Sample Test (Normal Approximation)
(with Continuity Correction of .5)

S =  100.500     Z = -.302485     Prob > |Z| = 0.7623

T-Test Approx. Significance = 0.7656

Kruskal-Wallis Test (Chi-Square Approximation)
CHISQ = 0.11580     DF =  1     Prob > CHISQ = 0.7336

This section prints 3 tests based on the ranks of the data. None of them finds the difference in home prices between locations to be significant. Here, the $2.2 million home is just regarded as "a large number which received the highest rank of 20," so its very high price does not have as much impact on the nonparametric test as it would on the two-sample t-test.
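
The Z and CHISQ statistics can be reproduced from the rank sum table. This sketch uses only numbers printed in the output above (S = 100.5, expected value 105, standard deviation 13.2237824); the dataset name ZCHECK and its variable names are made up for this illustration.

data zcheck;
 s=100.5;
 expected=105;
 sd=13.2237824;
 /* S is below its expected value, so the continuity
    correction of .5 moves it toward the expected value. */
 z=(s-expected+0.5)/sd;
 pz=2*probnorm(-abs(z));
 /* Kruskal-Wallis with two groups: the squared standardized
    rank sum without the continuity correction, 1 DF. */
 chisq=((s-expected)/sd)**2;
 pchi=1-probchi(chisq,1);
 put z= pz= chisq= pchi=;
run;

The values written to the log should be close to those in the Wilcoxon 2-Sample Test and Kruskal-Wallis lines of the PROC NPAR1WAY output.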

PROC NPAR1WAY can also perform tests to see if there are significant differences in the distribution functions of two or more samples. When there are two factors with no interaction, as in a randomized complete block design, Friedman's chi-square test is a nonparametric test that can be used to examine treatment differences. Friedman's test is available in SAS under PROC FREQ, but it is fairly complicated to perform. PROC FREQ also offers versions of rank sum tests for ordinal data. See SAS documentation for details.
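
As a rough sketch of the PROC FREQ approach to Friedman's test, the example below uses a small randomized complete block layout. The dataset RCBD, its variable names, and its values are entirely hypothetical and serve only to show the form of the statements. With the blocks as the stratifying variable and rank scores requested, the "Row Mean Scores Differ" line of the Cochran-Mantel-Haenszel output corresponds to Friedman's chi-square.

data rcbd;
 /* Hypothetical data: 4 blocks, 3 treatments, response Y. */
 input block trt $ y @@;
 datalines;
1 A 10  1 B 12  1 C 15
2 A  9  2 B 14  2 C 13
3 A 11  3 B 13  3 C 17
4 A  8  4 B 10  4 C 12
;
proc freq data=rcbd;
 /* Block is the stratum, treatment the row, response the column. */
 tables block*trt*y / cmh2 scores=rank noprint;
run;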

Nonparametric correlation coefficients

Recall that a linear correlation coefficient measures the strength of the tendency for two variables X and Y to follow a straight line (Lesson 10). We may be interested in measuring the tendency for X to increase or decrease with Y, without necessarily following a linear pattern. In PROC CORR, SAS can calculate both Spearman and Kendall correlation coefficients. With the Spearman formula, X is ranked, Y is ranked separately from X, and the Pearson correlation coefficient of the ranks of X and Y is calculated. The Kendall formula is based on the probability of observing Y2>Y1 when X2>X1. For both Spearman and Kendall correlation coefficients, a value of -1 indicates a perfect trend for Y to decrease as X increases, while a value of 1 indicates a perfect trend for Y to increase as X increases.

The following example is taken from page 716 of Elementary Statistics by Robert Johnson (PWS-Kent, Boston, 1992). Suppose that twelve students take a test, and we want to see if the students who finished the test early had higher scores than those who finished later. We don't know the exact time that each student spent on the test, but we do know the order in which the tests were turned in to be graded. Thus, we only have rank data for the time variable, so the Pearson linear correlation coefficient would not be appropriate. (The last person to turn in the test could have taken 30 minutes, an hour, or two hours, and the guess of the exact time would greatly influence the resulting Pearson correlation.) However, both the Spearman and Kendall correlation coefficients could legitimately be used. In SAS, use SPEARMAN and/or KENDALL as options in PROC CORR.

data students;
 input order grade @@;
 datalines;
 1 90 2 74 3 76 4 60 5 68 6 86 
 7 92 8 60 9 78 10 70 11 78 12 64
;
proc plot data=students;
 plot grade*order;
proc corr data=students spearman kendall;
 var grade;
 with order;
run;

The plot shows no obvious trends for grades to decrease or increase with order.

    Plot of GRADE*ORDER.  Legend: A = 1 obs, B = 2 obs, etc.

GRADE|
  100+
     |
     |                              A
     |A
     |                         A
     |
   80+
     |          A                             A         A
     |     A
     |                                             A
     |                    A
     |                                                       A
   60+               A                   A
     ++----+----+----+----+----+----+----+----+----+----+----+-
      1    2    3    4    5    6    7    8    9   10   11   12

                                 ORDER

The two correlation coefficients are shown below. Neither one indicates a significant trend for grades to increase or decrease with the order in which the tests were turned in.

Correlation Analysis

Spearman Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12

                  GRADE

ORDER          -0.17544
                 0.5855


Kendall Tau b Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 12

                  GRADE

ORDER          -0.12309
                 0.5815
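
The description of the Spearman coefficient as "the Pearson correlation of the ranks" can be checked directly. In this sketch, ORDER and GRADE are ranked separately with average ranks for ties, and the ordinary Pearson correlation of the two rank variables is then requested. The names SRANKS, RORDER, and RGRADE are made up for this illustration.

proc rank data=students out=sranks ties=mean;
 var order grade;
 ranks rorder rgrade;
run;

proc corr data=sranks pearson;
 var rgrade;
 with rorder;
run;

The Pearson coefficient for the ranks should agree with the Spearman coefficient reported for the original variables.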

