Lesson 9: Frequency Tables

SAS Manual: Depertment of Statistics Webpage (www.stat.ufl.edu) => Computing Environment => Online Software Documentation => SAS manuals (UF viewing only) =>SAS Procedures Guide => The FREQ Procedure.

Categorical variables

You may encounter situations in which you need to analyze counts of objects that are classified by one or more categorical variables. Nominal categorical variables are those with two or more possible levels, with no inherent ordering of the levels. Examples include gender, race, and nationality. Ordinal categorical variables have three or more levels. Those levels are ordered, but there is no exact measurable distance between any two levels. Examples include political affiliation (liberal, moderate, conservative), grades in a class (A, B, C, D, F), and ripeness (green, ripe, overripe, rotten).

PROC FREQ

PROC FREQ is the main tool in SAS for obtaining a simple summary of cross-classified data. PROC FREQ shows cell, column, and row totals and proportions, and it can also perform some hypothesis tests on the data. SAS has other PROCs, such as LOGISTIC and GENMOD, for fitting models to categorical data.

Unlike other statistical procedures in SAS, PROC FREQ uses a TABLES statement, rather than a VAR statement, to specify which variables should be analyzed. This is shown in the following simple example.

DATA students;
 INPUT grade $ @@;
 DATALINES;
 A A B B C A B A C D C C B A B B C C B C C B B B C D 
 C D E B B C C C A C I C C D B B B C C B B A C B B A 
 C A B B C B B A C B A C C A B A B C B A A B B C B C
;
PROC FREQ DATA=students;
 TABLES grade;
RUN;

SAS provides the following output.

                             Cumulative  Cumulative
GRADE   Frequency   Percent   Frequency    Percent
---------------------------------------------------
A             15      19.2          15       19.2
B             30      38.5          45       57.7
C             27      34.6          72       92.3
D              4       5.1          76       97.4
E              1       1.3          77       98.7
I              1       1.3          78      100.0

PROC FREQ also handles cross-classified data representing two or more categorical variables. The following data appear in Alan Agresti's Categorical Data Analysis (New York, Wiley, 1990) and were taken from a 1984 study by Mendenhall et.al. Forty-one patients with cancer of the larynx were given either surgery or radiation therapy. Each case was a success or failure, depending on whether the cancer was controlled. The following SAS program shows one way to enter the data and obtain a frequency table.

DATA cancer;
 INPUT treatmnt $ outcome $ @@;
 DATALINES;
s s s s s s s s s s s s s s s s s s s s s s s s s s s s s s s
s 
s s s s s s s s s s 
s f s f
r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s 
r f r f r f 
;
PROC FORMAT;
 VALUE $trtfmt 's'='Surgery' 'r'='Radiation';
 VALUE $resfmt 's'='Success' 'f'='Failure';
PROC FREQ DATA=cancer;
 TABLES treatmnt*outcome;
 FORMAT treatmnt $trtfmt. outcome $resfmt.;
RUN;

Instead of listing each observation (patient) as one observation, you may also list the categorical levels and the number of patients within each combination, as shown below. In the PROC FREQ section, you must specify a WEIGHT stetement which tells SAS which variable contains the counts.

DATA cancer;
 INPUT treatmnt $ outcome $ count;
 DATALINES;
s s 21
s f 2
r s 15
r f 3
;
PROC FORMAT;
 VALUE $trtfmt 's'='Surgery' 'r'='Radiation';
 VALUE $resfmt 's'='Success' 'f'='Failure';
PROC FREQ DATA=cancer;
 TABLES treatmnt*outcome;
 WEIGHT count;
 FORMAT treatmnt $trtfmt. outcome $resfmt.;
run;

Both of these programs will produce the table below.

TABLE OF TREATMNT BY OUTCOME

TREATMNT     OUTCOME

Frequency |
Percent   |
Row Pct   |
Col Pct   |Failure |Success |  Total
----------+--------+--------+
Radiation |      3 |     15 |     18
          |   7.32 |  36.59 |  43.90
          |  16.67 |  83.33 |
          |  60.00 |  41.67 |
----------+--------+--------+
Surgery   |      2 |     21 |     23
          |   4.88 |  51.22 |  56.10
          |   8.70 |  91.30 |
          |  40.00 |  58.33 |
----------+--------+--------+
Total            5       36       41
             12.20    87.80   100.00             

In the X1*X2 syntax, levels of the X1 variable appear in the rows; X2, columns. The four numbers in each closed cell of the table are numbers defined in the upper left corner. For example, in the top left cell, 3 patients had radiation therapy and did not improve after treatment. 7.32% of all patients (3/41) were failures after radiation therapy. The last two numbers are conditional probabilities, based on the totals for that row and column. Three out of the eighteen patients who had radiation therapy, or 16.67%, were failures. Three of the five failures, or 60.00%, received radiation therapy. The table also shows marginal totals and percentages.

Is there a difference between treatments in terms or recovery rates? Statistically, we could phrase this in terms of conditional probabilities. We want to know if the conditional probability of success, given that a person received surgery, was different from the conditional probability of success, given that a person received radiation therapy. This could also be phrased in terms of independence of rows as columns as follows: If we know the value of the row variable, does that help us predict the value of the column variable? A commonly used test for this situation is Pearson's chi-square test. This may be requested in SAS by replacing the previous TABLES statement with the one below.

TABLES treatmnt*outcome/CHISQ;

This produces the following additional output.

TABLE OF TREATMNT BY OUTCOME

Statistic                     DF     Value        Prob
------------------------------------------------------
Chi-Square                     1     0.599       0.439
Likelihood Ratio Chi-Square    1     0.595       0.441
Continuity Adj. Chi-Square     1     0.086       0.769
Mantel-Haenszel Chi-Square     1     0.585       0.445
Fisher's Exact Test (Left)                       0.895
                    (Right)                      0.381
                    (2-Tail)                     0.638
Phi Coefficient                      0.121
Contingency Coefficient              0.120
Cramer's V                           0.121

Sample Size = 41
WARNING: 50% of the cells have expected counts less than 5.
Chi-Square
may not be a valid test.

In this case, SAS calculates the value of the chi-square test statistic as 0.599, with a p-value of 0.439. Large p-values, such as the one here, indicate that the difference between proportions is not significant. However, SAS prints the warning about the chi-square test shown above. When the sample sizes are small, Fisher's exact test is more appropriate. In this case, the 2-tailed Fisher's exact test also concludes that the difference between the proportions is not significant (p=0.638). The other numbers shown in the table above are defined in SAS documentation and in textbooks such as Agresti's Categorical Data Analysis.

Fisher's exact test is provided automatically with the CHISQ option if the table has 2 rows and 2 columns. For larger tables, you may request Fisher's exact test as follows:

TABLES treatmnt*outcome/EXACT;

WARNING: If the cell sizes are very large, or if there are many rows or columns, this will take a long time to execute!

You may want to suppress some of the numbers printed in the tables, or you may want to add others. You can insert the following options after a slash in the TABLES statement.

Consider the following statement.

TABLES treatmnt*outcome/NOROW NOCOL NOPERCENT EXPECTED CELLCHI2;

This produces the following simplified table.


 TREATMNT        OUTCOME

 Frequency      |
 Expected       |
 Cell Chi-Square|Failure |Success |  Total
 ---------------+--------+--------+
 Radiation      |      3 |     15 |     18
                | 2.1951 | 15.805 |
                | 0.2951 |  0.041 |
 ---------------+--------+--------+
 Surgery        |      2 |     21 |     23
                | 2.8049 | 20.195 |
                |  0.231 | 0.0321 |
 ---------------+--------+--------+
 Total                 5       36       41

Note that the cell chi-squares 0.2951 + 0.041 + 0.231 +0.0321 becomes the overall chi-square 0.599. How does SAS display three or more categorical variables? Consider the following example, also taken from Categorical Data Analysis. In the fall semester of 1973, the applications to graduate school at the University of California at Berkeley were classified by department, gender, and whether the applicant was accepted.

DATA berkeley;
 INPUT dept $ sex $ accepted $ count @@;
 DATALINES;
 A Male Yes 512 A Male No 313 A Female Yes 89  A Female No 19
 B Male Yes 353 B Male No 207 B Female Yes 17  B Female No 8 
 C Male Yes 120 C Male No 205 C Female Yes 202 C Female No 391
 D Male Yes 138 D Male No 279 D Female Yes 131 D Female No 244
 E Male Yes 53  E Male No 138 E Female Yes 94  E Female No 299
 F Male Yes 22  F Male No 351 F Female Yes 24  F Female No 317
;
PROC FREQ DATA=berkeley;
 TABLES sex*accepted;
 WEIGHT count;
RUN;

The following table is produced.

TABLE OF SEX BY ACCEPTED

SEX       ACCEPTED

Frequency|
Percent  |
Row Pct  |
Col Pct  |No      |Yes     |  Total
---------+--------+--------+
Female   |   1278 |    557 |   1835
         |  28.24 |  12.31 |  40.54
         |  69.65 |  30.35 |
         |  46.12 |  31.74 |
---------+--------+--------+
Male     |   1493 |   1198 |   2691
         |  32.99 |  26.47 |  59.46
         |  55.48 |  44.52 |
         |  53.88 |  68.26 |
---------+--------+--------+
Total        2771     1755     4526
            61.22    38.78   100.00

The acceptance rate for females was 30.35%; males 44.52%. Of the 1755 applicants who were accepted, 1198 (68.26%) were male. GENDER DISCRIMINATION AT BERKELEY?! Not necessarily. You can look at the same table of sex by acceptance within each department as follows:

PROC SORT;
 BY dept;
PROC FREQ DATA=berkeley;
 TABLES sex*accepted;
 WEIGHT count;
 BY dept;

You could also do the following:

PROC FREQ DATA=berkeley;
 TABLES dept*sex*accepted;
 WEIGHT count;

Both of these methods will produce one table for each of the six departments. Close examination of these tables reveals that, within each department, acceptance rates were nearly equal for men and women. In fact, the acceptance rate for women was higher in four of the six departments! This is an example of a phenomenon called Simpson's paradox. More men than women applied to departments with high acceptance rates, while more women than men applied to departments with low acceptance rates. Together, these created the false impression that men had an advantage over women when applying to Berkeley.

If you only need a list of the frequencies within each set of categorical values, you may request for a list to be printed as follows:

PROC FREQ DATA=berkeley;
 TABLES dept*sex*accepted/LIST;
 WEIGHT count;

This produces the following:

                                              Cumul.    Cumul.
DEPT  SEX     ACCEPTED   Frequency  Percent   Frequency Percent
---------------------------------------------------------------
A     Female  No                19     0.4           19   0.4
A     Female  Yes               89     2.0          108   2.4
A     Male    No               313     6.9          421   9.3
A     Male    Yes              512    11.3          933  20.6
(lines deleted)
F     Male    Yes               22     0.5         4526 100.0

Missing values in tables

Suppose that a wildlife researcher in the Everglades writes down the species and direction of flight for every bird he sees. He occasionally misses the species of the bird but knows the direction of flight; he may also get disoriented while trying to identify the species. His dataset may look like this:

DATA birds;
INPUT species flight @@;
DATALINES;
1 4  1 4  3 2  3 1  3 1  3 2  3 .  3 1  2 2  . 2  1 4  1 4
2 3  . 2  3 3  1 1  1 2  2 3  3 3  2 1  3 .  1 4  3 2  3 2
2 4  3 4  3 2  3 1  1 1  . 2  3 1  3 4  3 3  2 2  1 4  2 4
2 3  1 2  2 3  . 1  1 4  2 3  1 4  2 1  3 1  1 4  2 2  3 3
;

The corresponding frequency table could be constructed as follows:

PROC FORMAT;
 VALUE specfmt 1='White ibis' 2='Great blue heron' 3='Wood
stork';
 VALUE dirfmt 1='North' 2='East' 3='South' 4='West';
PROC FREQ DATA=birds;
 TABLES species*flight;
 FORMAT species specfmt. flight dirfmt.;
RUN;

The table would show only the observations in which both values were known and put a note on the output that 6 birds had missing data. However, it might be more informative to include the missing values in the table to see if, for example, all of those birds shared some characteristic. "Missing" may be included as one of the possible values of a variable, as shown below.

PROC FORMAT;
 VALUE specfmt 1='White ibis' 2='Great blue heron' 3='Wood
stork' 
  .='Missing';
 VALUE dirfmt 1='North' 2='East' 3='South' 4='West'
.='Missing';
PROC FREQ DATA=birds;
 TABLES species*flight/MISSING;
 FORMAT species specfmt. flight dirfmt.;
RUN;
The output is as follows.
SPECIES           FLIGHT

 Frequency        |
 Percent          |
 Row Pct          |
 Col Pct          |Missing |North   |East    |South   |West    |  Total
 -----------------+--------+--------+--------+--------+--------+
 Missing          |      0 |      1 |      3 |      0 |      0 |      4
                  |   0.00 |   2.08 |   6.25 |   0.00 |   0.00 |   8.33
                  |   0.00 |  25.00 |  75.00 |   0.00 |   0.00 |
                  |   0.00 |   9.09 |  23.08 |   0.00 |   0.00 |
 -----------------+--------+--------+--------+--------+--------+
 White ibis       |      0 |      2 |      2 |      0 |      9 |     13
                  |   0.00 |   4.17 |   4.17 |   0.00 |  18.75 |  27.08
                  |   0.00 |  15.38 |  15.38 |   0.00 |  69.23 |
                  |   0.00 |  18.18 |  15.38 |   0.00 |  69.23 |
 -----------------+--------+--------+--------+--------+--------+
 Great blue heron |      0 |      2 |      3 |      5 |      2 |     12
                  |   0.00 |   4.17 |   6.25 |  10.42 |   4.17 |  25.00
                  |   0.00 |  16.67 |  25.00 |  41.67 |  16.67 |
                  |   0.00 |  18.18 |  23.08 |  55.56 |  15.38 |
 -----------------+--------+--------+--------+--------+--------+
 Woodstork        |      2 |      6 |      5 |      4 |      2 |     19
                  |   4.17 |  12.50 |  10.42 |   8.33 |   4.17 |  39.58
                  |  10.53 |  31.58 |  26.32 |  21.05 |  10.53 |
                  | 100.00 |  54.55 |  38.46 |  44.44 |  15.38 |
 -----------------+--------+--------+--------+--------+--------+
 Total                   2       11       13        9       13       48
                      4.17    22.92    27.08    18.75    27.08   100.00

From this table, we could easily see that both birds with missing directions were wood storks, and that three of the four birds of unidentified species were flying east.

Ordinal data

When working with data with ordered categories, the order in which the levels of the ordered categories are presented in the table is important, since techniques for ordinal data use information from adjoining categories. Suppose that samples of two peach varieties are picked on the same day and evaluated in terms of ripeness. Each peach is graded as green, ripe, overripe, or rotten; these are ordinal categories. Suppose the data are entered into SAS as follows:

DATA peaches;
 INPUT variety $ ripeness $ count @@;
 DATALINES;
 Adel  green 4 Adel  ripe 12 Adel  overripe 3  Adel  rotten 1
 Byron green 1 Byron ripe 8  Byron overripe 10 Byron rotten 1
PROC FREQ DATA=peaches;
 TABLES variety*ripeness/norow nocol nopercent;
 WEIGHT count;
RUN;

SAS prints the following table.

TABLE OF VARIETY BY RIPENESS

VARIETY     RIPENESS

Frequency|green   |overripe|ripe    |rotten  | Total
---------+--------+--------+--------+--------+
Adel     |      4 |      3 |     12 |      1 |     20
---------+--------+--------+--------+--------|
Byron    |      1 |     10 |      8 |      1 |     20
---------+--------+--------+--------+--------+
Total           5       13       20        2       40

The ripeness categories are shown in alphabetic order, not the correct ordering of ripeness. This problem can be solved by redefining the ordinal variables with numbers, using a FORMAT statement, and using the ORDER option in the PROC FREQ statement.

DATA peaches;
 INPUT variety $ ripeness count @@;
 DATALINES;
 Adel  1 4 Adel  2 12 Adel 3 3   Adel  4 1
 Byron 1 1 Byron 2 8  Byron 3 10 Byron 4 1
PROC FORMAT;
 VALUE ripefmt 1='green' 2='ripe' 3='overripe' 4='rotten';
PROC FREQ DATA=peaches ORDER=INTERNAL;
 TABLES variety*ripeness/norow nocol nopercent;
 WEIGHT count;
 FORMAT ripeness ripefmt.;
RUN;

This produces the correct ordering.

VARIETY     RIPENESS

Frequency|green   |ripe    |overripe|rotten  |  Total
---------+--------+--------+--------+--------+
Adel     |      4 |     12 |      3 |      1 |     20
---------+--------+--------+--------+--------+
Byron    |      1 |      8 |     10 |      1 |     20
---------+--------+--------+--------+--------+
Total           5       20       13        2       40

Output datasets

You may need to make a new dataset with counts of objects in cross-classified categories. For example, you may want to plot the frequencies using better graphics than PROC CHART produces, or you may need to obtain cell counts and perform some operation on them, such as adding 0.001 to cells with no observations, before performing an analysis with log-linear models. You can use the OUT= option in the TABLES statement, as shown below.

PROC FREQ DATA=birds;
 TABLES species*flight/OUT=freqout;
PROC PRINT DATA=freqout;
RUN;

These statements produce the following output:

OBS    SPECIES    FLIGHT    COUNT    PERCENT

  1       .          1        1        .
  2       .          2        3        .
  3       1          1        2       4.7619
  4       1          2        2       4.7619
  5       1          4        9      21.4286
  6       2          1        2       4.7619
  7       2          2        3       7.1429
  8       2          3        5      11.9048
  9       2          4        2       4.7619
 10       3          .        2        .
 11       3          1        6      14.2857
 12       3          2        5      11.9048
 13       3          3        4       9.5238
 14       3          4        2       4.7619

Notice that this listing does not include an entry for Species 1, Flight 3 because there were no birds observed in that cell. To force SAS to include the zeroes in the output dataset, use the SPARSE option, as shown below.

TABLES species*flight/OUT=freqout SPARSE;

The dataset FREQOUT will then contain the counts for each combination of levels of species and flight, including the case in which both are missing. The output (partial) looks like:

OBS    SPECIES    FLIGHT    COUNT    PERCENT

1       .          .        0        .    
2       .          1        1        .    
3       .          2        3        .    
4       .          3        0        .    
5       .          4        0        .    
6       1          .        0        .    
7       1          1        2       4.7619
8       1          2        2       4.7619
9       1          3        0       0.0000
10      1          4        9      21.4286
(more --)

Appendix: A simple example for the Simpson's paradox


  Department A                  Department B

           No     Yes                   No     Yes
         -------------                 -------------
  Female | 100 | 100 | 0.50     Female | 700 | 100 | 0.125
         -------------                 -------------
  Male   | 500 | 400 | 0.44     Male   | 100 |   0 | 0.00
         -------------                 -------------

  Overall

           No     Yes                 
         -------------             
  Female | 800 | 200 | 0.20     
         -------------           
  Male   | 600 | 400 | 0.40    
         -------------      


Homework problems for this lesson

Return to STA 5106 home page