Lesson 8: Plots and Charts

Scatterplots

When analyzing data, the ability to make plots is essential. Plots assist with tasks such as simple visualization and model checking.

SAS can produce both low-resolution and high-resolution plots. However, high-resolution graphics must be installed in addition to the basic SAS program, so they may or may not be available on your system. In addition, the SAS/GRAPH package has its own language, which may not be easy to learn, and its commands may depend on your computer configuration. If you need to make high-quality plots for a presentation, you will find it much easier to use another package (such as Microsoft Excel or S-PLUS).

You can make medium-quality plots using the SAS/INSIGHT package. Like GRAPH, this may or may not be installed on your version of SAS. INSIGHT is a package that can create commonly used plots and perform both common and specialized statistical analyses. It uses a point-and-click interface with pull-down menus, and it is very easy to learn. If INSIGHT is installed on your version of SAS, you can use it by first running a SAS program to create a dataset (e.g. MYDATA) and then submitting the following commands:

PROC INSIGHT DATA=mydata;
RUN;

So, if INSIGHT is so easy, was it necessary to learn SAS programming? Don't worry; you haven't wasted your time! INSIGHT does not provide some types of analyses, such as nonparametric and mixed-model analysis of variance, that are available through programming. Also, you need to know how to create a dataset in order to use PROC INSIGHT.

The low-resolution plots in SAS are adequate for simple visualization and model checking. PROC PLOT is the basic SAS procedure for producing two-way scatterplots of X and Y variables. It can be modified to show levels of a third variable Z, to plot two sets of data on the same axes, and to put multiple plots on one page.

For an example, consider the following data from page 262 of Applied Multivariate Statistical Analysis, Second Edition, by Richard Johnson and Dean Wichern (Englewood Cliffs, NJ: Prentice Hall, 1988). The variables refer to the dimensions of the carapaces (shells) of painted turtles, measured in millimeters.

DATA females;
 INPUT length width height @@;
 DATALINES;
 98 81 38 103 84 38 103 86 42 105 86 42 109 88 44
123 92 50 123 95 46 133 99 51 133 102 51 133 102 51
134 100 48 136 102 49 138 98 51 138 99 51 141 105 53
147 108 57 149 107 55 153 107 56 155 115 63 155 117 60
158 115 62 159 118 63 162 124 61 177 132 67
;
PROC PLOT DATA=females;
 PLOT width*length;
RUN;

In the PLOT statement, the variables are listed in the order Y*X, or (vertical axis variable)*(horizontal axis variable). The resulting plot is shown below.

   Plot of WIDTH*LENGTH.  Legend: A = 1 obs, B = 2 obs, etc.
  140 +
      |
      |                                                    A
      |
      |                                             A
  120 +                                           A
      |                                        B A
WIDTH |
      |                                     AA A
      |                             B A  A
  100 +                             AA B
      |                        A
      |                        A
      |             AA A
      |             A
   80 +          A
      +-----------+----------+----------+----------+----------+-
      80         100        120        140        160       180

                     LENGTH

As expected, there is a positive relationship between the two dimensions. When two pairs of (X, Y) coordinates are too close to each other to be represented individually, SAS indicates that pair with a B (the second letter of the alphabet). Likewise, three closely-spaced points are represented with a C, and so forth. To substitute a single plotting symbol, such as O, for each data point, modify the PLOT statement as follows:

PROC PLOT DATA=females;
 PLOT width*length='O';
RUN;

SAS then prints this message at the top of the page

Symbol used is 'O'.

followed by the plot, with O's plotted instead of A's, B's, etc. This message appears at the bottom:

NOTE: 3 obs hidden.

Different plot symbols are useful on overlaid plots, in which two or more variables are plotted on one vertical axis with a common horizontal axis. An example appears below.

PROC PLOT DATA=females;
 PLOT width*length='W' height*length='H'/OVERLAY;
RUN;

This produces the following plot.

          Plot of WIDTH*LENGTH.   Symbol used is 'W'.
          Plot of HEIGHT*LENGTH.  Symbol used is 'H'.

WIDTH |
  150 +
      |                                                     W
      |                                         W W W
  100 +                        W    WWWW W  WW W
      |          W  WW W
      |                                         H H H       H
   50 +             HH H       H    HHHH H  HH H
      |          H  H
      |
    0 ++----------+----------+----------+----------+----------+-
      80         100        120        140        160       180

                                LENGTH

To change the labels on the horizontal axis, use the HAXIS option after a slash in the PLOT statement. Likewise, VAXIS can be used to change the vertical axis. Reference lines can be added with HREF and VREF. The following statements tell SAS to make the horizontal axis extend from 90 to 180, with tick marks at 10-mm intervals. Likewise, the vertical axis is to extend from 80 to 140, with a tick mark every 10 mm. A vertical line is to be drawn at the point where X (length) is 150; a horizontal line, at Y (width) = 100.

PROC PLOT DATA=females;
 PLOT width*length/HAXIS=90 TO 180 BY 10
  VAXIS=80 TO 140 BY 10 HREF=150 VREF=100;
RUN;

The following plot is produced.

WIDTH |                                     |
      |                                     |
  140 +                                     |
      |                                     |
  130 +                                     |               A
      |                                     |      A
  120 +                                     |    A
      |                                     |  B A
  110 +                                   A |
      |                                A   A| A
  100 +---------------------------D-AB------+-------------------
      |                     A               |
   90 +            A        A               |
      |         BA                          |
   80 +      A                              |
      |                                     |
      +-+-----+-----+-----+-----+-----+-----+-----+-----+-----+-
       90    100   110   120   130   140   150   160   170   180

                    LENGTH

You may also control the size of the plot with the LINESIZE and PAGESIZE options. The plots above were produced by submitting this statement before reading and plotting the data:

OPTIONS LINESIZE=64 PAGESIZE=24;

There are several ways to create plots to show differences among groups. For example, data were also collected on 24 male turtles. The following SAS code shows one way to put the male and female turtle data into one large dataset, with an identifier for the gender.

DATA males;
 INPUT length width height @@;
 DATALINES;
 93 74 37 94 78 35 96 80 35 101 84 39 102 85 38 103 81 37
104 83 39 106 83 39 107 82 38 112 89 40 113 88 40 114 86 40
116 90 43 117 90 41 117 91 41 119 93 41 120 89 40 120 93 44
121 95 42 125 93 45 127 96 45 128 95 45 131 95 46 135 106 47
;
DATA all;
 SET females(IN=f) males(IN=m);
 IF f=1 THEN gender='Female';
 IF m=1 THEN gender='Male  ';
RUN;

A simple plot to compare the distributions of lengths of the two genders could be generated as follows:

PROC PLOT DATA=all;
 PLOT length*gender;
RUN;

This produces the following plot:

   Plot of LENGTH*GENDER.  Legend: A = 1 obs, B = 2 obs, etc.

          200 +
              |
              |  A
              |  C
          150 +  E
              |  H                                    A
       LENGTH |  B                                    H
              |  A                                    G
          100 +  D                                    G
              |                                       A
              |
              |
           50 +
              +--+------------------------------------+--
              Female                                Male
                                 GENDER

The following lines of code would produce two separate plots, on two pages, for males and females.

/* Remember to sort BY groups before analyzing BY groups! */
PROC SORT DATA=all;
 BY gender;
PROC PLOT DATA=all;
 PLOT width*length;
 BY gender;
RUN;

These plots could not be used to fairly compare females to males, since the female horizontal axis spans from 90 to 180 and the male axis spans only from 90 to 140. To force both plots to have the same axes, add UNIFORM in the PROC PLOT statement.

PROC PLOT DATA=all UNIFORM;
 PLOT width*length;
 BY gender;
RUN;

You may also show both genders on one graph, labeling females with F and males with M, as shown below. SAS uses only the first character of a character variable for labels in PROC PLOT; for example, you would not be able to distinguish Turtles from Terrapins (a kind of edible North American turtle).

PROC PLOT DATA=all;
 PLOT width*length=gender;
RUN;

This produces the following plot.

WIDTH |
      |
  140 +
      |                                                     F
      |                                             F
  120 +                                         F F
      |                                         F F
      |                              M   F  FF F
  100 +                             FFFF
      |                    MMMMFMM M
      |            MFF F MM  M
   80 +        MMF  MMM
      |       M
      |
   60 +
      |
      ++----------+----------+----------+----------+----------+-
      80         100        120        140        160       180

                                LENGTH
NOTE: 11 obs hidden.
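
When two group labels begin with the same letter, one possible workaround is to build a separate one-character plotting variable and use it in place of the original. The sketch below applies the idea to the turtle data. The dataset name marked and the variable sym are hypothetical names introduced only for illustration, and the symbols + and o are arbitrary choices.

DATA marked;
 SET all;
 LENGTH sym $ 1;                /* one-character plotting symbol */
 IF gender='Female' THEN sym='+';
 ELSE sym='o';
RUN;
PROC PLOT DATA=marked;
 PLOT width*length=sym;         /* each point is drawn with the value of sym */
RUN;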

Next, you may want to look at the relationships between all pairs of the three numeric variables. You can ask SAS to produce several plots in succession by listing them in one PLOT statement, as shown below.

PROC PLOT DATA=all;
 PLOT width*length width*height height*length;
RUN;

This will print three pages of plots. You can put them on the same page by using HPERCENT and VPERCENT. HPERCENT specifies what percentage of the page width should be allocated to each plot. VPERCENT does the same for the page height. The following lines of code tell SAS to produce three side-by-side plots on the same page (33% = 1/3).

PROC PLOT DATA=all HPERCENT=33 33 33;
 PLOT width*length width*height height*length;
RUN;

The result isn't too informative.

   WIDTH*LENGTH.         WIDTH*HEIGHT.        HEIGHT*LENGTH.

WIDTH |               WIDTH |               HEIGHT |
      |                     |                      |
  140 +                 140 +                   70 +
      |         A           |       A              |        A
      |        A            |      A               |       C
  120 +        B        120 +      B            60 +       B
      |        B            |      B               |       C
      |       DA            |    ACA               |      A
  100 +       G         100 +    BE             50 +      G
      |      KA             |   EFA                |      G
      |     FD              |   IA                 |     F
   80 +     G            80 +  BE               40 +     J
      |     A               |   A                  |    DD
      |                     |                      |
   60 +                  60 +                   30 +
      |                     |                      |
      ++----+----+          ++--+--+--+            +--------+
       0   100 200          20 40 60 80             0      200

By using both HPERCENT and VPERCENT, a page can be divided both horizontally and vertically, as shown below.

PROC PLOT DATA=all HPERCENT=50 50 VPERCENT=50 50; 
 PLOT width*length width*height height*length;
RUN;

 A=1, B=2, etc. WIDTH*LENGTH.      A=1, B=2, etc. WIDTH*HEIGHT.

WIDTH |                           WIDTH |
  140 +                  A          140 +                A
  120 +               DA            120 +              BC
  100 +         CEDHAC              100 +       CCFBFBB
   80 +      CFECA                   80 +     BFGC
      |                                 |
      ++------+------+------+-          ++------+------+------+-
      50     100    150    200          20     40     60     80
               LENGTH                            HEIGHT


A=1, B=2, etc. HEIGHT*LENGTH.

HEIGHT |
    80 +
    60 +          A EACDA A
    40 +      CFEFEDC
    20 +
       |
       ++------+------+------+
       50     100    150   200
                LENGTH

Charts

In SAS, PROC CHART can be used to create bar graphs to illustrate categorical or numeric variables. For the following examples, add an ordinal categorical variable to the turtle data as follows:

DATA all;
 SET all;
 IF length<=120 THEN size=1;
 ELSE IF 120<length<=140 THEN size=2;
 ELSE IF length>140 THEN size=3;
RUN;

Use PROC FORMAT to define labels for the size categories so that the charts are easier to read:

PROC FORMAT;
 VALUE sizefmt 1='Small' 2='Medium' 3='Large';
RUN;

Then, the following statements will produce a vertical bar (VBAR) chart showing the percentage of turtles in each of the three size categories. The MIDPOINTS= option is used to specify that 1, 2, and 3 are ordered categories and not continuous measurements.

PROC CHART DATA=all;
 VBAR size/TYPE=PERCENT MIDPOINTS=1 2 3;
 FORMAT size sizefmt.;
RUN;

This produces the following chart.

Percentage

   |       *****
   |       *****
40 +       *****
   |       *****
   |       *****       *****
   |       *****       *****
20 +       *****       *****       *****
   |       *****       *****       *****
   |       *****       *****       *****
   |       *****       *****       *****
   +-------------------------------------------
           Small      Medium       Large
                   SIZE Midpoint
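
As an alternative to listing the midpoints, the DISCRETE option asks PROC CHART to treat every distinct value of a numeric variable as its own category. A minimal sketch, assuming the all dataset and the sizefmt format defined above:

PROC CHART DATA=all;
 VBAR size/TYPE=PERCENT DISCRETE;   /* one bar per distinct value of size */
 FORMAT size sizefmt.;
RUN;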

If you change VBAR to HBAR in the statements above, SAS produces the following horizontal bar graph, with information on frequency, cumulative frequency, percentage, and cumulative percentage.


SIZE                                      Cum.              Cum.
Midpoint                            Freq  Freq  Percent  Percent
         |
Small    |************************    23    23    47.92    47.92
         |
Medium   |****************            15    38    31.25    79.17
         |
Large    |**********                  10    48    20.83   100.00
         |
         -----+----+----+----+----
                  10   20   30   40

                Percentage
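
For reference, the horizontal chart above corresponds to statements along these lines, assuming the all dataset and the sizefmt format defined earlier:

PROC CHART DATA=all;
 HBAR size/TYPE=PERCENT MIDPOINTS=1 2 3;   /* HBAR also prints the frequency table */
 FORMAT size sizefmt.;
RUN;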

Differences between male and female turtles can be shown as follows:

PROC CHART DATA=all;
 VBAR size/TYPE=PERCENT MIDPOINTS=1 2 3 SUBGROUP=gender;
 FORMAT size sizefmt.;
RUN;

Percentage

50 +       MMMMM
   |       MMMMM
40 +       MMMMM
   |       MMMMM       MMMMM
30 +       MMMMM       MMMMM
   |       MMMMM       MMMMM
20 +       MMMMM       FFFFF       FFFFF
   |       MMMMM       FFFFF       FFFFF
10 +       FFFFF       FFFFF       FFFFF
   |       FFFFF       FFFFF       FFFFF
   +-------------------------------------------
           Small      Medium       Large
                   SIZE Midpoint

Symbol GENDER     Symbol GENDER
   F   Female        M   Male

Side-by-side comparisons can be made by using GROUP instead of SUBGROUP in the statements above.

Percentage

   |                         *****
   |                         *****
   |                         *****
30 +                         *****
   |                         *****
   |                         *****
   |                         *****
20 +         *****  *****    *****
   |         *****  *****    *****
   |         *****  *****    *****
   |         *****  *****    *****  *****
10 +  *****  *****  *****    *****  *****
   |  *****  *****  *****    *****  *****
   |  *****  *****  *****    *****  *****
   |  *****  *****  *****    *****  *****
   +------------------------------------------------------------
      Small Medium  Large    Small Medium  Large   SIZE Midpoint

      +----- Female -----+   +------ Male ------+  GENDER
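
The side-by-side chart above corresponds to statements along these lines, assuming the same dataset and format as before. The only change from the SUBGROUP example is the option name.

PROC CHART DATA=all;
 VBAR size/TYPE=PERCENT MIDPOINTS=1 2 3 GROUP=gender;   /* one set of bars per gender */
 FORMAT size sizefmt.;
RUN;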

PROC CHART can also produce block (3-dimensional) charts and pie charts. A pie chart is requested by changing VBAR to PIE in the statements above, but the result may look terrible. The statements

PROC CHART DATA=all;
 PIE size/TYPE=PERCENT MIDPOINTS=1 2 3;
 FORMAT size sizefmt.;
RUN;

produces

                       Small
                    *********
                ****         ****
              **                 **
             *                     *
           **           23          **
          **          47.92%         **
          *                           *
         *                             *
         * .. . ..                     *
         *         . .  +  . . .. . .. *
         *                             *
         *               .             *
          *       15     .    10      *
          **    31.25%    . 20.83%   **
           **             .         **
      Medium *             .       *  Large
              **           .     **
                ****       . ****
                    *********
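
A block chart can be requested in a similar way with the BLOCK statement. The following is only a sketch, assuming the same dataset and format; how the output looks depends on the LINESIZE and PAGESIZE settings in effect.

PROC CHART DATA=all;
 BLOCK size/TYPE=PERCENT MIDPOINTS=1 2 3;   /* one block per size category */
 FORMAT size sizefmt.;
RUN;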
   

