Lesson 6: Modifying Datasets (Part 2)

PROC CONTENTS

PROC CONTENTS provides a useful listing of information about a dataset, such as the names and formats of the variables. For the following example, data were obtained from Foundations of Education, 5th Edition by A. C. Ornstein and D. U. Levine (Houghton Mifflin, Boston, 1993). The variables represent countries, the education expenditures per student in American dollars, gross domestic product per capita in American dollars, and ratio of pupils to teachers.

DATA finances;
 INPUT @1 country $15. @16 educ_exp comma5.
 @21 gdp comma6.@27 ptratio;
 FORMAT educ_exp gdp dollar7.;
 LABEL country='Country'
  educ_exp='Expenditures'
  gdp='Domestic Product'
  ptratio='Pupil/Teacher';
 DATALINES;
Canada         4,05417,35516.2
United States  3,39818,29718.4
United Kingdom 2,47412,52915.7
Australia      2,06013,52315.7
;
PROC PRINT DATA=finances;
PROC CONTENTS DATA=finances;
RUN;

SAS produces the following output.

OBS    COUNTRY           EDUC_EXP       GDP    PTRATIO

 1     Canada             $4,054    $17,355      16.2
 2     United States      $3,398    $18,297      18.4
 3     United Kingdom     $2,474    $12,529      15.7
 4     Australia          $2,060    $13,523      15.7

CONTENTS PROCEDURE

Data Set Name: WORK.FINANCES           
Observations:         4
Member Type:          DATA                    
Variables:            4
Engine:               V612                    
Indexes:              0
Created:              9:22 Fri, Dec 28, 2001  
Observation Length:   39
Last Modified:        9:22 Fri, Dec 28, 2001  
Deleted Observations: 0
Protection:                            
Compressed:           NO
Data Set Type:                         
Sorted:               NO
Label:

-----Engine/Host Dependent Information-----

Data Set Page Size:       8192
Number of Data Set Pages: 1
File Format:              607
First Data Page:          1
Max Obs per Page:         208
Obs in First Data Page:   4

    -----Alphabetic List of Variables and
Attributes-----

#   Variable   Type   Len   Pos   Format    Label
1   COUNTRY    Char    15     0             Country
2   EDUC_EXP   Num      8    15   DOLLAR7.  Expenditures
3   GDP        Num      8    23   DOLLAR7.  Domestic Product
4   PTRATIO    Num      8    31             Pupil/Teacher

While manipulating existing datasets to form new ones, you will often need to go through several steps, and it is easy to lose track of the operations you have performed. One way to get your bearings is to print the dataset formed at each step of the process. You could also use PROC CONTENTS if you only need to verify that certain variables are contained in a dataset.

Retaining values for variables

When SAS creates a dataset, it does so by completing one observation before moving on to the next one. Sometimes, you may want to carry the values from one observation to the next. This can be done with RETAIN statements.

Suppose that someone who didn't follow the directions for working with spreadsheets in Lesson 3 gave you the following data on supermarket prices:
Store Date Bananas ($/lb) Broccoli ($/head) Blueberries ($/pt)
Thrift E Mart 8/4/97 .39 1.19 1.39
12/18/97 .49 1.09 1.99
Shop n Save 8/12/97 .33 .99 1.09
12/1/97 .49 1.09 2.29
Stop n Spend 8/13/97 .49 1.09 1.59
12/14/97 .59 1.29 2.99

Look at the following SAS program.

DATA grocery;
 INPUT @1 store $15. @16 date mmddyy8. @25 banana 
  broccoli bluberry;
 DATALINES;
Thrift E Mart  8/4/97   .39 1.19 1.39
               12/18/97 .49 1.09 1.99
Shop n Save    8/12/97  .33  .99 1.09
               12/1/97  .49 1.09 2.29
Stop n Spend   8/13/97  .49 1.09 1.59
               12/14/97 .59 1.29 2.99
;
PROC PRINT DATA=grocery;
RUN;

The following output is produced (or is that produce? Ha!):

    OBS        STORE         DATE    BANANA    BROCCOLI    BLUBERRY

     1     Thrift E Mart    13730     0.39       1.19        1.39  
     2                      13866     0.49       1.09        1.99  
     3     Shop n Save      13738     0.33       0.99        1.09  
     4                      13849     0.49       1.09        2.29  
     5     Stop n Spend     13739     0.49       1.09        1.59  
     6                      13862     0.59       1.29        2.99 

SAS can fill in the missing store names with a RETAIN statement. In the lines of code below, TEMPGROC ("temporary grocery") is a new SAS variable. RETAIN tells SAS to keep the value of TEMPGROC from the previous observation as it moves to the next observation. If the store name is not missing, as in observations 1, 3, and 5, the program assigns that store name to TEMPGROC. Then, if the store name for an observation is missing, as in observations 2, 4, and 6, the store name can be obtained from the previous observation by using TEMPGROC.

DATA grocery;
 SET grocery;
 RETAIN tempgroc;
 IF store^=' ' THEN tempgroc=store;
  ELSE IF store=' ' THEN store=tempgroc;
PROC PRINT;
The output is now:
  OBS      STORE       DATE  BANANA  BROCCOLI  BLUBERRY    TEMPGROC

   1   Thrift E Mart  13730   0.39     1.19      1.39    Thrift E Mart
   2   Thrift E Mart  13866   0.49     1.09      1.99    Thrift E Mart
   3   Shop n Save    13738   0.33     0.99      1.09    Shop n Save  
   4   Shop n Save    13849   0.49     1.09      2.29    Shop n Save  
   5   Stop n Spend   13739   0.49     1.09      1.59    Stop n Spend 
   6   Stop n Spend   13862   0.59     1.29      2.99    Stop n Spend 
You may wish to create a new data set dropping TEMPGROC by DROP.
DATA grocery;SET grocery;DROP tempgroc;

DO loops

DO loops can be used to create an ordered sequence of numbers. For example, consider the following simple program.

DATA example;
 DO q=1 TO 5;
  qtimes2=q*2;
  qsquared=q**2;
  OUTPUT;
 END;
PROC PRINT DATA=example;
run;

This produces the following output:

OBS    Q    QTIMES2    QSQUARED

 1     1        2          1
 2     2        4          4
 3     3        6          9
 4     4        8         16
 5     5       10         25

The program "loops" through the values of Q from 1 to 5 and performs the calculations requested for the current value of Q. The OUTPUT statement tells SAS to export Q and the new variables to the dataset EXAMPLE. The END statement signifies the end of the loop. An END statement is necessary for each DO statement! Notice that neither INPUT nor DATALINES statements are used.

In the example above, q assumed all of the values from 1 to 5. The following are also acceptable DO statements.

DO q=1 TO 5 BY 2;   /* q assumes values 1, 3, & 5 */ 
DO q=5 TO 1 BY -1;  /* q assumes values 5, 4, 3, 2, & 1 */ 
DO q=3,9;           /* q assumes values 3 & 9 */
DO q=1 TO 5, 7, 13; /* q assumes values 1, 2, 3, 4, 5, 7, & 13
*/ 
DO q='a','b','c';   /* q assumes character values a, b, & c */

Nested DO loops can also be used. For example, suppose that you want to use SAS to list all of the cards in a 52-card deck. The following program could be used. The FORMAT statement is necessary; otherwise, VALUE would be limited to three characters (since 'ace,' the first VALUE listed, has three characters) and SUIT would have only six characters ('spades'). Notice that there are two END statements.

DATA deck;
 FORMAT value $5. suit $8.;
 DO  value='ace','2','3','4','5','6','7','8','9','10',
 'jack','queen','king';
  DO suit='spades','hearts','clubs','diamonds';
   OUTPUT;
  END;
 END;
PROC PRINT DATA=deck;
RUN;
The first few outputs are:
       1    ace      spades  
       2    ace      hearts  
       3    ace      clubs   
       4    ace      diamonds
       5    2        spades  
       6    2        hearts 
       ......
       6    jac      diamon
       ......

Arrays

DO loops are also used with arrays. An array is a temporary holding site for a collection of variables upon which the same operations will be performed. Arrays provide convenient shortcuts in programming.

Suppose that a horticulturist wants to investigate whether environmental conditions affect the color of watermelon flesh. He plants 50 watermelons of the same variety in each of 3 locations on the same date, then harvests one melon from each plant on the same date months later. The responses of interest are the fractions of watermelons falling in these color categories: light pink, pink, salmon, and red. Since some watermelon seeds do not bear fruit, another variable for nonbearing plants should be added. The following DATA step calculates these fractions from the original melon counts.

DATA melons;
 INPUT location $ ltpink pink salmon red;  
 nomelons=50-(ltpink+pink+salmon+red);
 ltpink=ltpink/50;
 pink=pink/50;
 salmon=salmon/50;
 red=red/50;
 nomelons=nomelons/50;
 DATALINES;
North 3 14 16 8 
East 8 23 9 2 
South 0 4 10 19 
;

Notice the repetition of steps in the program. A more efficient way to accomplish the same task is shown below.

DATA melons;
 INPUT location $ ltpink pink salmon red;  
 nomelons=50-(ltpink+pink+salmon+red);
 ARRAY colors (5) ltpink pink salmon red nomelons;
 DO i=1 TO 5;
  colors(i)=colors(i)/50;
 END;
 DATALINES;
North 3 14 16 8 
East 8 23 9 2 
South 0 4 10 19
;
PROC PRINT DATA=melons;
RUN;

This program produces the following output.

OBS    LOCATION LTPINK PINK SALMON  RED  NOMELONS  I

 1      North    0.06  0.28  0.32  0.16   0.18     6
 2      East     0.16  0.46  0.18  0.04   0.16     6
 3      South    0.00  0.08  0.20  0.38   0.34     6

The array COLORS is used as a temporary holding site for the five categories. In the order specified in the ARRAY statement, COLORS(1) is LTPINK, COLORS(2) is pink, and so forth. The DO loop tells SAS to perform the same operation to all of the variables represented by COLORS(1), COLORS(2), ..., COLORS(5). The variable I, with the next value in the DO LOOP, is added to the dataset, but no COLORS variables are added.

For long lists of variables, you may want to use the shortcut of specifying ranges of variables with hyphens. If the variables are consecutively numbered, with the number appearing as the last part of the variable name, then you may refer to all of them by listing the first variable name, a hyphen, and the final variable name. Otherwise, list the first variable name, two hyphens, and the last variable name. Two examples of different ways to create the watermelon data are shown below. Notice that the SUM(OF ...) syntax is used to calculate the number of nonbearing plants.

DATA melons;
 INPUT location $ red1 red2 red3 red4;  
 red5=50-(sum(of red1-red4));
 ARRAY colors (5) red1-red5;
 DO i=1 TO 5;
  colors(i)=colors(i)/50;
 END;
 DATALINES;
North 3 14 16 8 
East 8 23 9 2 
South 0 4 10 19 
;

DATA melons;
 INPUT location $ ltpink pink salmon red;  
 nomelons=50-(sum(of ltpink--red));
 ARRAY colors (5) ltpink--nomelons;
 DO i=1 TO 5;
  colors(i)=colors(i)/50;
 END;
 DATALINES;
North 3 14 16 8 
East 8 23 9 2 
South 0 4 10 19
;

Transposing data

You will probably encounter data in which several measurements are taken on the same observation. For example, you may record the age, height, and weight of a person, or you may record the height of a tree at monthly intervals. When using SAS, you may wonder whether to write all of this information on one line of data (in so-called multivariate format) or to list each measurement in its own line of data (in univariate format). This depends on the situation. For example, when analyzing repeated-measures data such as the heights of the trees, the REPEATED command in PROC GLM requires the data to be in multivariate format. However, when using PROC MIXED for the same measurements, they must be in univariate format. Fortunately, you can transform each format into the other using SAS.

Consider the following example. Suppose that a statistics instructor uses SAS to keep track of his students' weekly homework grades. He may enter them in multivariate format as follows:

DATA grades1;
 INPUT name $ hw1-hw6;
 DATALINES;
 AMY    6  8  4  8  7  8
 Bob    3  5  5  7  5  5
 Carol  7  8  8  7  8  9
 Dave  10  9  9  8 10  9
 Eve    6  6  8  6  9  7
;

Suppose that the teacher wanted to use the correlation of grades with time for each student as a measure of improvement during the semester. To do this, the data would need to be rearranged into univariate format, with each line of data containing the name of the student, the week number, and the grade. One way to do this is shown below.

DATA grades2;
 SET grades1;
 week=1; grade=hw1; OUTPUT;
 week=2; grade=hw2; OUTPUT;
 week=3; grade=hw3; OUTPUT;
 week=4; grade=hw4; OUTPUT;
 week=5; grade=hw5; OUTPUT;
 week=6; grade=hw6; OUTPUT;
 KEEP name week grade;
PROC PRINT DATA=grades2;
The output is as follows. We can use it to find the correlation between week and grades (in lesson 10).

                     OBS    NAME     WEEK    GRADE

                       1    Amy        1        6 
                       2    Amy        2        8 
                       3    Amy        3        4 
                       4    Amy        4        8 
                       5    Amy        5        7 
                       6    Amy        6        8 
                       7    Bob        1        3 
                       8    Bob        2        5 
                       9    Bob        3        5 
                      10    Bob        4        7 
                      11    Bob        5        5 
                      12    Bob        6        5 
                      13    Carol      1        7 
                      ............................

Don't forget about arrays! This code does the same thing.

DATA grades2;
 SET grades1;
 ARRAY homework (6) hw1-hw6;
 DO i=1 TO 6;
 week=i; grade=homework(i); OUTPUT;
 END;
 KEEP name week grade;

PROC TRANSPOSE may also be used to convert multivariate format to univariate format. The data must first be sorted by the variables which identify the subject. Then, PROC TRANSPOSE can be used to make a new dataset. Refer to the example below.

PROC SORT DATA=grades1;
 BY name;
PROC TRANSPOSE DATA=grades1 OUT=grades2;
 BY name;
 VAR hw1-hw6;
PROC PRINT DATA=grades2;
RUN;

This produces the following output.

OBS    NAME     _NAME_    COL1
  1    Amy       HW1        6
  2    Amy       HW2        8
  3    Amy       HW3        4
  4    Amy       HW4        8
  5    Amy       HW5        7
  6    Amy       HW6        8
(lines deleted)
 25    Eve       HW1        6
 26    Eve       HW2        6
 27    Eve       HW3        8
 28    Eve       HW4        6
 29    Eve       HW5        9
 30    Eve       HW6        7           

COL1, or Column 1, contains the transposed grades. NAME is the original student name furnished in the DATA step, and _NAME_ shows the variable from which each grade was constructed. The following code shows how to rename COL1, discard _NAME_, and how to attach the week number to each grade.

DATA grades2;
 SET grades2(DROP=_name_ RENAME=(col1=grade));
 BY name;
 RETAIN week;
 IF first.name=1 THEN week=0;
 week=week+1;
 PROC PRINT;
The output is now:

                     OBS    NAME     WEEK    GRADE

                       1    Amy        1        6 
                       2    Amy        2        8 
                       3    Amy        3        4 
                       4    Amy        4        8 
                       5    Amy        5        7 
                       6    Amy        6        8 
                       7    Bob        1        3 
                       8    Bob        2        5 
                       ...........................
Note that first.name means the first data in the name after the data is sorted by name.

PROC TRANSPOSE must be used to transfer from univariate format to multivariate format. The dataset GRADES2 created in the step above contains the variables NAME, GRADE, and WEEK, with 30 observations (5 students times 6 weeks). To transpose it back to its original format, the following statements could be used. The PREFIX option replaces the SAS default (COL) with HW.

PROC SORT DATA=grades2;
 BY name week;
PROC TRANSPOSE DATA=grades2 OUT=grades3 PREFIX=hw;
 VAR grade;
 BY name;

GRADES3 is identical to the original dataset GRADES1, except that GRADES3 contains the variable _NAME_ with the value "GRADE" for all 6 observations. The output of grades3 is as follows:


NAME     _NAME_    HW1    HW2    HW3    HW4    HW5    HW6

Amy      GRADE       6     8      4      8       7     8 
Bob      GRADE       3     5      5      7       5     5 
Carol    GRADE       7     8      8      7       8     9 
Dave     GRADE      10     9      9      8      10     9 
Eve      GRADE       6     6      8      6       9     7
 


Homework problems for this lesson

Return to STA 5106 home page