PROC CONTENTS provides a useful listing of information about a dataset, such as the names and formats of the variables. For the following example, data were obtained from Foundations of Education, 5th Edition by A. C. Ornstein and D. U. Levine (Houghton Mifflin, Boston, 1993). The variables represent countries, the education expenditures per student in American dollars, gross domestic product per capita in American dollars, and ratio of pupils to teachers.
DATA finances; INPUT @1 country $15. @16 educ_exp comma5. @21 gdp comma6.@27 ptratio; FORMAT educ_exp gdp dollar7.; LABEL country='Country' educ_exp='Expenditures' gdp='Domestic Product' ptratio='Pupil/Teacher'; DATALINES; Canada 4,05417,35516.2 United States 3,39818,29718.4 United Kingdom 2,47412,52915.7 Australia 2,06013,52315.7 ; PROC PRINT DATA=finances; PROC CONTENTS DATA=finances; RUN;
SAS produces the following output.
OBS COUNTRY EDUC_EXP GDP PTRATIO
1 Canada $4,054 $17,355 16.2
2 United States $3,398 $18,297 18.4
3 United Kingdom $2,474 $12,529 15.7
4 Australia $2,060 $13,523 15.7
CONTENTS PROCEDURE
Data Set Name: WORK.FINANCES
Observations: 4
Member Type: DATA
Variables: 4
Engine: V612
Indexes: 0
Created: 9:22 Fri, Dec 28, 2001
Observation Length: 39
Last Modified: 9:22 Fri, Dec 28, 2001
Deleted Observations: 0
Protection:
Compressed: NO
Data Set Type:
Sorted: NO
Label:
-----Engine/Host Dependent Information-----
Data Set Page Size: 8192
Number of Data Set Pages: 1
File Format: 607
First Data Page: 1
Max Obs per Page: 208
Obs in First Data Page: 4
-----Alphabetic List of Variables and
Attributes-----
# Variable Type Len Pos Format Label
1 COUNTRY Char 15 0 Country
2 EDUC_EXP Num 8 15 DOLLAR7. Expenditures
3 GDP Num 8 23 DOLLAR7. Domestic Product
4 PTRATIO Num 8 31 Pupil/Teacher
While manipulating existing datasets to form new ones, you will often need to go through several steps, and it is easy to lose track of the operations you have performed. One way to get your bearings is to print the dataset formed at each step of the process. You could also use PROC CONTENTS if you only need to verify that certain variables are contained in a dataset.
When SAS creates a dataset, it does so by completing one observation before moving on to the next one. Sometimes, you may want to carry the values from one observation to the next. This can be done with RETAIN statements.
Suppose that someone who didn't follow the directions for working with spreadsheets in Lesson 3 gave you the following data on supermarket prices:
| Store | Date | Bananas ($/lb) | Broccoli ($/head) | Blueberries ($/pt) |
| Thrift E Mart | 8/4/97 | .39 | 1.19 | 1.39 |
| 12/18/97 | .49 | 1.09 | 1.99 | |
| Shop n Save | 8/12/97 | .33 | .99 | 1.09 |
| 12/1/97 | .49 | 1.09 | 2.29 | |
| Stop n Spend | 8/13/97 | .49 | 1.09 | 1.59 |
| 12/14/97 | .59 | 1.29 | 2.99 |
Look at the following SAS program.
DATA grocery;
INPUT @1 store $15. @16 date mmddyy8. @25 banana
broccoli bluberry;
DATALINES;
Thrift E Mart 8/4/97 .39 1.19 1.39
12/18/97 .49 1.09 1.99
Shop n Save 8/12/97 .33 .99 1.09
12/1/97 .49 1.09 2.29
Stop n Spend 8/13/97 .49 1.09 1.59
12/14/97 .59 1.29 2.99
;
PROC PRINT DATA=grocery;
RUN;
The following output is produced (or is that produce? Ha!):
OBS STORE DATE BANANA BROCCOLI BLUBERRY
1 Thrift E Mart 13730 0.39 1.19 1.39
2 13866 0.49 1.09 1.99
3 Shop n Save 13738 0.33 0.99 1.09
4 13849 0.49 1.09 2.29
5 Stop n Spend 13739 0.49 1.09 1.59
6 13862 0.59 1.29 2.99
SAS can fill in the missing store names with a RETAIN statement. In the lines of code below, TEMPGROC ("temporary grocery") is a new SAS variable. RETAIN tells SAS to keep the value of TEMPGROC from the previous observation as it moves to the next observation. If the store name is not missing, as in observations 1, 3, and 5, the program assigns that store name to TEMPGROC. Then, if the store name for an observation is missing, as in observations 2, 4, and 6, the store name can be obtained from the previous observation by using TEMPGROC.
DATA grocery; SET grocery; RETAIN tempgroc; IF store^=' ' THEN tempgroc=store; ELSE IF store=' ' THEN store=tempgroc; PROC PRINT;The output is now:
OBS STORE DATE BANANA BROCCOLI BLUBERRY TEMPGROC 1 Thrift E Mart 13730 0.39 1.19 1.39 Thrift E Mart 2 Thrift E Mart 13866 0.49 1.09 1.99 Thrift E Mart 3 Shop n Save 13738 0.33 0.99 1.09 Shop n Save 4 Shop n Save 13849 0.49 1.09 2.29 Shop n Save 5 Stop n Spend 13739 0.49 1.09 1.59 Stop n Spend 6 Stop n Spend 13862 0.59 1.29 2.99 Stop n SpendYou may wish to create a new data set dropping TEMPGROC by DROP.
DATA grocery;SET grocery;DROP tempgroc;
DO loops can be used to create an ordered sequence of numbers. For example, consider the following simple program.
DATA example; DO q=1 TO 5; qtimes2=q*2; qsquared=q**2; OUTPUT; END; PROC PRINT DATA=example; run;
This produces the following output:
OBS Q QTIMES2 QSQUARED 1 1 2 1 2 2 4 4 3 3 6 9 4 4 8 16 5 5 10 25
The program "loops" through the values of Q from 1 to 5 and performs the calculations requested for the current value of Q. The OUTPUT statement tells SAS to export Q and the new variables to the dataset EXAMPLE. The END statement signifies the end of the loop. An END statement is necessary for each DO statement! Notice that neither INPUT nor DATALINES statements are used.
In the example above, q assumed all of the values from 1 to 5. The following are also acceptable DO statements.
DO q=1 TO 5 BY 2; /* q assumes values 1, 3, & 5 */ DO q=5 TO 1 BY -1; /* q assumes values 5, 4, 3, 2, & 1 */ DO q=3,9; /* q assumes values 3 & 9 */ DO q=1 TO 5, 7, 13; /* q assumes values 1, 2, 3, 4, 5, 7, & 13 */ DO q='a','b','c'; /* q assumes character values a, b, & c */
Nested DO loops can also be used. For example, suppose that you want to use SAS to list all of the cards in a 52-card deck. The following program could be used. The FORMAT statement is necessary; otherwise, VALUE would be limited to three characters (since 'ace,' the first VALUE listed, has three characters) and SUIT would have only six characters ('spades'). Notice that there are two END statements.
DATA deck; FORMAT value $5. suit $8.; DO value='ace','2','3','4','5','6','7','8','9','10', 'jack','queen','king'; DO suit='spades','hearts','clubs','diamonds'; OUTPUT; END; END; PROC PRINT DATA=deck; RUN;The first few outputs are:
1 ace spades
2 ace hearts
3 ace clubs
4 ace diamonds
5 2 spades
6 2 hearts
......
6 jac diamon
......
DO loops are also used with arrays. An array is a temporary holding site for a collection of variables upon which the same operations will be performed. Arrays provide convenient shortcuts in programming.
Suppose that a horticulturist wants to investigate whether environmental conditions affect the color of watermelon flesh. He plants 50 watermelons of the same variety in each of 3 locations on the same date, then harvests one melon from each plant on the same date months later. The responses of interest are the fractions of watermelons falling in these color categories: light pink, pink, salmon, and red. Since some watermelon seeds do not bear fruit, another variable for nonbearing plants should be added. The following DATA step calculates these fractions from the original melon counts.
DATA melons; INPUT location $ ltpink pink salmon red; nomelons=50-(ltpink+pink+salmon+red); ltpink=ltpink/50; pink=pink/50; salmon=salmon/50; red=red/50; nomelons=nomelons/50; DATALINES; North 3 14 16 8 East 8 23 9 2 South 0 4 10 19 ;
Notice the repetition of steps in the program. A more efficient way to accomplish the same task is shown below.
DATA melons; INPUT location $ ltpink pink salmon red; nomelons=50-(ltpink+pink+salmon+red); ARRAY colors (5) ltpink pink salmon red nomelons; DO i=1 TO 5; colors(i)=colors(i)/50; END; DATALINES; North 3 14 16 8 East 8 23 9 2 South 0 4 10 19 ; PROC PRINT DATA=melons; RUN;
This program produces the following output.
OBS LOCATION LTPINK PINK SALMON RED NOMELONS I 1 North 0.06 0.28 0.32 0.16 0.18 6 2 East 0.16 0.46 0.18 0.04 0.16 6 3 South 0.00 0.08 0.20 0.38 0.34 6
The array COLORS is used as a temporary holding site for the five categories. In the order specified in the ARRAY statement, COLORS(1) is LTPINK, COLORS(2) is pink, and so forth. The DO loop tells SAS to perform the same operation to all of the variables represented by COLORS(1), COLORS(2), ..., COLORS(5). The variable I, with the next value in the DO LOOP, is added to the dataset, but no COLORS variables are added.
For long lists of variables, you may want to use the shortcut of specifying ranges of variables with hyphens. If the variables are consecutively numbered, with the number appearing as the last part of the variable name, then you may refer to all of them by listing the first variable name, a hyphen, and the final variable name. Otherwise, list the first variable name, two hyphens, and the last variable name. Two examples of different ways to create the watermelon data are shown below. Notice that the SUM(OF ...) syntax is used to calculate the number of nonbearing plants.
DATA melons; INPUT location $ red1 red2 red3 red4; red5=50-(sum(of red1-red4)); ARRAY colors (5) red1-red5; DO i=1 TO 5; colors(i)=colors(i)/50; END; DATALINES; North 3 14 16 8 East 8 23 9 2 South 0 4 10 19 ; DATA melons; INPUT location $ ltpink pink salmon red; nomelons=50-(sum(of ltpink--red)); ARRAY colors (5) ltpink--nomelons; DO i=1 TO 5; colors(i)=colors(i)/50; END; DATALINES; North 3 14 16 8 East 8 23 9 2 South 0 4 10 19 ;
You will probably encounter data in which several measurements are taken on the same observation. For example, you may record the age, height, and weight of a person, or you may record the height of a tree at monthly intervals. When using SAS, you may wonder whether to write all of this information on one line of data (in so-called multivariate format) or to list each measurement in its own line of data (in univariate format). This depends on the situation. For example, when analyzing repeated-measures data such as the heights of the trees, the REPEATED command in PROC GLM requires the data to be in multivariate format. However, when using PROC MIXED for the same measurements, they must be in univariate format. Fortunately, you can transform each format into the other using SAS.
Consider the following example. Suppose that a statistics instructor uses SAS to keep track of his students' weekly homework grades. He may enter them in multivariate format as follows:
DATA grades1; INPUT name $ hw1-hw6; DATALINES; AMY 6 8 4 8 7 8 Bob 3 5 5 7 5 5 Carol 7 8 8 7 8 9 Dave 10 9 9 8 10 9 Eve 6 6 8 6 9 7 ;
Suppose that the teacher wanted to use the correlation of grades with time for each student as a measure of improvement during the semester. To do this, the data would need to be rearranged into univariate format, with each line of data containing the name of the student, the week number, and the grade. One way to do this is shown below.
DATA grades2; SET grades1; week=1; grade=hw1; OUTPUT; week=2; grade=hw2; OUTPUT; week=3; grade=hw3; OUTPUT; week=4; grade=hw4; OUTPUT; week=5; grade=hw5; OUTPUT; week=6; grade=hw6; OUTPUT; KEEP name week grade; PROC PRINT DATA=grades2;The output is as follows. We can use it to find the correlation between week and grades (in lesson 10).
OBS NAME WEEK GRADE
1 Amy 1 6
2 Amy 2 8
3 Amy 3 4
4 Amy 4 8
5 Amy 5 7
6 Amy 6 8
7 Bob 1 3
8 Bob 2 5
9 Bob 3 5
10 Bob 4 7
11 Bob 5 5
12 Bob 6 5
13 Carol 1 7
............................
Don't forget about arrays! This code does the same thing.
DATA grades2; SET grades1; ARRAY homework (6) hw1-hw6; DO i=1 TO 6; week=i; grade=homework(i); OUTPUT; END; KEEP name week grade;
PROC TRANSPOSE may also be used to convert multivariate format to univariate format. The data must first be sorted by the variables which identify the subject. Then, PROC TRANSPOSE can be used to make a new dataset. Refer to the example below.
PROC SORT DATA=grades1; BY name; PROC TRANSPOSE DATA=grades1 OUT=grades2; BY name; VAR hw1-hw6; PROC PRINT DATA=grades2; RUN;
This produces the following output.
OBS NAME _NAME_ COL1 1 Amy HW1 6 2 Amy HW2 8 3 Amy HW3 4 4 Amy HW4 8 5 Amy HW5 7 6 Amy HW6 8 (lines deleted) 25 Eve HW1 6 26 Eve HW2 6 27 Eve HW3 8 28 Eve HW4 6 29 Eve HW5 9 30 Eve HW6 7
COL1, or Column 1, contains the transposed grades. NAME is the original student name furnished in the DATA step, and _NAME_ shows the variable from which each grade was constructed. The following code shows how to rename COL1, discard _NAME_, and how to attach the week number to each grade.
DATA grades2; SET grades2(DROP=_name_ RENAME=(col1=grade)); BY name; RETAIN week; IF first.name=1 THEN week=0; week=week+1; PROC PRINT;The output is now:
OBS NAME WEEK GRADE
1 Amy 1 6
2 Amy 2 8
3 Amy 3 4
4 Amy 4 8
5 Amy 5 7
6 Amy 6 8
7 Bob 1 3
8 Bob 2 5
...........................
Note that first.name means the first data in the name after the data is sorted by name.
PROC TRANSPOSE must be used to transfer from univariate format to multivariate format. The dataset GRADES2 created in the step above contains the variables NAME, GRADE, and WEEK, with 30 observations (5 students times 6 weeks). To transpose it back to its original format, the following statements could be used. The PREFIX option replaces the SAS default (COL) with HW.
PROC SORT DATA=grades2; BY name week; PROC TRANSPOSE DATA=grades2 OUT=grades3 PREFIX=hw; VAR grade; BY name;
GRADES3 is identical to the original dataset GRADES1, except that GRADES3 contains the variable _NAME_ with the value "GRADE" for all 6 observations. The output of grades3 is as follows:
NAME _NAME_ HW1 HW2 HW3 HW4 HW5 HW6 Amy GRADE 6 8 4 8 7 8 Bob GRADE 3 5 5 7 5 5 Carol GRADE 7 8 8 7 8 9 Dave GRADE 10 9 9 8 10 9 Eve GRADE 6 6 8 6 9 7