SAS always creates some automatic variables whenever a new dataset is created. These variables are not apparent unless you explicitly call for them, but they may be invaluable in some situations. _N_ denotes the observation number. This is useful if you need to assign identification numbers to subjects when no other numbers (Social Security, student ID, driver's license, etc.) are available. _ERROR_ tells whether the data for an observation were processed correctly. FIRST.x and LAST.x variables, when used with a BY statement, tell whether the observations in a group are the first or last variables in that group.
Suppose that a veterinarian likes to schedule cats in the morning and dogs in the afternoon. The following SAS program attaches a sequence number to each pet, tells if the time and date of the appointment were entered correctly, and makes indicators for the first and last pets of each species examined.
DATA pets1; INPUT @1 name $9. @10 time time5. @20 date mmddyy8. @30 species $; mistakes=_error_; DATALINES; Fluffy 9:00 02/13/98 cat Tom 10:00 02/13/98 cat Rex 13:00 02/31/98 dog Fido 14:00 02/13/98 dog Felix 9:30 02/13/98 cat Spot 15:00 02/13/98 dog ; PROC SORT DATA=pets1; BY species time; /* First listed species has priority in sorting than time. */ DATA pets1; SET pets1; BY species; pet_num=_n_; firstgrp=first.species; lastgrp=last.species; PROC PRINT DATA=pets1; VAR pet_num name species time date mistakes firstgrp lastgrp; FORMAT time time5. date mmddyy8.; RUN;
Notice that this program uses the SET command to make modifications to the dataset PETS1. This will overwrite the existing dataset PETS1. The program produces the following output. Both DATE and MISTAKES show that the date entered for REX (February 31, 1998) is incorrect.
M F
P S I I L
E P S R A
T E T S S
_ N C T D A T T
O N A I I A K G G
B U M E M T E R R
S M E S E E S P P
1 1 Fluffy cat 9:00 02/13/98 0 1 0
2 2 Felix cat 9:30 02/13/98 0 0 0
3 3 Tom cat 10:00 02/13/98 0 0 1
4 4 Rex dog 13:00 . 1 1 0
5 5 Fido dog 14:00 02/13/98 0 0 0
6 6 Spot dog 15:00 02/13/98 0 0 1
You will often need to use SAS to calculate new variables based on variables that are in an existing dataset. For example, you may want to convert numbers into different units of measurement, or you may decide to apply a transformation to data when using analysis of variance or regession. Mathematical functions of existing variables can be defined in a DATA step before the DATALINES statement or immediately after the INFILE statement. A common SAS mistake made by beginners is trying to define new variables within PROC steps; instead, this must be done within DATA steps.
Addition, subtraction, multiplication, and division are specified by +, -, *, and /, respectively. For exponentiation, a double asterisk is used. Parentheses can be used to group expressions, and these expressions can be nested within several levels. SAS follows the standard order of operation, or () ** * / + - , for evaluating functions.
For an example, consider the following data on computer prices which were recently published in advertisements in the Gainesville Sun. The original variables represent the computer vendor, the hard drive memory (bytes), modem speed (baud), monitor size across the diagonal (inches), and the price before tax (dollars).
DATA computer;
INPUT @1 vendor $16. @17 harddriv modem monitor price;
/* Convert the hard drive memory to units of gigabytes. */
gigabyte=harddriv/1e9;
/* Convert the modem speed to kilobaud. */
kilobaud=modem/1000;
/* Assuming that the horizontal:vertical ratio of the
monitor
is 4:3, calculate the perimeter and area of the
monitor screen. Convert these to centimeters and
square centimeters, respectively. */
perimetr = (2*(3/5)*monitor + 2*(4/5)*monitor)*2.54;
area = (3/5)*monitor*(4/5)*monitor)*(2.54**2);
DATALINES;
Computer Doctor 3.5e9 . 14 1399
Micro-Center 2e9 33600 15 1598
DogBytes 5.25e9 56000 . 1518
Metalcomp 6.4e9 56000 17 2195
;
PROC PRINT DATA=computer NOOBS;
VAR vendor price gigabyte kilobaud perimetr area;
FORMAT price dollar5. perimetr area 3.;
RUN;
This produces the following output. Notice that new variables which must be calculated from missing values are also determined to be missing.
VENDOR PRICE GIGABYTE KILOBAUD PERIMETR AREA Computer Doctor $1399 3.50 . 100 303 Micro-Center $1598 2.00 33.6 107 348 DogBytes $1518 5.25 56.0 . . Metalcomp $2195 6.40 56.0 121 447
Many mathematical functions are contained in SAS. In statistical analyses, the log and square root functions are often used to transform data. LOG(x) calculates the natural (base e) logarithm of x. LOG10(x) and LOG2(x) calculate logarithms with bases 10 and 2, respectively. The square root function is SQRT(x). Examples are shown below.
logbact=log(bacteria); stderror=sqrt(variance/sampsize);
For working with dates, the MDY function is useful. For example,
birthday=mdy(b_mo, b_da,b_yr);
converts the month, day, and year of birth into a SAS date. The time elapsed (in days) can be calculated by subtracting one date value from another, and these can then be converted to weeks, hours, etc. by multiplying those differences by the appropriate factor.
SAS has many other functions to perform various calculations for trigonometry, finance, and other applications. To find the one you need, you may consult SAS documentation, or you can try to find it in the online help directory in SAS. Warning: SAS "help"doesn't. Looking up a topic in SAS is like looking up how to spell a word in the dictionary; you have to know how to spell the word in order to find it. For example, suppose you want to see how to calculate the absolute value of a variable in SAS. This would be useful for listing outliers in a data analysis problem. Within the SAS session, choose Help, then Extended help, or click on the icon of a book with a question mark. Next, click on the Contents tab and choose SAS System Help: Main Menu. From there, choose SAS Language, then SAS Functions, then Function Categories. Finally, choose Arithmetic Functions. You will then see a list of functions, the first of which is ABS - the absolute value function. Instead of clicking on the Contents tab to start, you could try clicking on the Index tab and typing "absolute value" in the search box. This will not lead you to anything. In fact, if you look up ABS in the Index, you will find help on PROC SQL, a SAS procedure that has nothing to do with absolute values!
SAS can also evaluate logical expressions. These are usually performed equalities and inequalities and with IF-THEN-ELSE statements which specify what action to take under each circumstance. The logical operators AND (&) and OR (|) may also be used to combine two or more expressions, and NOT (^) can be used to negate a statement. Examples of these statements are shown below.
DATA computer;
INPUT @1 vendor $16. @17 harddriv modem monitor price;
/* A customer wants to avoid DogBytes. */
IF vendor='DogBytes' THEN avoid=1;
ELSE avoid=0;
/* Equivalently, use the following:
if vendor^='DogBytes' then avoid=0;
else avoid=1; */
/* The customer wants lots of disk space. */
IF harddriv<=2e9 THEN diskspac=1;
ELSE IF 2e9<harddriv<=4e9 THEN diskspac=2;
ELSE IF harddriv>4e9 THEN diskspac=3;
/* Check the price range and monitor size simultaneously. */
IF (price<=1600 & monitor>=15) THEN goodbuy='Yes';
ELSE goodbuy='No ';
DATALINES;
Computer Doctor 3.5e9 . 14 1399
Micro-Center 2e9 33600 15 1598
DogBytes 5.25e9 56000 . 1518
Metalcomp 6.4e9 56000 17 2195
;
PROC FORMAT;
VALUE ynfmt 0='No' 1='Yes';
VALUE diskfmt 1='Too low' 2='High' 3='Very high';
PROC PRINT DATA=computer NOOBS;
VAR vendor avoid diskspac goodbuy;
FORMAT avoid ynfmt. diskspac diskfmt.;
RUN;
This produces the following output. Remember that SAS regards a missing number to be lower than any other number , so the missing monitor size for DogBytes is <15, not >=15. Consequently, the DogBytes computer is not a "good buy," even though its price is below $1600.
VENDOR AVOID DISKSPAC GOODBUY Computer Doctor No High No Micro-Center No Too low Yes DogBytes Yes Very high No Metalcomp No Very high No
You may only need to use a few variables from a larger dataset. This can be done with KEEP or DROP statements. Very rarely, you may need to do this to conserve memory or save processing time with large datasets. For many problems, you can keep unneeded variables in the dataset, and SAS can handle them with ease. Consider the following example:
DATA computer;
INPUT @1 vendor $16. @17 harddriv modem monitor price;
DATALINES;
Computer Doctor 3.5e9 . 14 1399
Micro-Center 2e9 33600 15 1598
DogBytes 5.25e9 56000 . 1518
Metalcomp 6.4e9 56000 17 2195
;
DATA costdata;
SET computer;
KEEP vendor price;
/* Equivalently, use the following:
drop harddriv modem monitor; */
RUN;
Here, the SET term is used to make a copy of the dataset COMPUTER called COSTDATA, and COSTDATA is subsequently modified with the KEEP statement.
More frequently, you will need to eliminate some observations from a dataset. When performing statistical calculations, SAS will, by default, use all of the observations that are in the dataset. IF statements specify which observations to keep, while IF and THEN DELETE will delete observations. See the following example.
DATA cheap;
SET computer;
IF price<=1600;
/* Equivalently, you could use:
IF price>1600 then delete; */
PROC PRINT DATA=cheap;
RUN;
If you want to apply a SAS PROC to selected observations in a dataset, you do not have to create a new dataset. Instead, you can use a WHERE statement after the PROC statement. WHERE statements are like IF statements; they require logical expressions to specify which observations to use. The following lines of code produce the same results as the program segment above.
PROC PRINT DATA=computer; WHERE price<=1600; RUN;
When using PROC PRINT with a WHERE statement, the OBS column of the printout refers to the observation number in the full dataset and not the sequence of observations selected in the WHERE statement. Thus, it is possible for the OBServation numbers to be unevenly spaced.
There are two different ways to combine datasets. The SET statement is used to add observations to a dataset, while the MERGE statement is used to add variables to a dataset. Consider the following example, using monthly rainfall totals in Gainesville in 1995 and 1996. When used to combine observations into one dataset, the SET command works as follows:
DATA rain95; INPUT month rainfall @@; DATALINES; 1 3.08 2 1.07 3 6.14 4 5.18 5 2.47 6 7.55 7 7.66 8 7.20 9 2.10 10 4.33 11 3.15 12 1.29 ; DATA rain96; INPUT month rainfall @@; DATALINES; 1 0.97 2 0.66 3 10.52 4 1.72 5 2.01 6 6.05 7 11.00 8 4.90 9 2.23 10 6.18 11 1.73 12 6.63 ; DATA rain9596; SET rain95 rain96; RUN;
This program correctly creates three datasets: RAIN95, RAIN96, and RAIN9596, which contains all 24 observations from both daatsets. However, RAIN9596 does not contain a variable for the year, so SAS would not be able to tell whether 0.97 inches fell in January 1995 or January 1996. This can be corrected by adding a YEAR variable to both datasets, as shown below:
DATA rain95; INPUT month rainfall @@; year=1995; DATALINES; 1 3.08 2 1.07 3 6.14 4 5.18 5 2.47 6 7.55 7 7.66 8 7.20 9 2.10 10 4.33 11 3.15 12 1.29 ; DATA rain96; INPUT month rainfall @@; year=1996; DATALINES; 1 0.97 2 0.66 3 10.52 4 1.72 5 2.01 6 6.05 7 11.00 8 4.90 9 2.23 10 6.18 11 1.73 12 6.63 ; DATA rain9596; SET rain95 rain96; RUN;
There is an alternate method to do this using IN variables.
DATA rain95; INPUT month rainfall @@; DATALINES; 1 3.08 2 1.07 3 6.14 4 5.18 5 2.47 6 7.55 7 7.66 8 7.20 9 2.10 10 4.33 11 3.15 12 1.29 ; DATA rain96; INPUT month rainfall @@; DATALINES; 1 0.97 2 0.66 3 10.52 4 1.72 5 2.01 6 6.05 7 11.00
8 4.90 9 2.23 10 6.18 11 1.73 12 6.63 ; DATA rain9596; SET rain95(IN=i95) rain96(IN=i96); IF (i95=1 & i96=0) THEN year=1995; ELSE IF (i95=0 & i96=1) THEN year=1996; PROC PRINT DATA=rain9596; RUN;
In the last few steps, SAS creates an invisible variable, I95, which is equal to 1 if the observation came from the 1995 data and is 0 otherwise. Likewise, I96 indicates if an observation came from the 1996 data. I95 and I96 are not stored in the dataset, but they can be used to define other variables within the same DATA step. The statements in the program are redundant, since I95=1 implies that I96=0 and vice versa. A few lines from the output of the above SAS program are as follows:
OBS MONTH RAINFALL YEAR 1 1 3.08 1995 2 2 1.07 1995 .............................. 13 1 0.97 1996 14 2 0.66 1996 ...............................If you want to see the values for I95 and I96 on a printout, you must assign new variables in the DATA step which are calculated from I95 and I96; for example:
newi95=i95; newi96=i96;
The MERGE statement adds the variables in one dataset to another dataset. Consider the following example using Southern teams in the National Basketball Association:
DATA nba1; INPUT @1 city $10. @11 division $; DATALINES; Orlando Atlantic Miami Atlantic Atlanta Central Charlotte Central ; DATA nba2; INPUT mascot $ @@; DATALINES; Magic Heat Hawks Hornets ; DATA nba3; MERGE nba1 nba2; PROC PRINT DATA=nba3; RUN;
In this example, the observations have already been ordered so that each dataset corresponds to the other. The two data sets do not have to have the same dimension. If DATA nba2 had only three values, then the 4th datum in nba3 would be Charlotte Central ., i.e., with the mascot missing for the undefined 4th column. The output is as follows:
Obs city division mascot 1 Orlando Atlantic Magic 2 Miami Atlantic Heat 3 Atlanta Central Hawks 4 Charlotte Central Hornets
You will often need to put datasets together based on values of variables which are included in both datasets. To do this, both datasets must first be sorted in order by the common variable or variables. Consider this example:
DATA nba1; INPUT @1 city $10. @11 division $; DATALINES; Orlando Atlantic Miami Atlantic Atlanta Central Charlotte Central ; DATA nba2; INPUT @1 city $10. @11 mascot $; DATALINES; Atlanta Hawks Orlando Magic Miami Heat Charlotte Hornets ; PROC SORT DATA=nba1; BY city; PROC SORT DATA=nba2; BY city; DATA nba3; MERGE nba1 nba2; BY city; RUN;
The output is the following.
Obs city division mascot 1 Atlanta Central Hawks 2 Charlotte Central Hornets 3 Miami Atlantic Heat 4 Orlando Atlantic Magic
Finally, consider the rainfall data. Earlier, we created a dataset with one observation for each month and year. For paired-data analyses such as paired t-tests or Wilcoxon signed-rank tests, it would be beneficial to have one observation for each month, with the 1995 and 1996 rainfall totals appearing on the same line of data. For this, we could merge the RAIN95 and RAIN96 datasets together. However, the name RAINFALL could not be used for totals in both years. The RENAME statement is useful in this situation.
DATA rain95;
INPUT month rainfall @@;
DATALINES;
1 3.08 2 1.07 3 6.14 4 5.18 5 2.47 6 7.55
7 7.66 8 7.20 9 2.10 10 4.33 11 3.15 12 1.29
;
DATA rain96;
INPUT month rainfall @@;
DATALINES;
1 0.97 2 0.66 3 10.52 4 1.72 5 2.01 6 6.05 7 11.00
8 4.90 9 2.23 10 6.18 11 1.73 12 6.63
;
/* The data are already sorted in the correct order, but
PROC SORT is included here to reinforce the idea that
both datasets must be sorted correctly. */
PROC SORT DATA=rain95;
BY month;
PROC SORT DATA=rain96;
BY month;
DATA rain9596;
MERGE rain95(RENAME=(rainfall=r95))
rain96(RENAME=(rainfall=r96));
BY month;
RUN;
The program produces the following output.
OBS MONTH R95 R96 1 1 3.08 0.97 2 2 1.07 0.66 3 3 6.14 10.52 4 4 5.18 1.72 5 5 2.47 2.01 6 6 7.55 6.05 7 7 7.66 11.00 8 8 7.20 4.90 9 9 2.10 2.23 10 10 4.33 6.18 11 11 3.15 1.73 12 12 1.29 6.63
You can rename more than one variable such as RENAME=(rainfall=r96 other=x96)).