In Lesson 2, an example involving students' grades was used to illustrate various ways to input data. Suppose that Eve, another student, has not yet taken the exam and has not completed her project because she has been absent for four days. The dataset of grades may appear as follows in the text file A:\GRADES.TXT.
Ann 84 90 A- 0 Bill 78 84 B 0 Cathy 95 89 A 1 David 84 88 B+ 1 Eve 79 4This dataset could not be constructed properly using list input; SAS would interpret the 4 as the exam grade, with missing values for the project grade and absences. Instead, column input could be used with explicit labeling of columns for each variable.
DATA grades; INFILE 'a:\grades.txt'; INPUT name $ 1-5 quiz 7-8 exam 10-11 project $ 13-14 absences 16;Alternatively, you could enter the data with list input, but you would need to substitute values for the blanks in Eve's row of data. By letting ? be a code for an incomplete project grade and using a period for the missing exam score, the following could be used:
DATA grades; INPUT name $ quiz exam project $ absences; DATALINES; Ann 84 90 A- 0 Bill 78 84 B 0 Cathy 95 89 A 1 David 84 88 B+ 1 Eve 79 . ? 4 ;Now, suppose that the data are stored in A:\GRADES.TXT as follows:
Ann,84,90,A-,0 Bill,78,84,B,0 Cathy,95,89,A,1 David,84,88,B+,1 Eve,79,,,4Then, the instructions to SAS must specify that commas are used to separate variables, and consecutive commas indicate a missing value. The DSD (Delimiter Separated Data) option in the INFILE statement is used to specify the latter, as shown below.
DATA grades; INFILE 'a:\grades.txt' DELIMITER=',' DSD; INPUT name $ quiz exam project $ absences;If the last line of data had been written as 'Eve,79,.,I,4', then the same statements could be used with or without the DSD option.
When using an INFILE statement with incomplete data, you may need to use the MISSOVER option to tell SAS to fill in missing values at the end of a line. For example, suppose that the teacher of the class did not mark anything for absences when the student was always present, and the data set A:\GRADES.TXT looks like this:
Ann 84 90 A- Bill 78 84 B Cathy 95 89 A 1 David 84 88 B+ 1Consider this program:
DATA grades; INFILE 'a:\grades.txt'; INPUT name $ quiz exam project $ absences;For Ann, SAS would correctly read the name and the three grades. However, since there is no data for absences, SAS would skip ahead to the next line and try to read "Bill" as the number of Ann's absences, causing an error. The following program with MISSOVER will correctly assign missing values for absences to Ann and Bill.
DATA grades; INFILE 'a:\grades.txt' MISSOVER; INPUT name $ quiz exam project $ absences;
DATA bacteria; INPUT temperat replicat count @@; DATALINES; -2 1 2.6e3 -1 1 1.4e4 -.5 1 1.7e6 -2 2 5.9e3 -1 2 4.7e4 -.5 2 5.0e5 -2 3 1.4e2 -1 3 2.9e5 -.5 3 5.4e6 ;
A few selected informats are described below. For a more comprehensive list of informats, see SAS® Language: Reference, Version 6, First Edition.
Ann 84 90 A- 0 Bill 78 84 B 0 Catherine 95 89 A 1 David 84 88 B+ 1Previously, the following INPUT statement was used:
INPUT name $ 1-9 quiz test project $ absences;Another way to construct the same dataset is to use an informat to specify that the student's first names can contain up to 9 letters. The informats $9. and $2. can be used as follows:
DATA grades; INPUT name $9. quiz exam project $2. absences; DATALINES; Ann 84 90 A- 0 Bill 78 84 B 0 Catherine 95 89 A 1 David 84 88 B+ 1 ;The $9. is needed to correctly capture all of the students' names, but the $2. could have been replaced with a single dollar sign. Recall that SAS implicitly assumes that character variables will contain eight or fewer characters unless specified otherwise (newer version may relax this requirement). The informats $CHAR9. and $CHAR2. could also be used. The difference between $9. and $CHAR9. is that $9. removes all blanks at the beginning of the variable, while $CHAR9. keeps them in place. With $CHAR9., the name "Bill" would appear in the dataset as " Bill."
Florida 1,206 89 93.0% Georgia 723 114 67.7% South Carolina 1,023 103 69.8% Alabama 780 116 67.4% Mississippi 434 135 30.7% Louisiana 1,062 203 75.0%Consider the following lines of SAS code:
DATA crime; INPUT state $14. @16 violent comma5.0 murder 4.1 metro percent5.; DATALINES; Florida 1,206 89 93.0% Georgia 723 114 67.7% South Carolina 1,023 103 69.8% Alabama 780 116 67.4% Mississippi 434 135 30.7% ;Several informats are used here. First, $14. is used to specify that the state name can have up to 14 characters. @16 moves the line pointer to column 16. Next, comma5.0 specifies that VIOLENT is a numeric variable with embedded commas that may be written with one to five characters (5) and no digits after the decimal point (.0). Without an appropriate informat, SAS will assume that the violent crime rates for Florida and South Carolina are missing.
In the data, the MURDER variable is expressed as a proportion of 1,000,000 people. To be consistent with the VIOLENT variable, it can be written in terms of 100,000 people. One way to do this is to insert a decimal point before the last digit; for example, Florida's rate of 89/1,000,000 is equivalent to 8.9/100,000. The informat 4.1 tells SAS to read in the number for murders, then write it so that it has 4 characters (including the decimal point), with one digit after the decimal place. Thus, 89 becomes 8.9 and 114 becomes 11.4. If you wanted the rates with denominators of 1,000 people, so that 89/1,000,000 appeared in the data as .089, you would use the informat 4.3 instead. Finally, percent5. tells SAS to strip the percent signs from METRO. After reading the data with the informats in the program above, SAS interprets the data as follows:
OBS STATE VIOLENT MURDER METRO 1 Florida 1206 8.9 93.0 2 Georgia 723 11.4 67.7 3 South Carolina 1023 10.3 69.8 4 Alabama 780 11.6 67.4 5 Mississippi 434 13.5 30.7
1/3/98 Tennessee 7:00 Men 1/4/98 Kentucky 3:00 Women 1/7/98 @Ole Miss 8:00 Men 1/7/98 Georgia 7:00 Women 1/10/98 @Tennessee 7:00 Women 1/10/98 Georgia 2:00 Men 1/14/98 @Arkansas 8:05 Men 1/14/98 @Auburn 8:00 Women 1/17/98 South Carolina 8:00 Women 1/17/98 LSU 12:00 Men 1/24/98 Vanderbilt 1:00 Men 1/24/98 @LSU 3:00 Women 1/28/98 @Kentucky 7:00 Women 1/28/98 @South Carolina 7:30 MenAs written above, in the format month/day/year, a date can span as few as 6 characters (1/1/98) or as many as 8 characters (12/31/98). The times present another problem. Obviously, a 1:00 game will be played in the afternoon, but SAS doesn't know that. The starting times must be written in 24-hour or military time. The following statements could be used to read and print the data.
DATA schedule; INPUT @1 gameday mmddyy8. @9 location $1. @10 opponent $14. @25 gametime time5. @31 team $; DATALINES; 1/3/98 Tennessee 19:00 Men 1/4/98 Kentucky 15:00 Women 1/7/98 @Ole Miss 20:00 Men 1/7/98 Georgia 19:00 Women 1/10/98 @Tennessee 19:00 Women 1/10/98 Georgia 14:00 Men 1/14/98 @Arkansas 20:05 Men 1/14/98 @Auburn 20:00 Women 1/17/98 South Carolina 20:00 Women 1/17/98 LSU 12:00 Men 1/24/98 Vanderbilt 13:00 Men 1/24/98 @LSU 15:00 Women 1/24/00 @Kentucky 19:00 Women 2/24/02 @South Carolina 19:30 Men ; PROC PRINT DATA=schedule; RUN;The mmddyy8. informat tells SAS that the date appears in the order month (mm), day (dd), and year (yy), and that eight or fewer characters are used to write the date. The time5. informat says that the time of day is written with 5 characters; more characters would be needed if the time included hours, minutes, and seconds.
The program above produces the following output:
OBS GAMEDAY LOCATION OPPONENT GAMETIME TEAM 1 13882 Tennessee 68400 Men 2 13883 Kentucky 54000 Women 3 13886 @ Ole Miss 72000 Men 4 13886 Georgia 68400 Women 5 13889 @ Tennessee 68400 Women 6 13889 Georgia 50400 Men 7 13893 @ Arkansas 72300 Men 8 13893 @ Auburn 72000 Women 9 13896 South Carolina 72000 Women 10 13896 LSU 43200 Men 11 13903 Vanderbilt 46800 Men 12 13903 @ LSU 54000 Women 13 14633 @ Kentucky 68400 Women 14 15395 @ South Carolina 70200 MenHuh? What happened?
SAS regards a date as the number of days since January 1, 1960. (Why? It's a closely guarded secret. It was probably some SAS programmer's kid's birthday or something like that.) Days before January 1, 1960 are stored as negative integers, and SAS automatically compensates for leap years. This way of storing dates is actually quite useful, since finding the time elapsed between two dates requires only subtracting two numbers. Likewise, times of day are stored as the number of seconds past midnight. SAS can print the dates and times in more recognizable ways with FORMAT statements; these will be described later.
You may need to experiment with different ways to create a given SAS dataset with dates and/or times. Often, column input or line pointers must be used to enter these values correctly. A convenient way of entering dates in SAS that allows easy list input is to enter the month, day, and year as three separate numeric variables, then create a date from them with the MDY() function. This will be discussed later.
The following example shows the flexibility and options in SAS dates input and output. Note that the output by format will be studied int he next section.
DATA datinput; INPUT name $ @10 date1 DATE10. @21 date2 mmddyy8.
@31 date3 ddmmyy10.;
DATALINES;
John 01 jan 73 02/01/95 02-04-94
Mary 04 Aug 74 12-01-94 02/07/99
Jack 12 feb 75 02 11 95 12-11-95
Steve 15 sep 83 2/1/02 03 04 2001
PROC PRINT data=datinput;
FORMAT date1 mmddyy10. date2 weekdate17. date3 worddate18.;
The ouput is:
OBS NAME DATE1 DATE2 DATE3 1 John 01/01/1973 Wed, Feb 1, 1995 April 2, 1994 2 Mary 08/04/1974 Thu, Dec 1, 1994 July 2, 1999 3 Jack 02/12/1975 Sat, Feb 11, 1995 November 12, 1995 4 Steve 09/15/1983 Tue, Jan 22, 2002 April 3, 2001
INFILE 'a:\example.txt' LRECL=150;
| First name | Last name | Exam 1 | Exam 2 | Exam 3 | Final exam |
| Beverly | Crusher | 85 | 99 | 68 | 94 |
| Victor | Ehrlich | 69 | . | . | . |
| Joel | Fleischman | 77 | 92 | 88 | 90 |
| Michaela | Quinn | 78 | 93 | 84 | 85 |
| Kerry | Weaver | 89 | 84 | 76 | 84 |
| Frank | Burns | 49 | 68 | 64 | 59 |
| First name | Beverly | Victor | Joel | Michaela | Kerry | Major
Frank |
|
| Last name | Crusher | Ehrlich | Fleischman | Quinn | Weaver | Burns | |
| Exam 1 | 85 | 69 | 77 | 78 | 89 | 49 | |
| Exam 2 | 99 | absent | 92 | 93 | 84 | 68 | |
| Exam 3 | 68 | sick | 88 | 84 | 76 | 64 | |
| Final exam | 94 | 90 | 85 | 84 | 59 | ||
| Averages | 86.5 | 69 | 86.75 | 85 | 83.25 | 60 | |
| *should be a pastry chef |
An easy way to export data is to use PROC PRINT to print the data with variable names over the columns, then cut the output from the SAS window and paste it into the spreadsheet. However, this will not always work. For example, if the dataset has many variables, SAS will try to list all of those variables for at least one observation on each page, and the pages may not be separated at appropriate places. The following method works for any dataset and any version of SAS.
Suppose that you want to export the dataset of crime rates in Southern states to a comma-delimited file, which could subsequently be transferred to a spreadsheet program. One way to do this is shown below.
DATA crime; INPUT state $14. @16 violent comma5.0 murder 4.1 metro percent5.; DATALINES; Florida 1,206 89 93.0% Georgia 723 114 67.7% South Carolina 1,023 103 69.8% Alabama 780 116 67.4% Mississippi 434 135 30.7% ; DATA _null_; SET crime; FILE 'a:\crimes.csv'; PUT state ',' violent ',' murder ',' metro; RUN;_NULL_ is a special dataset name that tells SAS that operations to a dataset are to be performed, but the results do not have to be stored in memory. After running the program, the following text appears in CRIMES.CSV on Drive A.
Florida ,1206 ,8.9 ,93 Georgia ,723 ,11.4 ,67.7 South Carolina ,1023 ,10.3 ,69.8 Alabama ,780 ,11.6 ,67.4 Mississippi ,434 ,13.5 ,30.7This file could easily be imported into a spreadsheet which recognizes comma-delimited files. For example, in Microsoft Excel, you could simply choose File, then Open..., then find and double-click on the name of the file produced by SAS. Files with the .CSV extension are recognized as comma-delimited, and Excel will properly arrange the data into 5 rows and 4 columns.