Lesson 3: Special Types of Data and Working with Spreadsheets

Missing data

If you are not using column input or column pointers, you may need to specify that a particular value in a dataset is missing. SAS regards a single period (.) as a missing numeric value and a blank space as a missing character and numeric values (only in column input). SAS also uses these conventions when printing data or results of calculations. When SAS analyzes data with missing values, any observations with missing values of variables required in the analyses are automatically excluded from consideration.

In Lesson 2, an example involving students' grades was used to illustrate various ways to input data. Suppose that Eve, another student, has not yet taken the exam and has not completed her project because she has been absent for four days. The dataset of grades may appear as follows in the text file A:\GRADES.TXT.

Ann   84 90 A- 0
Bill  78 84 B  0
Cathy 95 89 A  1
David 84 88 B+ 1
Eve   79       4
This dataset could not be constructed properly using list input; SAS would interpret the 4 as the exam grade, with missing values for the project grade and absences. Instead, column input could be used with explicit labeling of columns for each variable.
DATA grades;
INFILE 'a:\grades.txt';
INPUT name $ 1-5 quiz 7-8 exam 10-11 project $ 13-14 absences
16;
Alternatively, you could enter the data with list input, but you would need to substitute values for the blanks in Eve's row of data. By letting ? be a code for an incomplete project grade and using a period for the missing exam score, the following could be used:
DATA grades;
INPUT name $ quiz exam project $ absences;
DATALINES;
Ann    84 90   A-  0
Bill  78  84  B   0
Cathy   95 89 A  1
David   84 88 B+ 1
Eve   79 .   ?   4
;
Now, suppose that the data are stored in A:\GRADES.TXT as follows:
Ann,84,90,A-,0
Bill,78,84,B,0
Cathy,95,89,A,1
David,84,88,B+,1
Eve,79,,,4
Then, the instructions to SAS must specify that commas are used to separate variables, and consecutive commas indicate a missing value. The DSD (Delimiter Separated Data) option in the INFILE statement is used to specify the latter, as shown below.
DATA grades;
INFILE 'a:\grades.txt' DELIMITER=',' DSD;
INPUT name $ quiz exam project $ absences;
If the last line of data had been written as 'Eve,79,.,I,4', then the same statements could be used with or without the DSD option.

When using an INFILE statement with incomplete data, you may need to use the MISSOVER option to tell SAS to fill in missing values at the end of a line. For example, suppose that the teacher of the class did not mark anything for absences when the student was always present, and the data set A:\GRADES.TXT looks like this:

Ann   84 90 A-
Bill  78 84 B
Cathy 95 89 A  1
David 84 88 B+ 1
Consider this program:
DATA grades;
INFILE 'a:\grades.txt';
INPUT name $ quiz exam project $ absences;
For Ann, SAS would correctly read the name and the three grades. However, since there is no data for absences, SAS would skip ahead to the next line and try to read "Bill" as the number of Ann's absences, causing an error. The following program with MISSOVER will correctly assign missing values for absences to Ann and Bill.
DATA grades;
INFILE 'a:\grades.txt' MISSOVER;
INPUT name $ quiz exam project $ absences;

Numeric data

In all of the examples presented previously, the numeric variables have been simple positive integers or decimals. SAS can also read negative numbers and numbers written in scientific notation, and no special instructions are necessary. For negative numbers, just enter a minus sign hyphen immediately before the first digit or decimal point. For example, the number -0.03 can be written as -0.03 or just -.03 . For numbers in scientific notation, use the letter e to represent the phrase "times ten to the power of." (This is not the same as the number e=2.71828..., 1.7e6=1.7x106.) The following example shows one way to create a dataset for a hypothetical experiment involving counts of bacteria in petri dishes at varying temperatures in degrees Celsius.
DATA bacteria;
INPUT temperat replicat count @@;
DATALINES;
-2 1 2.6e3   -1 1 1.4e4   -.5 1 1.7e6 
-2 2 5.9e3   -1 2 4.7e4   -.5 2 5.0e5
-2 3 1.4e2   -1 3 2.9e5   -.5 3 5.4e6
;

Informats

Some types of data require special instructions for SAS to interpret them correctly. Examples include numbers with commas (4,162,385), dates (7/4/1776), and times of day (2:00). When such values appear in the lines of data, you must supply an appropriate informat for them. An informat is an instruction which tells SAS how the corresponding variable should be interpreted. The informat appears after the name of its variable in the INPUT statement and contains information about the type of variable, the length of the variable, and interpretations of such characters as commas and decimal points. Informats end with periods, to distinguish them from variable names.

A few selected informats are described below. For a more comprehensive list of informats, see SAS® Language: Reference, Version 6, First Edition.

Character informats

Consider the following data from Lesson 2:
Ann       84 90 A- 0
 Bill     78 84 B  0
Catherine 95 89 A  1
David     84 88 B+ 1
Previously, the following INPUT statement was used:
INPUT name $ 1-9 quiz test project $ absences;
Another way to construct the same dataset is to use an informat to specify that the student's first names can contain up to 9 letters. The informats $9. and $2. can be used as follows:
DATA grades;
INPUT name $9. quiz exam project $2. absences;
DATALINES;
Ann       84 90 A- 0
 Bill     78 84 B  0
Catherine 95 89 A  1
David     84 88 B+ 1
;
The $9. is needed to correctly capture all of the students' names, but the $2. could have been replaced with a single dollar sign. Recall that SAS implicitly assumes that character variables will contain eight or fewer characters unless specified otherwise (newer version may relax this requirement). The informats $CHAR9. and $CHAR2. could also be used. The difference between $9. and $CHAR9. is that $9. removes all blanks at the beginning of the variable, while $CHAR9. keeps them in place. With $CHAR9., the name "Bill" would appear in the dataset as " Bill."

Numeric informats

The following data appear in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, Upper Saddle River, NJ, 1997). The variables are, respectively, state name, violent crimes (per 100,000 people), murders (per 1,000,000 people), and the percentage of the population living in metropolitan areas, for some Southern states.
Florida        1,206  89 93.0%
Georgia          723 114 67.7%
South Carolina 1,023 103 69.8%
Alabama          780 116 67.4%
Mississippi      434 135 30.7%
Louisiana      1,062 203 75.0%
Consider the following lines of SAS code:
DATA crime;
INPUT state $14. @16 violent comma5.0 murder 4.1 metro
percent5.;
DATALINES;
Florida        1,206  89 93.0%
Georgia          723 114 67.7%
South Carolina 1,023 103 69.8%
Alabama          780 116 67.4%
Mississippi      434 135 30.7%
;
Several informats are used here. First, $14. is used to specify that the state name can have up to 14 characters. @16 moves the line pointer to column 16. Next, comma5.0 specifies that VIOLENT is a numeric variable with embedded commas that may be written with one to five characters (5) and no digits after the decimal point (.0). Without an appropriate informat, SAS will assume that the violent crime rates for Florida and South Carolina are missing.

In the data, the MURDER variable is expressed as a proportion of 1,000,000 people. To be consistent with the VIOLENT variable, it can be written in terms of 100,000 people. One way to do this is to insert a decimal point before the last digit; for example, Florida's rate of 89/1,000,000 is equivalent to 8.9/100,000. The informat 4.1 tells SAS to read in the number for murders, then write it so that it has 4 characters (including the decimal point), with one digit after the decimal place. Thus, 89 becomes 8.9 and 114 becomes 11.4. If you wanted the rates with denominators of 1,000 people, so that 89/1,000,000 appeared in the data as .089, you would use the informat 4.3 instead. Finally, percent5. tells SAS to strip the percent signs from METRO. After reading the data with the informats in the program above, SAS interprets the data as follows:

OBS    STATE             VIOLENT    MURDER    METRO
 1     Florida             1206       8.9      93.0
 2     Georgia              723      11.4      67.7
 3     South Carolina      1023      10.3      69.8
 4     Alabama              780      11.6      67.4
 5     Mississippi          434      13.5      30.7

Date and time informats

The capability to perform calculations with dates and times is one of the major advantages that SAS has over other statistical software packages, but manipulating time variables is somewhat complicated. Consider the Gators' schedule of conference basketball games in January 1998.
1/3/98   Tennessee       7:00 Men
1/4/98   Kentucky        3:00 Women
1/7/98  @Ole Miss        8:00 Men
1/7/98   Georgia         7:00 Women
1/10/98 @Tennessee       7:00 Women
1/10/98  Georgia         2:00 Men
1/14/98 @Arkansas        8:05 Men
1/14/98 @Auburn          8:00 Women
1/17/98  South Carolina  8:00 Women
1/17/98  LSU            12:00 Men
1/24/98  Vanderbilt      1:00 Men
1/24/98 @LSU             3:00 Women
1/28/98 @Kentucky        7:00 Women
1/28/98 @South Carolina  7:30 Men
As written above, in the format month/day/year, a date can span as few as 6 characters (1/1/98) or as many as 8 characters (12/31/98). The times present another problem. Obviously, a 1:00 game will be played in the afternoon, but SAS doesn't know that. The starting times must be written in 24-hour or military time. The following statements could be used to read and print the data.
DATA schedule;
INPUT @1 gameday mmddyy8. @9 location $1. @10 opponent $14. @25
gametime time5. @31 team $;
DATALINES;
1/3/98   Tennessee      19:00 Men
1/4/98   Kentucky       15:00 Women
1/7/98  @Ole Miss       20:00 Men
1/7/98   Georgia        19:00 Women
1/10/98 @Tennessee      19:00 Women
1/10/98  Georgia        14:00 Men
1/14/98 @Arkansas       20:05 Men
1/14/98 @Auburn         20:00 Women
1/17/98  South Carolina 20:00 Women
1/17/98  LSU            12:00 Men
1/24/98  Vanderbilt     13:00 Men
1/24/98 @LSU            15:00 Women
1/24/00 @Kentucky       19:00 Women
2/24/02 @South Carolina 19:30 Men
;
PROC PRINT DATA=schedule;
RUN;
The mmddyy8. informat tells SAS that the date appears in the order month (mm), day (dd), and year (yy), and that eight or fewer characters are used to write the date. The time5. informat says that the time of day is written with 5 characters; more characters would be needed if the time included hours, minutes, and seconds.

The program above produces the following output:

OBS    GAMEDAY    LOCATION  OPPONENT        GAMETIME  TEAM

  1     13882               Tennessee         68400   Men
  2     13883               Kentucky          54000   Women
  3     13886        @      Ole Miss          72000   Men
  4     13886               Georgia           68400   Women
  5     13889        @      Tennessee         68400   Women
  6     13889               Georgia           50400   Men
  7     13893        @      Arkansas          72300   Men
  8     13893        @      Auburn            72000   Women
  9     13896               South Carolina    72000   Women
 10     13896               LSU               43200   Men
 11     13903               Vanderbilt        46800   Men
 12     13903        @      LSU               54000   Women
 13     14633        @      Kentucky          68400   Women
 14     15395        @      South Carolina    70200   Men
Huh? What happened?

SAS regards a date as the number of days since January 1, 1960. (Why? It's a closely guarded secret. It was probably some SAS programmer's kid's birthday or something like that.) Days before January 1, 1960 are stored as negative integers, and SAS automatically compensates for leap years. This way of storing dates is actually quite useful, since finding the time elapsed between two dates requires only subtracting two numbers. Likewise, times of day are stored as the number of seconds past midnight. SAS can print the dates and times in more recognizable ways with FORMAT statements; these will be described later.

You may need to experiment with different ways to create a given SAS dataset with dates and/or times. Often, column input or line pointers must be used to enter these values correctly. A convenient way of entering dates in SAS that allows easy list input is to enter the month, day, and year as three separate numeric variables, then create a date from them with the MDY() function. This will be discussed later.

The following example shows the flexibility and options in SAS dates input and output. Note that the output by format will be studied int he next section.

DATA datinput; INPUT name $ @10 date1 DATE10. @21 date2 mmddyy8. 
                     @31 date3 ddmmyy10.;
DATALINES;
John     01 jan 73  02/01/95  02-04-94
Mary     04 Aug 74  12-01-94  02/07/99
Jack     12 feb 75  02 11 95  12-11-95
Steve    15 sep 83  2/1/02    03 04 2001 
PROC PRINT data=datinput;
FORMAT date1 mmddyy10. date2 weekdate17. date3 worddate18.;

The ouput is:

 
 OBS    NAME          DATE1                DATE2                 DATE3

  1     John     01/01/1973     Wed, Feb 1, 1995         April 2, 1994
  2     Mary     08/04/1974     Thu, Dec 1, 1994          July 2, 1999
  3     Jack     02/12/1975    Sat, Feb 11, 1995     November 12, 1995
  4     Steve    09/15/1983     Tue, Jan 22, 2002        April 3, 2001


Converting spreadsheets to SAS datasets

A convenient way to prepare data for subsequent analysis within SAS is to prepare a spreadsheet of the data using Microsoft Excel, Corel Quattro Pro, or another spreadsheet program, then convert those spreadsheets to ASCII text files that can be read by SAS with an INFILE statement. Now that you know how SAS reads data from files, you can arrange the data to make it easy to read in SAS. Use the following guidelines.
INFILE 'a:\example.txt' LRECL=150;

Good example

Consider the grades that some doctors on TV may have made in their first anatomy class. A spreadsheet that would be easy to transfer to SAS would look like this:
First name Last name Exam 1 Exam 2 Exam 3 Final exam
Beverly Crusher 85 99 68 94
Victor  Ehrlich 69 . . .
Joel Fleischman 77 92 88 90
Michaela Quinn 78 93 84 85
Kerry  Weaver 89 84 76 84
Frank Burns 49 68 64 59
In the spreadsheet program, you could then choose to save the spreadsheet as an ASCII DOS text file, such as A:\TVDOCS.TXT. To save disk space, you could also save the file in a smaller file using comma- or tab-delimited format. Then, use the appropriate input technique for the file you have created.

Bad example

 
First name   Beverly Victor  Joel Michaela Kerry  Major 

Frank

Last name   Crusher Ehrlich  Fleischman Quinn Weaver Burns
               
Exam 1   85 69 77 78 89 49
Exam 2   99 absent 92 93 84 68
Exam 3   68 sick  88 84 76 64
Final exam   94 90 85 84 59
Averages   86.5 69 86.75 85 83.25 60
               
            *should be a pastry chef 
 

Converting SAS datasets to spreadsheets

You may need to transfer data from SAS to a spreadsheet program or some other program. For example, you may want to make a color graphics plot of your data with a spreadsheet. (SAS can do this, but the SAS graphics language is very difficult to learn.) Newer versions of SAS have the capability to automatically export datasets in files that can be recognized by some spreadsheets, such as dBASE and Microsoft Excel. However, this requires the installation of several optional SAS functions, and you may not have these on your system.

An easy way to export data is to use PROC PRINT to print the data with variable names over the columns, then cut the output from the SAS window and paste it into the spreadsheet. However, this will not always work. For example, if the dataset has many variables, SAS will try to list all of those variables for at least one observation on each page, and the pages may not be separated at appropriate places. The following method works for any dataset and any version of SAS.

Suppose that you want to export the dataset of crime rates in Southern states to a comma-delimited file, which could subsequently be transferred to a spreadsheet program. One way to do this is shown below.

DATA crime;
INPUT state $14. @16 violent comma5.0 murder 4.1 metro
percent5.;
DATALINES;
Florida        1,206  89 93.0%
Georgia          723 114 67.7%
South Carolina 1,023 103 69.8%
Alabama          780 116 67.4%
Mississippi      434 135 30.7%
;
DATA _null_;
SET crime;
FILE 'a:\crimes.csv';
PUT state ',' violent ',' murder ',' metro;
RUN;
_NULL_ is a special dataset name that tells SAS that operations to a dataset are to be performed, but the results do not have to be stored in memory. After running the program, the following text appears in CRIMES.CSV on Drive A.
Florida ,1206 ,8.9 ,93
Georgia ,723 ,11.4 ,67.7
South Carolina ,1023 ,10.3 ,69.8
Alabama ,780 ,11.6 ,67.4
Mississippi ,434 ,13.5 ,30.7
This file could easily be imported into a spreadsheet which recognizes comma-delimited files. For example, in Microsoft Excel, you could simply choose File, then Open..., then find and double-click on the name of the file produced by SAS. Files with the .CSV extension are recognized as comma-delimited, and Excel will properly arrange the data into 5 rows and 4 columns.


Homework problems for this lesson

Return to STA 5106 home page