In the previous lesson, the following statements were used to create a dataset.
DATA oranges;This is an example of column input, in which the columns of text in which data are stored are explicitly furnished to SAS. It may be impractical to follow this example for all datasets. SAS has facilities for reading data stored in many different ways; some of these facilities are described below.
INPUT state $ 1-10 early 12-14 late 16-18;
DATALINES;
Florida 130 90
California 37 26
Texas 1.3 .15
Arizona .65 .85
;
Projected Orange Yields in October 1997
State Early Late
Florida 130 90
California 37 26
Texas 1.3 .15
Arizona .65 .85
Based on information obtained from the
Florida Agricultural Statistics Service
The following lines of code could be used to read the dataset:
DATA oranges;The INFILE statement replaces DATALINES. INFILE gives the location of the external text file, including the drive name and any subdirectories. The FIRSTOBS option tells SAS to skip the first two lines of the file and to begin reading data on line 3. The OBS option tells SAS that Line 6 is the last line which contains legitimate data. If the data were edited in a text file by removing the top two lines and the bottom two lines, so that the text file contained only data. Then, the INFILE statement listed below would be sufficient.
INFILE 'a:\yields\oranges.dat' FIRSTOBS=3 OBS=6;
INPUT state $ 1-10 early 12-14 late 16-18;
INFILE 'a:\yields\oranges.dat';
FILENAME kodiak url 'http://lib.stat.cmu.edu:80/crab/eggs';In these statements, kodiak is a nickname used by SAS to refer to the longer Internet address. The statements create a dataset called eggdata which contains two variables, year and numeggs.
DATA eggdata;
INFILE kodiak;
INPUT year 1-2 numeggs 4-9;
SAS can also read data from FTP sites, but you must supply the appropriate information about the FTP site address, subdirectories, user names, and passwords. This example shows how to obtain the eggs dataset by anonymous FTP.
FILENAME kodiak ftp 'eggs' cd='/crab/'Of course, it would be easy to find the eggs dataset in a Web browser, save the file as a text file on your computer, and use the methods for reading data from a text file. However, this method would be useful if the data files are very large. It is also convenient if data files are continually updated; example include stock market data, weather data, and batting averages.
user='anonymous' pass='guest' host='lib.stat.cmu.edu';
DATA eggdata;
INFILE kodiak;
INPUT year 1-2 numeggs 4-9;
A LIBNAME statement is needed to create a permanent dataset or to read one that has already been created. The LIBNAME is a surrogate name for the location on a disk where the permanent dataset is or will be stored. For example, consider the following statement:
LIBNAME college 'A:';This prepares SAS to look in Drive A (your diskette) for permanent datasets. The name college is called a libref (library reference). The names that you can supply for librefs follow the same rules as dataset and variable names; however, you should not use the names LIBRARY, WORK, USER, or anything starting with the three letters SAS, since these are reserved for special uses within SAS.
For example, the following statements create a permanent SAS dataset.
LIBNAME college 'a:';This creates the new file ENROLLED.SD2 in your diskette (a: drive). This is the new permanent SAS dataset. Only SAS will be able to interpret this file; you will not be able to see its contents by using a word processor or spreadsheet program. The two statements after data work like the following: "DATA name1; SET name2;" means to create a data set name1 from dataset name2. The "IF ... THEN DELETE;" statement does the obvious operation. We will spend more time on them in Lesson 5. Thus, the data set college.enrolled has only three obeservations with class='day'. In order to use the dataset that was just created, you must refer to it with its full name, college.enrolled . To retrieve this data set from your diskette:
DATA original;
INPUT dept $ 1-8 count 10-13 class $ 15-21;
DATALINES;
FineArts 449 day
Science 1411 day
Music 259 evening
Language 759 day
;
DATA college.enrolled;SET original;
IF class='evening' THEN DELETE;
PROC PRINT;
RUN;
LIBNAME college 'A:';You will see the data college.enrolled retrieve from the diskette and printed.
DATA tempenrl;SET college.enrolled;
PROC PRINT DATA=tempenrl;
RUN;
DATA grades;This would not work correctly with the following data:
INPUT student $ quiz test project $ absences;
DATALINES;
Ann 84 90 A- 0
Bill 78 84 B 0
Cathy 95 89 A 1
David 84 88 B+ 1
;
Ann/84/90/A-/0Then, the following statements could be used to create the GRADES dataset.
Bill/78/84/B/0
Cathy/95/89/A/1
David/84/88/B+/1
DATA grades;The phrase 'dlm=' can be used in place of 'delimiter='. This option is used if a keyboard character, such as a comma, slash, or asterisk, separates values in a line. Of course, the character used to separate variables should not appear within a data value. For example, in a comma-delimited file, the number 125,000 would have to be written as 125000; otherwise, SAS would try to break it apart into two variables with values 125 and 000.
INFILE 'a:\grades.txt' delimiter='/';
INPUT name $ quiz test project $ absences;
Tabs are an exception to the use of the DELIMITER option. In a tab-delimited file, the EXPANDTABS option replaces the DELIMITER option. In the example, if tabs had been used in place of slashes, the proper statement for reading the data would be:
INFILE 'a:\grades.txt' expandtabs;
Ann 84 90 A- 0Creating a dataset from this file would be easy to read with column input and even easier with list input. However, suppose that you only needed to use the students' names and project grades. You can use column pointers to skip over undesired data. Column pointers use the @ symbol to tell SAS to begin reading data at a specified column. In this example, we need to know that the names start in Column 1; project grades, in Column 13. The following statements could be used.
Bill 78 84 B 0
Cathy 95 89 A 1
David 84 88 B+ 1
DATA grades;
INFILE 'a:\grades.txt';The quiz grade, test grade, and absences do not appear in this dataset. As shown above, column pointers can be used to skip over unneeded data.
INPUT @1 name $ @13 project $;
Ann 84 90 A- 0Recall that list input can be used only when character variables have 8 or fewer characters, with no blanks. Catherine has 9 letters, so simple list input cannot be used. However, the absences are not neatly aligned in a column, and the last four variables would be easy to read with list input. The following statements could be used:
Bill 78 84 B 0
Catherine 95 89 A 1
David 84 88 B+ 1
DATA grades;It is also possible to use list input, column input, and column pointers simultaneously. For example, if you only needed the name, project grade, and absences, you could use the following INPUT statement:
INFILE 'a:\grades.txt';
INPUT name $ 1-9 quiz test project $ absences;
INPUT name $ 1-9 @17 project $ absences;
AnnYou may need to use a line pointer to read such data. A line pointer is like a column pointer, except that it specifies the line on which SAS should begin reading the data. A slash (/) tells SAS to skip to the next line, and #number tell SAS to go to that line of an observation's data to resume reading data. If the data above were stored in A:\GRADES.TXT, a dataset could be created in SAS as follows:
84 90 A- 0
Bill
78 84 B 0
Cathy
95 89 A 1
David
84 88 B+ 1
DATA grades;Equivalently, you could use the following INPUT statement:
INFILE 'a:\grades.txt';
INPUT name $ / quiz test project $ absences;
INPUT name $ #2 quiz test project $ absences;Since the data consist of simple characters and numbers, the following INPUT statement could also be used. Notice that there are no line pointers.
INPUT name $ quiz test project $ absences;SAS will automatically go to the next line of data to complete the set of variables listed in the INPUT statement. However, if the data are irregular (missing values, blanks in character variables, etc.), then line pointers may be necessary.
DATA grades;
INPUT name $ quiz test project $ absences @@;
DATALINES;
Ann 84 90 A- 0 Bill 78 84 B 0 Cathy 95 89 A 1
David 84 88 B+ 1
;