Lesson 2: More Ways to Input Data

In the previous lesson, the following statements were used to create a dataset.

DATA oranges;
INPUT state $ 1-10 early 12-14 late 16-18;
DATALINES;
Florida    130  90
California  37  26
Texas      1.3 .15
Arizona    .65 .85
;
This is an example of column input, in which the columns of text in which data are stored are explicitly furnished to SAS. It may be impractical to follow this example for all datasets. SAS has facilities for reading data stored in many different ways; some of these facilities are described below.

Reading data from a text file

Suppose that the data above are stored on the diskette in Drive A, in the subdirectory YIELDS, under the name ORANGES.DAT. Blank spaces, not tabs, are used to separate the values. Also, the file is a simple ASCII text file. There are no hidden codes in the file pertaining to word processors (such as margins and font sizes) or spreadsheets (such as formulas and graphs).
Projected Orange Yields in October 1997
State    Early Late
Florida    130  90
California  37  26
Texas      1.3 .15
Arizona    .65 .85
Based on information obtained from the
Florida Agricultural Statistics Service

The following lines of code could be used to read the dataset:

DATA oranges;
INFILE 'a:\yields\oranges.dat' FIRSTOBS=3 OBS=6;
INPUT state $ 1-10 early 12-14 late 16-18;
The INFILE statement replaces DATALINES. INFILE gives the location of the external text file, including the drive name and any subdirectories. The FIRSTOBS option tells SAS to skip the first two lines of the file and to begin reading data on line 3. The OBS option tells SAS that Line 6 is the last line which contains legitimate data. If the data were edited in a text file by removing the top two lines and the bottom two lines, so that the text file contained only data. Then, the INFILE statement listed below would be sufficient.
INFILE 'a:\yields\oranges.dat';

Reading data from the Internet

Newer releases of SAS offer the capability to read data from files available on the Internet. For example, the eggs dataset at Carnegie Mellon University's StatLib contains data on the yearly average number of eggs produced by female king crabs near Kodiak Island, Alaska. The following statements could be used to create a SAS dataset.
FILENAME kodiak url 'http://lib.stat.cmu.edu:80/crab/eggs';
DATA eggdata;
INFILE kodiak;
INPUT year 1-2 numeggs 4-9;
In these statements, kodiak is a nickname used by SAS to refer to the longer Internet address. The statements create a dataset called eggdata which contains two variables, year and numeggs.

SAS can also read data from FTP sites, but you must supply the appropriate information about the FTP site address, subdirectories, user names, and passwords. This example shows how to obtain the eggs dataset by anonymous FTP.

FILENAME kodiak ftp 'eggs' cd='/crab/'
user='anonymous' pass='guest' host='lib.stat.cmu.edu';
DATA eggdata;
INFILE kodiak;
INPUT year 1-2 numeggs 4-9;
Of course, it would be easy to find the eggs dataset in a Web browser, save the file as a text file on your computer, and use the methods for reading data from a text file. However, this method would be useful if the data files are very large. It is also convenient if data files are continually updated; example include stock market data, weather data, and batting averages.

Creating and reading permanent SAS datasets

In all of the previous examples, the SAS datasets that have been created were temporary. They remain in working memory and can be used throughout the SAS session, but they disappear when the SAS session ends. Permanent SAS datasets can be created; these are stored on a disk and can be recalled easily in future SAS sessions. Permanent SAS datasets are convenient to use when the amount of data is large. Also, if you have to go through several steps to create a SAS dataset, you only need to do those steps once if you create a permanent dataset.

A LIBNAME statement is needed to create a permanent dataset or to read one that has already been created. The LIBNAME is a surrogate name for the location on a disk where the permanent dataset is or will be stored. For example, consider the following statement:

LIBNAME college 'A:';
This prepares SAS to look in Drive A (your diskette) for permanent datasets. The name college is called a libref (library reference). The names that you can supply for librefs follow the same rules as dataset and variable names; however, you should not use the names LIBRARY, WORK, USER, or anything starting with the three letters SAS, since these are reserved for special uses within SAS.

For example, the following statements create a permanent SAS dataset.

LIBNAME college 'a:';
DATA original;
INPUT dept $ 1-8 count 10-13 class $ 15-21;
DATALINES;
FineArts  449 day
Science  1411 day
Music     259 evening
Language  759 day
;
DATA college.enrolled;SET original;
IF class='evening' THEN DELETE;
PROC PRINT;
RUN;
This creates the new file ENROLLED.SD2 in your diskette (a: drive). This is the new permanent SAS dataset. Only SAS will be able to interpret this file; you will not be able to see its contents by using a word processor or spreadsheet program. The two statements after data work like the following: "DATA name1; SET name2;" means to create a data set name1 from dataset name2. The "IF ... THEN DELETE;" statement does the obvious operation. We will spend more time on them in Lesson 5. Thus, the data set college.enrolled has only three obeservations with class='day'. In order to use the dataset that was just created, you must refer to it with its full name, college.enrolled . To retrieve this data set from your diskette:
LIBNAME college 'A:';
DATA tempenrl;SET college.enrolled;
PROC PRINT DATA=tempenrl;
RUN;
You will see the data college.enrolled retrieve from the diskette and printed.

Reading data from spreadsheets

Newer versions of SAS have the ability to read data directly from some spreadsheet programs, including Microsoft Excel. However, this requires additional software to be installed with the SAS system, and it does not work for all spreadsheet programs. General methods for transferring data between spreadsheets and SAS will be discussed later in the next session.


List input

So far, all of the examples have used column input. This requires the data to be arranged into neat columns, and it is especially helpful for reading character data with blanks (such as San Antonio or JC Penney) or strings of values which are written without spaces. However, it is not always necessary to use column input. For example, if your data contain only numbers with a few significant digits and character variables with eight or fewer characters and no blanks, then the variables may be listed in the input statement without specifying column numbers. Blank spaces are assumed to separate the variables in a line, and any number of blank spaces can be used. An example is shown below.
DATA grades;
INPUT student $ quiz test project $ absences;
DATALINES;
Ann 84 90 A- 0
Bill    78   84 B 0
Cathy   95 89   A 1
David 84 88 B+ 1
;
This would not work correctly with the following data:

Delimited files

With list input, blank spaces are delimiters, or special characters used to separate the values of variables in a line. SAS can also interpret other characters as delimiters. For example, suppose that the dataset in the previous example was stored in the text file A:\GRADES.TXT as follows:
Ann/84/90/A-/0
Bill/78/84/B/0
Cathy/95/89/A/1
David/84/88/B+/1
Then, the following statements could be used to create the GRADES dataset.
DATA grades;
INFILE 'a:\grades.txt' delimiter='/';
INPUT name $ quiz test project $ absences;
The phrase 'dlm=' can be used in place of 'delimiter='. This option is used if a keyboard character, such as a comma, slash, or asterisk, separates values in a line. Of course, the character used to separate variables should not appear within a data value. For example, in a comma-delimited file, the number 125,000 would have to be written as 125000; otherwise, SAS would try to break it apart into two variables with values 125 and 000.

Tabs are an exception to the use of the DELIMITER option. In a tab-delimited file, the EXPANDTABS option replaces the DELIMITER option. In the example, if tabs had been used in place of slashes, the proper statement for reading the data would be:

INFILE 'a:\grades.txt' expandtabs;

Column pointers

Suppose that the data set with student grades is stored in A:\GRADES.TXT as follows:
Ann   84 90 A- 0
Bill  78 84 B  0
Cathy 95 89 A  1
David 84 88 B+ 1
Creating a dataset from this file would be easy to read with column input and even easier with list input. However, suppose that you only needed to use the students' names and project grades. You can use column pointers to skip over undesired data. Column pointers use the @ symbol to tell SAS to begin reading data at a specified column. In this example, we need to know that the names start in Column 1; project grades, in Column 13. The following statements could be used.
DATA grades;
INFILE 'a:\grades.txt';
INPUT @1 name $ @13 project $;
The quiz grade, test grade, and absences do not appear in this dataset. As shown above, column pointers can be used to skip over unneeded data.

Mixed input

You may occasionally find it necessary or convenient to use a combination of input techniques for a particular dataset. For example, suppose that the dataset of grades appears as follows:
Ann       84 90 A- 0
Bill      78 84 B 0
Catherine 95 89 A 1
David     84 88 B+  1
Recall that list input can be used only when character variables have 8 or fewer characters, with no blanks. Catherine has 9 letters, so simple list input cannot be used. However, the absences are not neatly aligned in a column, and the last four variables would be easy to read with list input. The following statements could be used:
DATA grades;
INFILE 'a:\grades.txt';
INPUT name $ 1-9 quiz test project $ absences;
It is also possible to use list input, column input, and column pointers simultaneously. For example, if you only needed the name, project grade, and absences, you could use the following INPUT statement:
INPUT name $ 1-9 @17 project $ absences;

Line pointers

So far, all of the data for each observation have appeared in one line. You may occasionally encounter data in which the variables for one observation appear in two or more consecutive lines, as shown below:
Ann       
84 90 A- 0
Bill   
78 84 B 0
Cathy     
95 89 A 1
David     
84 88 B+ 1
You may need to use a line pointer to read such data. A line pointer is like a column pointer, except that it specifies the line on which SAS should begin reading the data. A slash (/) tells SAS to skip to the next line, and #number tell SAS to go to that line of an observation's data to resume reading data. If the data above were stored in A:\GRADES.TXT, a dataset could be created in SAS as follows:
DATA grades;
INFILE 'a:\grades.txt';
INPUT name $ / quiz test project $ absences;
Equivalently, you could use the following INPUT statement:
INPUT name $ #2 quiz test project $ absences;
Since the data consist of simple characters and numbers, the following INPUT statement could also be used. Notice that there are no line pointers.
INPUT name $ quiz test project $ absences;
SAS will automatically go to the next line of data to complete the set of variables listed in the INPUT statement. However, if the data are irregular (missing values, blanks in character variables, etc.), then line pointers may be necessary.

Multiple observations on one line

If you have many observations with only a few variables, you may want to put several observations on one line of data. In SAS, you can do this by using double-trailing at signs (@@). This command tells SAS to hold its place if it has not yet reached the end of a line when reading data and to continue reading data from that line. The dataset of grades could be entered as shown below.
DATA grades;
INPUT name $ quiz test project $ absences @@;
DATALINES;
Ann 84 90 A- 0 Bill 78 84 B 0 Cathy 95 89 A 1 
David 84 88 B+ 1
;

Homework problems for this lesson

Return to STA 5106 home page