SAS Manual: Department of Statistics Webpage (www.stat.ufl.edu) => Computing Environment => Online Software Documentation => SAS manuals (UF viewing only) => SAS Procedures Guide => The UNIVARIATE Procedure (or The MEANS Procedure) => Overview, Procedure Syntax, Examples.
Enough data manipulation! Let's do some statistics!
Univariate statistics are numerical summaries of data which describe features of the distribution. Some uses of these statistics are listed below:
The SAS procedures MEANS, SUMMARY, TABULATE, and UNIVARIATE can all calculate univariate statistics. Of these, UNIVARIATE is the most versatile and is the easiest to use.
Suppose that a vending machine which dispenses hot beverages is supposed to dispense 6 ounces of a drink into a styrofoam cup. However, there is some variation in the amount dispensed, since the machine could be affected by such factors as external temperatures, liquid temperatures, and amount of liquid remaining. The variation in amounts dispensed was studied by purchasing 50 cups of coffee from the machine and carefully measuring the amount of coffee dispensed to the nearest 0.1 oz. The SAS program below creates a dataset and applies PROC UNIVARIATE to the data, using several options.
DATA coffee;
   INPUT amount @@;
   DATALINES;
6 5.8 5.3 5.5 5.6 5.8 5.9 5.9 5.6 6.1 6 5.9 6.6 6 5.8 5.7 5.6
5.7 5.6 6.1 5.7 5.7 6 5.7 5.6 6 6.2 5.9 6.1 6.1 6 6.1 6.1 6.4
5.8 6.2 5.9 5.9 5.6 6.2 6 5.3 6.1 6 5.5 5.8 6 6.2 6 5.9
;
PROC UNIVARIATE DATA=coffee FREQ PLOT NORMAL;
   VAR amount;
RUN;
The default output from PROC UNIVARIATE is shown below.
Univariate Procedure
Variable=AMOUNT
Moments

N                50    Sum Wgts         50
Mean           5.89    Sum           294.5
Std Dev    0.259709    Variance   0.067449
Skewness   -0.03131    Kurtosis   0.415806
USS         1737.91    CSS           3.305
CV         4.409328    Std Mean   0.036728
T:Mean=0   160.3661    Pr>|T|       0.0001
Num ^= 0         50    Num > 0          50
M(Sign)          25    Pr>=|M|      0.0001
Sgn Rank      637.5    Pr>=|S|      0.0001
W:Normal   0.970376    Pr<W         0.3879
Quantiles(Def=5)

100% Max    6.6     99%    6.6
 75% Q3     6.1     95%    6.2
 50% Med    5.9     90%    6.2
 25% Q1     5.7     10%    5.6
  0% Min    5.3      5%    5.5
                     1%    5.3

Range       1.3
Q3-Q1       0.4
Mode          6
Percentiles indicate how many data values fall above and below a given number. The kth percentile is a number such that at least k% of the data are less than or equal to that number, and at least (100-k)% are greater than or equal to that number. MAX, the 100th percentile, is the largest value; MIN, the 0th percentile, is the smallest. The 50th percentile is also called the median; 25th percentile, first quartile; and 75th percentile, third quartile. The range is the difference between the maximum and minimum values, and Q3-Q1 is called the interquartile range. The mode is the data value which occurs most often in the data.
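Percentiles other than those printed in the Quantiles table can also be requested. The step below is a minimal sketch, assuming that your SAS release supports the PCTLPTS= and PCTLPRE= options of the OUTPUT statement; the dataset name pctls and the prefix p are arbitrary names chosen for this illustration.

```sas
/* Sketch: save the 10th, 50th, and 90th percentiles of AMOUNT.     */
/* pctls and the prefix p are hypothetical names for this example.  */
PROC UNIVARIATE DATA=coffee NOPRINT;
   VAR amount;
   OUTPUT OUT=pctls PCTLPTS=10 50 90 PCTLPRE=p;
RUN;

PROC PRINT DATA=pctls;
RUN;
```

SAS names each new variable by joining the prefix to the percentile (here p10, p50, and p90). The printed values should agree with the 10%, 50%, and 90% entries in the Quantiles table.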
Extremes
Lowest Obs Highest Obs
5.3( 42) 6.2( 36)
5.3( 3) 6.2( 40)
5.5( 45) 6.2( 48)
5.5( 4) 6.4( 34)
5.6( 39) 6.6( 13)
SAS lists the five lowest and five highest values. Here, OBS refers to the number of the observation with that value. By adding an ID statement after the VAR statement, you can ask SAS to identify the high and low observations with the value of another variable. For example, the following input

DATA test3;
   INPUT name $ x @@;
   CARDS;
john 1 mary 2 ted 3 smith 4 kitty 5 sam 6 linclon 7
;
PROC UNIVARIATE;
   VAR x;
   ID name;
RUN;

will produce the following output:
Extremes
Lowest ID Highest ID
1(john ) 3(ted )
2(mary ) 4(smith )
3(ted ) 5(kitty )
4(smith ) 6(sam )
5(kitty ) 7(linclon )
The plots requested with the PLOT option are shown below.
Stem Leaf                  #  Boxplot
  66 0                     1     |
  65                             |
  64 0                     1     |
  63                             |
  62 0000                  4     |
  61 0000000               7  +-----+
  60 0000000000           10  |     |
  59 0000000               7  *-----*
  58 00000                 5  |  +  |
  57 00000                 5  +-----+
  56 000000                6     |
  55 00                    2     |
  54                             |
  53 00                    2     |
     ----+----+----+----+
Multiply Stem.Leaf by 10**-1
A stem-leaf plot is shown on the left. This is similar to a histogram; it shows the frequencies of observations in ordered categories. A data value of 5.9 is represented in the plot with 59 as the stem and 0 as the leaf. The statement below the plot shows that stem.leaf, or 59.0, should be multiplied by 10**-1 = 0.1 to obtain the data value. The column headed by # shows the number of observations falling within each range.
The plot on the right is a boxplot. The box is outlined by the first and third quartiles. The line through the middle of the box represents the median; the plus mark, the mean. The vertical lines extending from the box, called whiskers, indicate the spread of the data. SAS indicates outliers, or values that lie beyond the whiskers, with 0 (moderate outlier) or with * (extreme outlier; see the manual for the definitions).
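The cutoffs behind the 0 and * symbols can be computed from the quartiles. The sketch below assumes the usual rule described in the SAS documentation for this plot: whiskers reach at most 1.5 interquartile ranges from the box, values within 3 interquartile ranges are marked 0, and values farther out are marked *. Check your manual to confirm the rule for your release. The dataset and variable names (quart, fences, lo_inner, and so on) are hypothetical.

```sas
/* Sketch: compute boxplot cutoffs from the quartiles of AMOUNT.    */
/* Assumes the 1.5 x IQR and 3 x IQR rule; all names are made up.   */
PROC UNIVARIATE DATA=coffee NOPRINT;
   VAR amount;
   OUTPUT OUT=quart Q1=q1 Q3=q3 QRANGE=iqr;
RUN;

DATA fences;
   SET quart;
   lo_inner = q1 - 1.5*iqr;   /* below this: marked 0 or *  */
   hi_inner = q3 + 1.5*iqr;   /* above this: marked 0 or *  */
   lo_outer = q1 - 3*iqr;     /* below this: marked *       */
   hi_outer = q3 + 3*iqr;     /* above this: marked *       */
RUN;

PROC PRINT DATA=fences;
RUN;
```

For the coffee data, Q1 = 5.7 and Q3 = 6.1, so the inner cutoffs work out to 5.1 and 6.7. Every observation (5.3 to 6.6) lies inside them, which is consistent with the boxplot above showing no outlier symbols.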
Normal Probability Plot
6.65+                                               *
    |                                                 ++
    |                                           *++++
    |                                         ++++
    |                                     **+**
    |                                *****
    |                          ******++
    |                       ****+++
    |                    ***+++
    |                 ***++
    |            *****+
    |         **+++
    |       +++
5.35+   *+++*
     +----+----+----+----+----+----+----+----+----+----+
         -2        -1         0        +1        +2
This plot can be used to see whether the data follow a normal distribution. The + marks represent the straight line that the data should follow if they are indeed normally distributed; the * marks indicate the pattern actually followed by the data. (Also available are BETA, EXPONENTIAL, GAMMA, LOGNORMAL, and WEIBULL; see the manual under the HISTOGRAM statement and the PROBPLOT statement.)
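As a sketch of the HISTOGRAM and PROBPLOT statements mentioned above, the step below requests a histogram with a fitted normal curve and a normal probability plot. It assumes a SAS release in which PROC UNIVARIATE supports these statements and can produce graphics; MU=EST and SIGMA=EST ask SAS to estimate the reference line from the data.

```sas
/* Sketch: graphical checks of normality for AMOUNT.                */
/* Assumes HISTOGRAM and PROBPLOT are available in your release.    */
PROC UNIVARIATE DATA=coffee;
   VAR amount;
   HISTOGRAM amount / NORMAL;
   PROBPLOT amount / NORMAL(MU=EST SIGMA=EST);
RUN;
```

The other distributions listed above take their own parameters, so consult the manual before substituting one of them for NORMAL.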
             Percents                       Percents
Value  Count  Cell    Cum     Value  Count  Cell    Cum
  5.3      2   4.0    4.0         6     10  20.0   74.0
  5.5      2   4.0    8.0       6.1      7  14.0   88.0
  5.6      6  12.0   20.0       6.2      4   8.0   96.0
  5.7      5  10.0   30.0       6.4      1   2.0   98.0
  5.8      5  10.0   40.0       6.6      1   2.0  100.0
  5.9      7  14.0   54.0
The table above was produced by the FREQ option. Each row shows a data value, the number of times that value occurred, that count expressed as a percentage of the sample, and the cumulative percentage of data values that fall at or below that value. For example, ten cups, or 20% of the sample, contained 6.0 ounces, and 74% of the cups contained no more than 6.0 ounces. Note: If you have a large dataset or a dataset with many unique values, you probably do not want to use the FREQ option, since it would produce a large, unnecessary table.
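If the frequency table is all you need, PROC FREQ produces a similar one-way table on its own. This is a minimal sketch of that alternative, not part of the PROC UNIVARIATE run above.

```sas
/* Sketch: one-way frequency table of AMOUNT with PROC FREQ. */
PROC FREQ DATA=coffee;
   TABLES amount;
RUN;
```

PROC FREQ reports the frequency, percent, cumulative frequency, and cumulative percent for each value, so its percent columns should match the Cell and Cum columns shown above.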
PROC UNIVARIATE is useful for checking whether data or statistics calculated from sample data follow normal distributions. Many commonly-used statistical procedures, such as t- and F-tests for equality of means and confidence interval formulas, implicitly assume that the underlying distribution of data is normal. Use the following criteria furnished by PROC UNIVARIATE to determine whether the data are normal.
You may also use W and the test for normality, but remember that the test depends on the sample size. Small samples will usually not be declared significantly different from normal, even if there are outlying observations. Large samples will usually be declared nonnormal, even if the data follow a bell-shaped pattern. In the example above, the amounts of coffee dispensed seem to follow a normal distribution.
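To look at the normality test by itself, the other output can be suppressed. The sketch below assumes a SAS release that has the Output Delivery System (ODS), in which PROC UNIVARIATE names this table TestsForNormality; in older releases without ODS, read W:Normal and Pr<W from the Moments table instead.

```sas
/* Sketch: print only the tests for normality.                      */
/* Assumes ODS is available; the NORMAL option requests the tests.  */
ODS SELECT TestsForNormality;
PROC UNIVARIATE DATA=coffee NORMAL;
   VAR amount;
RUN;
```

The ODS SELECT statement applies only to the next procedure step, so later steps print their full output again.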
Suppose that the vending machine also dispenses hot cocoa, and you want to see if it is dispensed in the same amounts as coffee. PROC UNIVARIATE can be used to produce side-by-side boxplots of the data, as shown below.
DATA coffee;
INPUT amount @@;
DATALINES;
6 5.8 5.3 5.5 5.6 5.8 5.9 5.9 5.6 6.1 6 5.9 6.6 6 5.8
5.7 5.6
5.7 5.6 6.1 5.7 5.7 6 5.7 5.6 6 6.2 5.9 6.1 6.1 6 6.1
6.1 6.4
5.8 6.2 5.9 5.9 5.6 6.2 6 5.3 6.1 6 5.5 5.8 6 6.2 6
5.9
;
DATA cocoa;
INPUT amount @@;
DATALINES;
5.6 5.8 6.2 6 5.8 6.4 6 6.1 5.7 5.9 6 5.5 5.5 5.8 5.9
5.8 6.1
5.7 6.4 5.9 6.3 6.2 5.8 6.2 5.7 6.2 6.2 6.3 6 5.6 5.5
5.5 5.6
6 5.8 6 6.1 6.2 5.8 5.7 6.3 6 6.3 5.6 6.3 6.3 5.8 5.6
5.6 4.9
;
DATA hotdrink;
SET coffee(IN=i) cocoa(IN=j);
IF i=1 THEN beverage='Coffee';
ELSE IF j=1 THEN beverage='Cocoa ';
LABEL amount='amount in oz';
/* Remember to sort by the group variable before
analyzing groups separately! */
PROC SORT DATA=hotdrink;
BY beverage;
PROC UNIVARIATE DATA=hotdrink PLOT;
VAR amount;
BY beverage;
RUN;
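The side-by-side boxplots are shown below. Cocoa and coffee are dispensed in about the same average amounts; both medians are 5.9 oz. Apart from one outlying cocoa value, the coffee amounts span a wider range (5.3 to 6.6 oz) than the cocoa amounts (5.5 to 6.4 oz), although the cocoa box is slightly taller (Q3-Q1 of 0.5 oz versus 0.4 oz). Cocoa had one outlying value, perhaps because the machine ran out of cocoa on the last trial.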
Variable=AMOUNT
amount in oz
6.6 +                        |
    |                        |
    |                        |
    |                        |
6.4 +            |           |
    |            |           |
    |            |           |
    |            |           |
6.2 +         +-----+        |
    |         |     |        |
    |         |     |     +-----+
    |         |     |     |     |
  6 +         |     |     |     |
    |         |     |     |     |
    |         *--+--*     *--+--*
    |         |     |     |     |
5.8 +         |     |     |     |
    |         |     |     |     |
    |         +-----+     +-----+
    |            |           |
5.6 +            |           |
    |            |           |
    |            |           |
    |                        |
5.4 +                        |
    |                        |
    |                        |
    |
5.2 +
    |
    |
    |
  5 +
    |
    |            0
    |
4.8 +
     ------------+-----------+-----------
BEVERAGE       Cocoa      Coffee
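If you only need the statistics for each group, and not the side-by-side boxplots, a CLASS statement can replace the BY statement. This is a sketch that assumes a SAS release in which PROC UNIVARIATE supports CLASS; unlike BY, CLASS does not require the data to be sorted first.

```sas
/* Sketch: separate analyses for each beverage without sorting.     */
/* Assumes the CLASS statement is supported by PROC UNIVARIATE.     */
PROC UNIVARIATE DATA=hotdrink;
   CLASS beverage;
   VAR amount;
RUN;
```

The side-by-side boxplots above come from BY-group processing with the PLOT option, so keep the PROC SORT and BY statements when you want that display.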
When analyzing your own data, you may need to calculate summary statistics, then treat those statistics as data. For example, in an agricultural field experiment, you may want to average the yields from all of the plots for each fertilizer and irrigation method for presentation on a graph. PROC UNIVARIATE can be used to make new datasets consisting of summary statistics.
When creating output datasets using PROC UNIVARIATE, the NOPRINT option is helpful; it prevents SAS from printing unnecessary output. The OUTPUT statement is used to create the new dataset. In this statement, the name of the new dataset is given with OUT= . Then, a special SAS keyword is used to define the statistic of interest, followed by =, then the name of the variable that will contain this information. More keywords and variable names can be added in the same statement.
The following example shows how to calculate and print only the sample sizes, means, and standard deviations for both coffee and cocoa. The SAS keyword for the sample size is N; arithmetic mean, MEAN; and standard deviation, STD. Other useful keywords are MIN for minimum, MAX for maximum, SUM for the total, NMISS for the number of missing observations, MEDIAN for the median, Q1 for the first quartile (=25th percentile), and Q3 for the third quartile (=75th percentile). Not everything you see in the output can be transferred to the output dataset. Other keywords and statistics are listed in the help menus and in the documentation for PROC UNIVARIATE.
PROC SORT DATA=hotdrink;
   BY beverage;
PROC UNIVARIATE NOPRINT DATA=hotdrink;
   VAR amount;
   BY beverage;
   OUTPUT OUT=results N=number MEAN=avg_amt STD=std_dev;
PROC PRINT DATA=results;
RUN;
The dataset RESULTS is shown below.
OBS    BEVERAGE    NUMBER    AVG_AMT    STD_DEV
  1    Cocoa           50       5.91    0.30656
  2    Coffee          50       5.89    0.25971
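The other keywords listed above are used in the same way. The sketch below saves a five-number summary and the count of missing values for each beverage; the dataset name results2 and the variable names on the right of each = sign are arbitrary choices for this illustration.

```sas
/* Sketch: five-number summary and missing-value count by beverage. */
/* results2 and the new variable names are hypothetical.            */
PROC UNIVARIATE NOPRINT DATA=hotdrink;
   VAR amount;
   BY beverage;
   OUTPUT OUT=results2 MIN=min_amt Q1=q1_amt MEDIAN=med_amt
          Q3=q3_amt MAX=max_amt NMISS=n_miss;
RUN;

PROC PRINT DATA=results2;
RUN;
```

As before, HOTDRINK must already be sorted by BEVERAGE for the BY statement to work.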
You may need to attach these results back to the original dataset. For example, you may want to calculate the z-score for each observation. The z-score is equal to (original observation minus the sample mean) divided by the sample standard deviation. The MERGE and BY statements can be used here.
DATA bothsets;
   MERGE hotdrink results;
   BY beverage;
   z_score=(amount-avg_amt)/std_dev;
RUN;
Remember that both datasets must be sorted in order by the BY variable before merging; here, the sorting was done in a previous step. The dataset BOTHSETS has the following variables: AMOUNT, BEVERAGE, NUMBER, AVG_AMT, STD_DEV, and the newly-calculated Z_SCORE.
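A quick way to check the merge is to summarize the new variable. Because each z-score uses the mean and standard deviation of its own beverage, the z-scores within each beverage should have a mean of 0 and a standard deviation of 1, apart from rounding. This sketch uses PROC MEANS only to print those two statistics.

```sas
/* Sketch: verify the z-scores computed in BOTHSETS.                */
/* Within each beverage, expect mean near 0 and std dev near 1.     */
PROC MEANS DATA=bothsets MEAN STD;
   VAR z_score;
   BY beverage;
RUN;
```

If the means are far from 0, the most likely cause is that one of the datasets was not sorted by BEVERAGE before the merge.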
PROC MEANS can also be used to print summary statistics and export them to new datasets. However, PROC MEANS does not provide some of the output shown above, such as the plots and the test for normality, and it adds some extraneous variables to its output datasets. For the summaries used here, PROC UNIVARIATE does everything that PROC MEANS does and more, and it is easier to use when creating output datasets.
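For comparison, here is a sketch of the same summary produced with PROC MEANS. The dataset name results_m is a hypothetical name chosen so that RESULTS is not overwritten.

```sas
/* Sketch: sample size, mean, and std dev by beverage via PROC MEANS. */
/* results_m is a hypothetical dataset name.                          */
PROC MEANS DATA=hotdrink NOPRINT;
   VAR amount;
   BY beverage;
   OUTPUT OUT=results_m N=number MEAN=avg_amt STD=std_dev;
RUN;

PROC PRINT DATA=results_m;
RUN;
```

The printed dataset contains two automatic variables, _TYPE_ and _FREQ_, in addition to the statistics you asked for. They can be removed with a DROP= dataset option, for example OUT=results_m(DROP=_TYPE_ _FREQ_).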