Assignments for Lesson 10

1. Refer to the GRADES data. Suppose that the instructor of the class wants to see if students performed at consistent levels during the semester. There would be a problem with the grading procedure if, for example, students who earned high grades at the beginning of the semester tended to have lower grades toward the end of the semester, or if students who performed well in one week performed poorly the next week. One way to evaluate this consistency numerically is to use a split-half reliability coefficient. Choose one way to divide the 13 homework grades into two groups: one with seven assignments, one with six assignments. For example, you may choose to divide the assignments into early and late assignments, odd-numbered and even-numbered assignments, or a randomly chosen group of seven and the remaining six assignments. Then, calculate the correlation between the total of the first group of assignments and the total of the second group of assignments. Typically, for a grading procedure to be considered "reliable," this correlation should be 0.7 or higher. Would you conclude that the grading policy is reliable from your calculations?
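
One possible setup for the odd/even split described above is sketched here. The variable names HW1 through HW13, the new variables ODDTOT and EVENTOT, and the data set name SPLIT are all assumptions made for illustration; substitute the names actually used in the GRADES data.

```sas
/* Sketch only: HW1-HW13 are assumed (hypothetical) names for the 13 homework grades. */
DATA SPLIT;
   SET GRADES;
   ODDTOT  = SUM(HW1, HW3, HW5, HW7, HW9, HW11, HW13);  /* seven assignments */
   EVENTOT = SUM(HW2, HW4, HW6, HW8, HW10, HW12);       /* six assignments   */
RUN;

/* Correlation between the two half-totals (the split-half coefficient). */
PROC CORR DATA=SPLIT;
   VAR ODDTOT EVENTOT;
RUN;
```

Note that the SUM function skips missing grades, whereas adding the variables with plus signs would make the whole total missing if any one grade is missing. Which behavior is appropriate depends on how missing homework is recorded.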

2. Refer to the IRIS data. Find all of the pairwise correlations among sepal length, sepal width, petal length, and petal width, using all of the observations. Now, calculate the averages of sepal length, sepal width, petal length, and petal width separately for each of the three species. You can do this within PROC UNIVARIATE by listing all four variables in the VAR statement, processing each species separately (for example, with a BY statement on data sorted by species), and then using an OUTPUT statement similar to this:

OUTPUT OUT=IRISAVGS MEAN=AVGSL AVGSW AVGPL AVGPW;

Next, find the correlations among the four averages. For example, one of these is the correlation between the average sepal length and the average sepal width, calculated from three data points: one from setosa averages, one from versicolor averages, and one from virginica averages. The first correlations are called total correlations, while the correlations between averages are called between-groups correlations. These two types of correlations are used in Fisher's linear discriminant analysis to find mathematical expressions which describe the differences among the three species.
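
The steps above might be put together as in the following sketch. The variable names SPECIES, SEPALLEN, SEPALWID, PETALLEN, and PETALWID and the data set name IRISSORT are assumptions; only the OUTPUT statement is taken from the assignment itself.

```sas
/* Total correlations, using all observations (variable names are assumed). */
PROC CORR DATA=IRIS;
   VAR SEPALLEN SEPALWID PETALLEN PETALWID;
RUN;

/* BY-group processing requires the data to be sorted by species. */
PROC SORT DATA=IRIS OUT=IRISSORT;
   BY SPECIES;
RUN;

/* One set of averages per species, written to IRISAVGS. */
PROC UNIVARIATE DATA=IRISSORT NOPRINT;
   BY SPECIES;
   VAR SEPALLEN SEPALWID PETALLEN PETALWID;
   OUTPUT OUT=IRISAVGS MEAN=AVGSL AVGSW AVGPL AVGPW;
RUN;

/* Between-groups correlations, computed from the species averages. */
PROC CORR DATA=IRISAVGS;
   VAR AVGSL AVGSW AVGPL AVGPW;
RUN;
```

With this setup IRISAVGS should contain three observations, one per species, so the p-values printed by the second PROC CORR are based on only three data points and should not be over-interpreted.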

3. Sanford Weisberg listed the following example in his book Applied Linear Regression, Second Edition (New York, Wiley, 1985). Consider the following data.



  X     Y1     Y2     Y3
 10   8.04   9.14   7.46
  8   6.95   8.14   6.77
 13   7.58   8.74  12.74
  9   8.81   8.77   7.11
 11   8.33   9.26   7.81
 14   9.96   8.10   8.84
  6   7.24   6.13   6.08
  4   4.26   3.10   5.39
 12  10.84   9.13   8.15
  7   4.82   7.26   6.42
  5   5.66   4.74   5.73

On three separate scatterplots, plot Y1, Y2, and Y3 on the vertical axis versus X on the horizontal axis. Then, calculate the linear regression line to model Y1 using X as a predictor. Do the same for Y2 using X and for Y3 using X. Write down your estimates from the regression equations in the following table:

Response variable   Intercept   Slope   R^2   Mean square error
       Y1
       Y2
       Y3

Some researchers tend to use R^2 as the sole criterion of the goodness of fit of a regression line. Based on the table above, what would you tell them about R^2? Write down your answer.
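
The data entry, plots, and regressions for this problem could be sketched as follows. The data set name WEISBERG is an arbitrary choice, and PROC PLOT is used only because it is part of base SAS; any plotting procedure would serve.

```sas
/* Read the eleven observations listed in the problem. */
DATA WEISBERG;
   INPUT X Y1 Y2 Y3;
   DATALINES;
10  8.04  9.14  7.46
 8  6.95  8.14  6.77
13  7.58  8.74 12.74
 9  8.81  8.77  7.11
11  8.33  9.26  7.81
14  9.96  8.10  8.84
 6  7.24  6.13  6.08
 4  4.26  3.10  5.39
12 10.84  9.13  8.15
 7  4.82  7.26  6.42
 5  5.66  4.74  5.73
;
RUN;

/* Three scatterplots: each response on the vertical axis, X on the horizontal. */
PROC PLOT DATA=WEISBERG;
   PLOT Y1*X Y2*X Y3*X;
RUN;

/* Three simple linear regressions, one MODEL statement per response. */
PROC REG DATA=WEISBERG;
   MODEL Y1 = X;
   MODEL Y2 = X;
   MODEL Y3 = X;
RUN;
QUIT;
```

The intercept, slope, R-square, and mean square error for each model can then be copied from the PROC REG output into the table.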

4. I was trying out a new bread recipe the other day. I spilled something on the recipe booklet, and I can't read how much flour I'm supposed to use in the recipe. I do know that I need to use 1 cup of water, 2 tablespoons of oil, 2 tablespoons of sugar, 1 1/2 teaspoons of salt, and 2 1/4 teaspoons of yeast.

Help me out. Refer to the BREAD data. Find the least-squares regression equation to predict flour amounts from water, oil, sugar, salt, and yeast, and use that equation to estimate how much flour I need in my recipe. Make sure that SAS prints the estimated amount of flour needed.
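
One common way to have SAS print a prediction is to append the new recipe to the data with the response left missing, then request predicted values. The sketch below assumes that the BREAD variables are named FLOUR, WATER, OIL, SUGAR, SALT, and YEAST and that they are recorded in the same units as the recipe above (cups, tablespoons, and teaspoons); check both assumptions against the actual data set.

```sas
/* The new recipe, with flour unknown (variable names and units are assumed). */
DATA NEWLOAF;
   FLOUR = .;
   WATER = 1;
   OIL   = 2;
   SUGAR = 2;
   SALT  = 1.5;
   YEAST = 2.25;
RUN;

/* Append the new recipe to the original data. */
DATA BREAD2;
   SET BREAD NEWLOAF;
RUN;

/* The P option prints predicted values for every observation. */
PROC REG DATA=BREAD2;
   MODEL FLOUR = WATER OIL SUGAR SALT YEAST / P;
RUN;
QUIT;
```

Because FLOUR is missing for the appended observation, it is left out of the fit, but PROC REG still computes its predicted value. That value appears as the last row of the predicted-value listing.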

5. Refer to the USEDCARS data. Calculate the regression line to predict the price of a used car based on the year in which it was manufactured. Obtain the residuals from this regression model, and use PROC UNIVARIATE to examine their distribution. In the PROC UNIVARIATE output, identify the largest five and smallest five residuals by the name of the used car dealer. Do the residuals appear to be normally distributed, as we assume when conducting t- and F-tests?
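
A minimal sketch of the residual analysis follows, assuming the variables are named PRICE, YEAR, and DEALER (hypothetical names). The output data set CARRES and the residual variable RESID are likewise chosen only for illustration.

```sas
/* Fit the regression and save the residuals (variable names are assumed). */
PROC REG DATA=USEDCARS;
   MODEL PRICE = YEAR;
   OUTPUT OUT=CARRES R=RESID;
RUN;
QUIT;

/* NORMAL requests tests of normality; PLOT requests basic distribution plots. */
/* The ID statement labels the extreme observations with the dealer name.      */
PROC UNIVARIATE DATA=CARRES NORMAL PLOT;
   VAR RESID;
   ID DEALER;
RUN;
```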

6. Refer to the HANKS data. During Tom Hanks's career, he has played both humorous and dramatic roles. Over time, has he increasingly accepted serious roles over lighter ones? To see if this is true, find the correlations of the length of the movie, the drama rating, and the humor rating with year. (The ratings are ordinal, so the correlations of year with humor and drama must not be interpreted too rigidly. Instead, they give a rough indication of positive or negative trends with time.)
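
These correlations might be requested as in the sketch below. The variable names YEAR, MOVIELEN, DRAMA, and HUMOR are assumptions. The SPEARMAN option is an addition not mentioned in the assignment; it is included only because rank correlations are one common choice for ordinal ratings.

```sas
/* Correlations of each measure with year (variable names are assumed).  */
/* PEARSON is the default; SPEARMAN adds rank-based correlations as well. */
PROC CORR DATA=HANKS PEARSON SPEARMAN;
   VAR YEAR;
   WITH MOVIELEN DRAMA HUMOR;
RUN;
```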

7. Refer to the LIMES data. Suppose that a citrus grower ships limes to be included in holiday gift baskets. To prevent bruising, the limes to be shipped are placed in plastic trays with individual compartments for each fruit. These compartments can be manufactured so that each lime is placed either on end or lengthwise in the compartment. The grower does not want the limes to roll around in the box, so the goal is to choose the tray orientation which would snugly hold more limes in place.

A statistician might solve this problem by determining whether the variance of the lime diameters is significantly larger or smaller than the variance of the lime lengths. The orientation with the lower variance would then be the better one for packing. To test whether the two variances are equal, we can use Pitman's test. Calculate the sum of the fruit diameter and the fruit length, calculate the fruit diameter minus the fruit length, and find the correlation between that sum and that difference. The covariance between the sum and the difference equals the variance of the diameters minus the variance of the lengths, so the two variances are equal exactly when this correlation is zero.

Based on the p-value for the correlation, what would you conclude? Write down your answer.
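
The calculation for Pitman's test could be sketched as follows, assuming the measurements are stored in variables named DIAMETER and FRUITLEN (hypothetical names). TOTAL, DIFF, and LIMES2 are names introduced here for illustration.

```sas
/* Sum and difference of the paired measurements (variable names are assumed). */
DATA LIMES2;
   SET LIMES;
   TOTAL = DIAMETER + FRUITLEN;
   DIFF  = DIAMETER - FRUITLEN;
RUN;

/* The p-value for this correlation tests whether the two variances are equal. */
PROC CORR DATA=LIMES2;
   VAR TOTAL DIFF;
RUN;
```

With the difference defined as diameter minus length, a positive correlation points to more variable diameters and a negative correlation to more variable lengths.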

8. Refer to the CLINTON data. Using all of the polls taken since 1993, calculate the number of days elapsed between successive polls and the number of points by which the approval rating changed in that time. For example, from January 24 to January 29, 1993, the elapsed time was 5 days and the change in approval rating was -4 points. Be sure to assign negative values to the "change" variable if the approval rating decreased and positive values if it increased. Then, find the correlation between the elapsed time in days and the change in approval rating. Also, calculate the correlation between the current approval rating and the number of days since the last poll was taken. Would you conclude that the Gallup organization has conducted its polls more frequently when President Clinton's approval ratings have been unusually high, or low, or when they have changed dramatically?
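
One way to compute the elapsed days and rating changes is with the DIF function, as sketched below. This assumes the poll date is stored as a SAS date value in a variable named POLLDATE and the approval rating in a variable named APPROVE; both names are hypothetical, as are POLLS, POLLS2, ELAPSED, and CHANGE.

```sas
/* Keep polls from 1993 onward, in date order (variable names are assumed). */
PROC SORT DATA=CLINTON OUT=POLLS;
   WHERE POLLDATE >= '01JAN1993'D;
   BY POLLDATE;
RUN;

/* DIF(x) is the current value of x minus its value on the previous observation. */
DATA POLLS2;
   SET POLLS;
   ELAPSED = DIF(POLLDATE);  /* days since the previous poll              */
   CHANGE  = DIF(APPROVE);   /* negative if approval fell, positive if it rose */
RUN;

/* Correlate elapsed time with the change and with the current rating. */
PROC CORR DATA=POLLS2;
   VAR ELAPSED;
   WITH CHANGE APPROVE;
RUN;
```

ELAPSED and CHANGE are missing for the first poll, since there is no earlier poll to compare it with, so that observation drops out of the correlations.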

