SOFTWARE SUPPLEMENT FOR CATEGORICAL DATA ANALYSIS

This supplement contains information about software for categorical data analysis and is intended to supplement the material in the second edition of Categorical Data Analysis, by Alan Agresti (Wiley, 2002).

SAS

See Appendix A of the Categorical Data Analysis text (2nd ed., 2002) for discussion. For other examples of analyses for some of the examples in this text and in my text "An Introduction to Categorical Data Analysis", see the useful site set up by the UCLA Statistical Computing Center. One procedure not discussed in the appendix of my text is SURVEYLOGISTIC, which fits binary and multiple-category logistic models by the method of pseudo maximum likelihood, incorporating the sample design into the analysis.

S-Plus

Dr. Laura Thompson has prepared an excellent, detailed manual (over 250 pages!!) on the use of S-Plus and R to conduct the analyses shown in this book. You can get a copy of this at Laura Thompson S manual for CDA. Thanks very much to Dr. Thompson for providing this very helpful resource.

Dr. Pat Altham at Cambridge also has a site that is a good source of examples for S-Plus and R.

For texts that contain examples of the use of S-Plus for various categorical data methods, see "Modern Applied Statistics With S-Plus," 3rd ed., by W. N. Venables and B. D. Ripley (Springer, 1999), "Analyzing Medical Data Using S-PLUS" by B. Everitt and S. Rabe-Hesketh (Springer, 2001), and "Regression Modeling Strategies" by F. E. Harrell (Springer, 2001).

BASIC CHI-SQUARED TESTS: S-Plus contains several functions for the basic inference methods of categorical data analysis. Examples include chisq.test(), fisher.test(), mantelhaen.test(), mcnemar.test().

GLM: The usual sorts of generalized linear models can be fitted with the glm() function, which handles most of the models in the text. It can be used for such things as logistic regression, Poisson regression, and loglinear models. Specialized functions exist for particular methods, such as the loglin() function to fit loglinear models using iterative proportional fitting.
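As a rough illustration of the glm() and loglin() calls just described, here is a minimal sketch written for R. All of the data values and object names (dose, y, n, counts, A, B) are made up for illustration and do not come from the text.

```r
# Hypothetical grouped binomial data: y successes out of n at each dose level
dose <- c(1, 2, 3, 4, 5)
n    <- c(20, 20, 20, 20, 20)
y    <- c(2, 5, 9, 14, 18)

# Logistic regression; the response is a two-column matrix (successes, failures)
fit.logit <- glm(cbind(y, n - y) ~ dose, family = binomial(link = "logit"))
summary(fit.logit)

# Hypothetical 2x2 table in "long" form, one row per cell
counts <- c(30, 10, 15, 25)
A <- factor(c("a1", "a1", "a2", "a2"))
B <- factor(c("b1", "b2", "b1", "b2"))

# Loglinear model of independence, fitted as a Poisson GLM
fit.ind <- glm(counts ~ A + B, family = poisson)
deviance(fit.ind)                       # likelihood-ratio statistic G^2

# The same model fitted by iterative proportional fitting with loglin()
tab <- matrix(counts, nrow = 2, byrow = TRUE)
fit.ipf <- loglin(tab, margin = list(1, 2), fit = TRUE, print = FALSE)
fit.ipf$lrt                             # should agree with deviance(fit.ind)
```

The two routes give the same fitted values and the same G^2 statistic. glm() additionally reports parameter estimates and standard errors, which loglin() does not provide directly.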

MULTINOMIAL MODELS: The glm() function cannot handle multinomial models, but specialized functions have been written by various users. To fit baseline-category logit models, one can use the multinom() function from the library nnet that has been provided by Venables and Ripley to do various calculations by neural nets (see, e.g., p. 230 of Venables and Ripley, 3rd ed.). To fit the proportional odds model for ordinal responses, one can use the polr() function (proportional odds logistic regression) in the MASS library that is distributed with S-Plus (based on programs in the text by Venables and Ripley; see p. 231 of their 3rd edition for polr), and the function lrm in Frank Harrell's Design S-Plus library (see also Harrell's text "Regression Modeling Strategies" mentioned above for discussion of fitting this model and the continuation-ratio logit model). See also the VGLM and VGAM packages developed by Thomas Yee at Auckland, New Zealand, which have functions that can also fit a wide variety of other models, including adjacent-categories models, continuation-ratio models, Goodman's RC association model, and bivariate logistic and probit models for bivariate binary responses.
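The following sketch shows how multinom() and polr() might be called in R for a small table of counts. The data frame dat and its values are hypothetical. Note that polr() parameterizes the model as logit P(Y <= j) = zeta_j - beta*x, so the sign of its slope is the reverse of the one in a parameterization that adds the linear predictor.

```r
library(nnet)   # multinom()
library(MASS)   # polr()

# Hypothetical counts: a three-level ordinal response by a binary group
dat <- data.frame(
  group = factor(rep(c("control", "treated"), each = 3)),
  resp  = factor(rep(c("low", "medium", "high"), times = 2),
                 levels = c("low", "medium", "high"), ordered = TRUE),
  count = c(30, 20, 10, 15, 25, 20)
)

# Baseline-category logit model; the first level ("low") is the baseline
fit.bcl <- multinom(resp ~ group, data = dat, weights = count, trace = FALSE)
coef(fit.bcl)

# Proportional odds (cumulative logit) model
fit.po <- polr(resp ~ group, data = dat, weights = count, Hess = TRUE)
summary(fit.po)
```

Supplying the cell counts through the weights argument avoids having to expand the table to one row per subject.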

GEE: The S archive at Statlib contains a function gee() for analyses using generalized estimating equations.

GLMM: Generalized linear mixed models (GLMMs) can be fitted with the penalized quasi-likelihood method using the glmmPQL() function developed by Brian Ripley for the MASS library. The function glmmNQ uses quadrature methods and the function GLMMgibbs on CRAN employs a fully Bayesian approach with Gibbs sampling.
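For instance, a random-intercept logistic model can be fitted by penalized quasi-likelihood roughly as follows in R. This sketch uses the bacteria data set that ships with MASS rather than an example from the text, and the choice of predictors is only illustrative.

```r
library(MASS)   # glmmPQL() and the bacteria data set
library(nlme)   # glmmPQL() relies on lme(); nlme also provides fixef()

# Binary response y, treatment and a week indicator as fixed effects,
# and a random intercept for each subject (ID)
fit.pql <- glmmPQL(y ~ trt + I(week > 2), random = ~ 1 | ID,
                   family = binomial, data = bacteria, verbose = FALSE)
summary(fit.pql)
fixef(fit.pql)   # estimated fixed effects
```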

LATENT CLASS: Steve Buyske at Rutgers has prepared a library for fitting latent class models with the EM algorithm.

NEG BINOMIAL: The S archive at Statlib contains a negbin() function for negative binomial regression.
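An alternative to the Statlib function is glm.nb() in the MASS library of Venables and Ripley. The sketch below, written for R, uses the quine data set from MASS with an arbitrarily chosen pair of predictors. It illustrates only the form of the call, not any analysis in the text.

```r
library(MASS)   # glm.nb() and the quine data set

# Negative binomial regression of days absent on ethnicity and sex
fit.nb <- glm.nb(Days ~ Eth + Sex, data = quine)
summary(fit.nb)
fit.nb$theta    # estimated theta, where variance = mu + mu^2 / theta
```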

HIGHER-ORDER ASYMPTOTICS: Alessandra Brazzale has prepared S functions for a variety of higher-order asymptotic analyses, including approximate conditional analysis for logistic and loglinear models.

Here are some very lightly annotated examples of S-Plus sessions I have conducted with some of the examples in this text. I have not checked these in a while, and there is no guarantee that all of them are correct!

R

R is free software maintained and regularly updated by a wide variety of volunteers. It is an open-source implementation of the S programming language, and many S-Plus functions also work in R. For instance, the discussion above about various functions for categorical data methods also applies to R. For details, see the R web site. This includes a link to manuals, such as "An Introduction to R", and to the archives in the Comprehensive R Archive Network (CRAN).

As noted above, Dr. Laura Thompson has prepared a detailed manual on the use of R or S to conduct many of the analyses in this text. You can get a copy of this excellent resource at Laura Thompson R and S manual for CDA. Thanks very much to Dr. Thompson for providing this manual.

A useful site for learning R for those already familiar with SAS or SPSS is R for SAS and SPSS users, by Robert Muenchen.

Another good source about R functions for various basic types of categorical data analyses is material prepared by Brett Presnell. This site has details (for an introductory course on this topic at the University of Florida) for many of the examples in my lower-level text "An Introduction to Categorical Data Analysis."

Some of the useful functions for categorical data analysis, explained in detail at Presnell's website, are:

dbinom() and dpois() for binomial and Poisson probabilities; e.g., dbinom(6,10,.5) for outcome 6 in 10 trials with parameter .5.

prop.test() for a test and score CI for a binomial proportion; e.g., prop.test(6,10,p=.5), but note that the default uses a continuity correction; this can be turned off with correct=FALSE.

chisq.test() for chi-squared test

fisher.test() for Fisher's exact test

mantelhaen.test() for the Cochran-Mantel-Haenszel test

glm() for generalized linear models

mcnemar.test() for matched pairs
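A short session using the functions listed above might look like the following. The tables tab, matched, and strata contain made-up counts chosen only to show the form of each call.

```r
# Binomial probability of 6 successes in 10 trials with parameter 0.5
dbinom(6, 10, 0.5)                      # 210/1024 = 0.205078125

# Score test and score CI for a proportion, without continuity correction
prop.test(6, 10, p = 0.5, correct = FALSE)

# Hypothetical 2x2 table (rows = group, columns = outcome)
tab <- matrix(c(12, 5,
                 8, 15), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))
chisq.test(tab, correct = FALSE)        # Pearson chi-squared test
fisher.test(tab)                        # Fisher's exact test

# Hypothetical matched-pairs table (rows = first response, columns = second)
matched <- matrix(c(40, 12,
                     5, 43), nrow = 2, byrow = TRUE)
mcnemar.test(matched, correct = FALSE)  # McNemar test

# Hypothetical 2x2x3 array: a 2x2 table in each of three strata
strata <- array(c(12, 8, 5, 15,
                  10, 9, 6, 14,
                  11, 7, 4, 13), dim = c(2, 2, 3))
mantelhaen.test(strata)                 # Cochran-Mantel-Haenszel test
```

Like prop.test(), the functions chisq.test() (for 2x2 tables), mcnemar.test(), and mantelhaen.test() apply a continuity correction by default, which correct=FALSE turns off.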

MULTINOMIAL MODELS: See also Thomas Yee at Auckland, New Zealand for the VGLM and VGAM packages for ordinal and other multinomial models (see the description above under S-Plus), and Yudi Pawitan at the Karolinska Institute, Sweden. Also, see the Presnell website mentioned above for an example of fitting a model for baseline-category logits. See volume 14, issue 3, of Journal of Statistical Software for an R package by K. Imai and D. A. van Dyk for Bayesian fitting of the cumulative probit model. The GNM add-on package for R, developed by David Firth and Heather Turner at the Univ. of Warwick, can fit multiplicative models such as Goodman's RC model for two-way contingency tables and Anderson's stereotype model for ordinal multinomial responses.

GLMM: Functions have been developed for specialized methods such as GLMM by various people. For instance, Brian Ripley's function glmmPQL uses penalized quasi-likelihood for fitting GLMMs, and apparently an MCMC approach for Bayesian fitting of such models has been prepared by Myles and Clayton (GlmmGibbs).

GENERALIZED LOGLINEAR MODELS: As mentioned in the text appendix, Joseph Lang at the University of Iowa (e-mail jblang@stat.uiowa.edu) has powerful R functions for fitting generalized loglinear models and mean response models by ML. The former class includes many of the standard marginal models of interest for repeated measurement. At his home page, he currently has these available in an R function, mph.fit, that can fit these and other "multinomial-Poisson homogeneous models for contingency tables" described in a 2004 paper by Lang in the Annals of Statistics (vol. 32, pp. 340-383).

BRADLEY-TERRY: Prof. David Firth at the University of Warwick has prepared an R package (BradleyTerry), available at CRAN, that is designed to fit the Bradley-Terry model and versions of it whereby ability scores are described by a linear predictor (see also volume 12, issue 1 of Journal of Statistical Software). For an overview of what this package can do, you can also go to Dr. Firth's web site.

ITEM RESPONSE THEORY MODELS: Dimitris Rizopoulos from Leuven, Belgium has prepared a package `ltm' for Item Response Theory analyses. This package can fit the Rasch model, the two-parameter logistic model, Birnbaum's three-parameter model, the latent trait model with up to two latent variables, and Samejima's graded response model. See the Rwiki page of `ltm'.

MULTIPLICATIVE MODELS: The GNM add-on package for R, developed by David Firth and Heather Turner at the Univ. of Warwick, can fit multiplicative models such as Goodman's RC model for two-way contingency tables and Anderson's stereotype model for ordinal multinomial responses.

CORRESPONDENCE ANALYSIS: The site Correspondence Analysis with R accompanies the 2005 book by Fionn Murtagh on this topic.

CONFIDENCE INTERVALS FOR ASSOCIATION PARAMETERS IN 2x2 AND 2xc TABLES: For a binomial proportion and for parameters comparing two binomial proportions such as the difference of proportions, relative risk, and odds ratio, a good general-purpose method for constructing confidence intervals is to invert the score test. Such intervals are not available in the standard software packages. Here are R functions for confidence intervals for a proportion, R functions for confidence intervals comparing two proportions with independent samples, and R functions for confidence intervals comparing two proportions with dependent samples. These sites also contain R functions for some "exact" small-sample intervals that guarantee at least the nominal coverage probability (such as the Clopper-Pearson and Blaker confidence intervals for a proportion) and adjustments of the Wald interval. Most of these were written by my former graduate student, Yongyi Min. The confidence intervals for a proportion include the mid-P adaptation of the Clopper-Pearson interval (written by Anna Gottard, Univ. of Firenze).
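To make the idea of inverting the score test concrete for the simplest case, a single binomial proportion, here is a minimal R sketch. The function name score.ci is hypothetical and this is not one of the functions linked above. It solves |phat - p| / sqrt(p(1 - p)/n) = z for p, which gives the Wilson score interval in closed form.

```r
# Score (Wilson) confidence interval for a binomial proportion,
# obtained by inverting the score test of H0: p = p0
score.ci <- function(y, n, conf.level = 0.95) {
  z      <- qnorm(1 - (1 - conf.level) / 2)
  phat   <- y / n
  center <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = center - half, upper = center + half)
}

score.ci(6, 10)

# The same interval from prop.test() with the continuity correction off
prop.test(6, 10, correct = FALSE)$conf.int

# For comparison, the "exact" Clopper-Pearson interval
binom.test(6, 10)$conf.int
```

For the difference of proportions, relative risk, and odds ratio, inverting the score test has no such closed form in general and requires iterative computation.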

Yongyi Min has also prepared some R functions for Bayesian confidence intervals for 2x2 tables using independent beta priors for two binomial parameters, for the difference of proportions, odds ratio, and relative risk. (These are evaluated and compared to score confidence intervals in a 2005 article in the journal Biometrics by Agresti and Min.)
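The general idea can be sketched by simulation in R. With independent Beta(a, b) priors, the posterior for each binomial parameter is Beta(y + a, n - y + b), so posterior draws of the two parameters yield draws of any function of them. The function name bayes.ci.2x2, the default Jeffreys-type prior (a = b = 0.5), and the use of equal-tail intervals are all assumptions made for this illustration. This is not the code referred to above, which may construct the intervals differently.

```r
# Simulation-based equal-tail posterior intervals for comparing two
# binomial parameters under independent Beta(a, b) priors
bayes.ci.2x2 <- function(y1, n1, y2, n2, a = 0.5, b = 0.5,
                         conf.level = 0.95, nsim = 100000) {
  p1 <- rbeta(nsim, y1 + a, n1 - y1 + b)   # posterior draws for group 1
  p2 <- rbeta(nsim, y2 + a, n2 - y2 + b)   # posterior draws for group 2
  probs <- c((1 - conf.level) / 2, 1 - (1 - conf.level) / 2)
  rbind(difference    = quantile(p1 - p2, probs),
        relative.risk = quantile(p1 / p2, probs),
        odds.ratio    = quantile((p1 / (1 - p1)) / (p2 / (1 - p2)), probs))
}

set.seed(1)                       # for reproducibility of the simulation
bayes.ci.2x2(15, 20, 8, 20)       # hypothetical data: 15/20 versus 8/20
```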

Euijung Ryu (a former PhD student of mine who is now at Mayo Clinic) has prepared R functions for various confidence intervals for the ordinal measure [P(Y1 > Y2) + (1/2)P(Y1 = Y2)] that is useful for comparing two multinomial distributions on an ordinal scale. Here is a pdf file of CIs for ordinal effect measure, including simple methods as well as score and profile likelihood intervals (which require using Joe Lang's mph.fit function). Euijung has also prepared R functions for multiple comparisons of proportions with independent samples using simultaneous confidence intervals for the difference of proportions or the odds ratio, based on the studentized-range inversion of score tests proposed by Agresti, Bini, Bertaccini, and Ryu in the journal Biometrics, 2008.
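The sample version of this ordinal measure is simple to compute from the two vectors of category counts. The R sketch below gives only the point estimate, not any of the confidence intervals mentioned above, and the function name ordinal.measure and the example counts are hypothetical.

```r
# Estimate of P(Y1 > Y2) + (1/2) P(Y1 = Y2) for two independent multinomial
# samples on the same ordered categories (counts listed from lowest to highest)
ordinal.measure <- function(n1, n2) {
  p1 <- n1 / sum(n1)                  # sample proportions, group 1
  p2 <- n2 / sum(n2)                  # sample proportions, group 2
  joint <- outer(p1, p2)              # joint[i, j] = p1[i] * p2[j]
  # cells below the diagonal have i > j, i.e. Y1 in a higher category than Y2
  sum(joint[lower.tri(joint)]) + 0.5 * sum(diag(joint))
}

ordinal.measure(c(10, 20, 30), c(25, 20, 15))   # hypothetical counts
```

The measure equals 0.5 when the two sample distributions are identical, and it moves toward 1 as the first group tends to fall in higher categories than the second.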

Please quote this site if you use one of these R functions for confidence intervals for association parameters. We believe these functions are dependable, but no guarantees or support are available, so use them at your own risk.

Stata

A listing of the extensive selection of categorical data methods available in Stata is given in Table 3 of the article by R. A. Oster in the August 2002 issue of The American Statistician (pp. 243-244); the main focus of that article is on methods for small-sample exact analysis. For information about Stata (including its use for complex methods such as generalized linear mixed models and GEE), see "A Handbook of Statistical Analyses Using Stata," 3rd ed., by S. Rabe-Hesketh and B. Everitt, CRC Press, 2003. For examples of categorical data analyses for many data sets in the first edition of my text "An Introduction to Categorical Data Analysis", see the useful site set up by the UCLA Statistical Computing Center. For a pdf file on using GEE in Stata, see GEE pdf file.

SPSS (version 15)

CONTINGENCY TABLES:

The DESCRIPTIVE STATISTICS option on the ANALYZE menu has a suboption called CROSSTABS, which provides several methods for contingency tables. After identifying the row and column variables in CROSSTABS, clicking on STATISTICS provides a wide variety of options, including the chi-squared test and measures of association. The output lists the Pearson statistic, its degrees of freedom, and its P-value (labeled Asymp. Sig.). If any expected frequencies in a 2x2 table are less than 5, the output also includes the result of Fisher's exact test. It can also be requested by clicking on Exact in the CROSSTABS dialog box and selecting the exact test. SPSS also has an advanced module for small-sample inference (called SPSS Exact Tests) that provides exact P-values for various tests in CROSSTABS and NPAR TESTS procedures. For instance, the Exact Tests module provides exact tests of independence for r x c contingency tables with nominal or ordinal classifications. See the publication "SPSS Exact Tests 15.0 for Windows."

In CROSSTABS, clicking on CELLS provides options for displaying observed and expected frequencies, as well as the standardized residuals, labeled as "Adjusted standardized". Clicking on STATISTICS in CROSSTABS provides options of a wide variety of statistics other than chi-squared, including gamma and Kendall's tau-b. The output shows the measures and their standard errors (labeled Asymp. Std. Error), which you can use to construct confidence intervals. It also provides a test statistic for testing that the true measure equals zero, which is the ratio of the estimate to its standard error. This test uses a simpler standard error that only applies under independence and is inappropriate for confidence intervals. One option in the list of statistics, labeled Risk, provides as output the odds ratio and its confidence interval.

Suppose you enter the data as cell counts for the various combinations of the two variables, rather than as responses on the two variables for individual subjects; for instance, perhaps you call COUNT the variable that contains these counts. Then, select the WEIGHT CASES option on the DATA menu in the Data Editor window, and instruct SPSS to weight cases by COUNT.

GLMs and LOGISTIC REGRESSION:

To fit generalized linear models, on the ANALYZE menu select the GENERALIZED LINEAR MODELS option and the GENERALIZED LINEAR MODELS suboption. Select the Dependent Variable and then the Distribution and Link Function. Click on the Predictors tab at the top of the dialog box and then enter quantitative variables as Covariates and categorical variables as Factors. Click on the Model tab at the top of the dialog box and enter these variables as main effects, and construct any interactions that you want in the model. Click on OK to run the model.

To fit logistic regression models, on the ANALYZE menu select the REGRESSION option and the BINARY LOGISTIC suboption. In the LOGISTIC REGRESSION dialog box, identify the binary response (dependent) variable and the explanatory predictors (covariates). Highlight variables in the source list and click on a*b to create an interaction term. Identify the explanatory variables that are categorical and for which you want dummy variables by clicking on Categorical and declaring such a covariate to be a Categorical Covariate in the LOGISTIC REGRESSION: DEFINE CATEGORICAL VARIABLES dialog box. Highlight the categorical covariate and under Change Contrast you will see several options for setting up dummy variables. The Simple contrast constructs them as in this text, in which the final category is the baseline.

In the LOGISTIC REGRESSION dialog box, click on Method for stepwise model selection procedures, such as backward elimination. Click on Save to save predicted probabilities, measures of influence such as leverage values and DFBETAS, and standardized residuals. Click on Options to open a dialog box that contains an option to construct confidence intervals for exponentiated parameters.

Another way to fit logistic regression models is with the GENERALIZED LINEAR MODELS option and suboption on the ANALYZE menu. You pick the binomial distribution and logit link function. It is also possible there to enter the data as the number of successes out of a certain number of trials, which is useful when the data are in contingency table form. One can also fit such models using the LOGLINEAR option with the LOGIT suboption in the ANALYZE menu. One identifies the dependent variable, selects categorical predictors as factors, and selects quantitative predictors as cell covariates. The default fit is the saturated model for the factors, without including any covariates. To change this, click on Model and select a Custom model, entering the predictors and relevant interactions as terms in a customized (unsaturated) model. Clicking on Options, one can also display standardized residuals (called adjusted residuals) for model fits. This approach is well suited for logit models with categorical predictors, since standard output includes observed and expected frequencies. When the data file contains the data as cell counts, such as binomial numbers of successes and failures, one weights each cell by the cell count using the WEIGHT CASES option in the DATA menu.

MULTINOMIAL RESPONSES and LOGLINEAR MODELS:

SPSS can also fit logistic models for categorical response variables having several response categories. On the ANALYZE menu, choose the REGRESSION option and then the ORDINAL suboption for a cumulative logit model. Select the MULTINOMIAL LOGISTIC suboption for a baseline-category logit model. In the latter, click on Statistics and check Likelihood-ratio tests under Parameters to obtain results of likelihood-ratio tests for the effects of the predictors.

For loglinear models, one uses the LOGLINEAR option with GENERAL suboption in the ANALYZE menu. One enters the factors for the model. The default is the saturated model, so click on Model and select a Custom model. Enter the factors as terms in a customized (unsaturated) model and then select additional interaction effects. Click on Options to show options for displaying observed and expected frequencies and adjusted residuals. When the data file contains the data as cell counts for the various combinations of factors rather than as responses listed for individual subjects, weight each cell by the cell count using the WEIGHT CASES option in the DATA menu.

GLIM

See the first edition of "Categorical Data Analysis" (1990) for several GLIM examples, as well as the 2005 text by Aitkin, Francis, and Hinde on "Statistical Modeling in GLIM4" (Oxford) and Jim Lindsey's 1989 text on "The Analysis of Categorical Data Using GLIM" (Springer-Verlag). See Statlib for an archive of GLIM macros. Also, Rory Wolfe has prepared macros for cumulative link models.

StatXact and LogXact

StatXact (Cytel Software, Cambridge MA) provides exact analysis for categorical data methods and some nonparametric methods. Among its procedures are small-sample confidence intervals for differences and ratios of proportions and for odds ratios, and Fisher's exact test and its generalizations for IxJ tables. It also can conduct exact tests of conditional independence and of equality of odds ratios in 2x2xK tables, and exact confidence intervals for the common odds ratio in several 2x2 tables. StatXact uses Monte Carlo methods to approximate exact P-values and confidence intervals when a data set is too large for exact inference to be computationally feasible. Its companion LogXact performs exact conditional logistic regression. The President of Cytel Software is Dr. Cyrus Mehta, who has been one of the most active researchers in the past 20 years in advancing the development of algorithms for conducting small-sample inference for categorical data. For a brief survey of the capability of these packages, see the article by R. A. Oster in the August 2002 issue of The American Statistician (pp. 243-244).

Others

SUDAAN provides analyses for categorical and continuous data from stratified multi-stage cluster designs. It has a facility (the MULTILOG procedure) for GEE analyses of marginal models for nominal and ordinal responses. See SUDAAN GEE.

Robert Newcombe at the University of Wales in Cardiff provides an Excel spreadsheet for forming various confidence intervals for a proportion and for comparing two proportions with independent or with matched samples. His website also has SPSS and Minitab macros for doing this.

Berger and Boos at North Carolina State University provide the Berger-Boos test and other small-sample unconditional tests for 2-by-2 tables.

Latent Gold is the website for the Latent Gold program (marketed by Statistical Innovations of Belmont, MA) for fitting a wide variety of latent class models.

Pesarin and Salmaso give a variety of permutation analyses for categorical and continuous variables, including some multivariate analyses, using a SAS macro constructed by Luigi Salmaso at the University of Padova.

For a survey of software for implementing the GEE method, see the article by Horton and Lipsitz in The American Statistician, 1999, vol. 53, pp. 160-169.


Copyright © 2008, Alan Agresti, Department of Statistics, University of Florida.