SOFTWARE SUPPLEMENT FOR CATEGORICAL DATA ANALYSIS
This supplement contains information about software for categorical
data analysis and is intended to supplement the material in the second
edition of Categorical Data Analysis, by Alan Agresti (Wiley, 2002).
SAS
See Appendix A of the Categorical Data Analysis text (2nd ed., 2002)
for discussion, and go to the examples for illustrations of SAS
for data sets in this text. See also the references to SAS
publications in that Appendix.
For further examples of various analyses of data sets in this text
and in my text "An Introduction to Categorical Data Analysis", see
the useful site set up by the UCLA Statistical Computing Center. One
procedure not discussed in the appendix of my text is SURVEYLOGISTIC,
for fitting binary and multiple-category logistic models by the
method of pseudo maximum likelihood, incorporating the sample design
into the analysis.
S-Plus
Dr. Laura Thompson has prepared an excellent, detailed manual (over
250 pages!!) on the use of S-Plus and R to conduct the analyses shown
in this book. You can get a copy of this at Laura Thompson S
manual for CDA. Thanks very much to Dr. Thompson for providing
this very helpful resource.
Dr. Pat Altham at Cambridge also has a site that is a good source of
examples for S-Plus and R.
For texts that contain examples of the use of S-Plus for various
categorical data methods, see "Modern Applied Statistics With S-Plus,"
3rd ed., by W. N. Venables and B. D. Ripley (Springer, 1999),
"Analyzing Medical Data Using S-PLUS" by B. Everitt and
S. Rabe-Hesketh (Springer, 2001), and "Regression Modeling Strategies"
by F. E. Harrell (Springer, 2001).
BASIC CHI-SQUARED TESTS: S-Plus contains several functions for the
basic inference methods of categorical data analysis. Examples
include chisq.test(), fisher.test(), mantelhaen.test(),
mcnemar.test().
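As a minimal sketch of how these functions are called (the same calls should work in R), the following uses small made-up tables; the counts and object names are assumptions for illustration and are not data from the text.

```r
# Hypothetical 2x3 table of counts (group by response category)
tab <- matrix(c(30, 10, 20,
                20, 15, 25), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("g1", "g2"),
                              response = c("a", "b", "c")))
pearson <- chisq.test(tab)     # Pearson chi-squared test of independence
pearson$statistic              # X^2
pearson$parameter              # df = (2 - 1) * (3 - 1) = 2
pearson$p.value

# Fisher's exact test for a hypothetical 2x2 table
tab22 <- matrix(c(3, 1, 1, 3), nrow = 2)
fisher <- fisher.test(tab22)

# McNemar's test for a hypothetical 2x2 table of matched pairs
# (off-diagonal counts are 12 and 5), without continuity correction
matched <- matrix(c(40, 12, 5, 43), nrow = 2)
mcn <- mcnemar.test(matched, correct = FALSE)

# Cochran-Mantel-Haenszel test for a hypothetical 2x2x2 array
strata <- array(c(10, 5, 4, 12,
                  8, 6, 3, 9), dim = c(2, 2, 2))
cmh <- mantelhaen.test(strata)
```

Each call returns an object whose components (statistic, parameter, p.value) can be extracted as shown for the Pearson test.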
GLM: The usual sorts of generalized linear models can be fitted with
the glm() function. That function handles most of the models in the
text. It can be used for such things as logistic regression, Poisson
regression, and loglinear models. Specialized functions exist for
particular methods, such as the loglin() function to fit loglinear
models using iterative proportional fitting.
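The following sketch, written for R with made-up counts, fits the independence loglinear model two ways (glm() with a Poisson family, and loglin()) and then a logistic regression for grouped binomial data. The variable names and data are assumptions for illustration only.

```r
# Hypothetical 2x3 table in "long" form, one row per cell
counts <- data.frame(
  A = gl(2, 3, labels = c("a1", "a2")),
  B = gl(3, 1, 6, labels = c("b1", "b2", "b3")),
  n = c(30, 10, 20, 20, 15, 25)
)

# Loglinear model of independence, fitted as a Poisson GLM
fit_glm <- glm(n ~ A + B, family = poisson, data = counts)
deviance(fit_glm)          # likelihood-ratio statistic G^2
df.residual(fit_glm)       # residual df = 2

# The same model fitted to the table by loglin()
tab <- xtabs(n ~ A + B, data = counts)
fit_ipf <- loglin(tab, margin = list(1, 2), fit = TRUE, print = FALSE)
fit_ipf$lrt                # should agree with deviance(fit_glm)
fit_ipf$fit                # fitted (expected) frequencies

# Logistic regression for hypothetical grouped binomial data:
# y successes out of m trials at each dose level
dose <- c(1, 2, 3, 4)
y <- c(2, 5, 9, 14)
m <- rep(20, 4)
fit_logit <- glm(cbind(y, m - y) ~ dose, family = binomial)
coef(fit_logit)
```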
MULTINOMIAL MODELS: The glm function cannot handle multinomial models,
but specialized functions have been written by various users. To fit
baseline-category logit models, one can use the multinom() function
from the library nnet that has been provided by Venables and Ripley to
do various calculations by neural nets (see, e.g., p. 230 of Venables
and Ripley, 3rd ed.). To fit the proportional odds model for ordinal
responses, one can use the polr() function (proportional odds logistic
regression) in the MASS library that is distributed with S-Plus (based
on programs in the text by Venables and Ripley; see p. 231 of their
3rd edition for polr), and the function lrm in Frank Harrell's Design
S-Plus library (see also Harrell's text "Regression Modeling
Strategies" mentioned above for discussion of fitting this model and
the continuation-ratio logit model). See also the VGLM and VGAM
packages developed by Thomas Yee at Auckland,
New Zealand, which have functions that can also fit a wide variety
of other models, including adjacent-categories models,
continuation-ratio models, Goodman's RC association model, and
bivariate logistic and probit models for bivariate binary responses.
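A minimal sketch of the two functions named above, written for R and assuming the nnet and MASS libraries are installed. It uses the housing data frame that ships with MASS (not a data set from this text), in which Sat is a three-category ordered satisfaction response and Freq holds cell counts.

```r
library(nnet)   # multinom()
library(MASS)   # polr() and the housing data

# Baseline-category logit model; trace = FALSE suppresses the
# optimizer's iteration log
fit_bcl <- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                    data = housing, trace = FALSE)
coef(fit_bcl)   # one row of coefficients per non-baseline category

# Proportional odds (cumulative logit) model; Hess = TRUE keeps the
# Hessian so that summary() can report standard errors
fit_po <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
               data = housing, Hess = TRUE)
coef(fit_po)    # common effects across the cumulative logits
fit_po$zeta     # estimated cutpoints (intercepts)
```

Note that polr() parameterizes the model as logit[P(Y <= j)] = zeta_j - eta, so the signs of its effect estimates may be reversed relative to software that adds the linear predictor.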
GEE: The S archive at Statlib
contains a function gee() for analyses using generalized estimating
equations.
GLMM: Generalized linear mixed models (GLMMs) can be fitted with the
penalized quasi-likelihood method using the glmmPQL() function
developed by Brian Ripley for the MASS library. The function glmmNQ
uses quadrature methods, and the function GLMMGibbs on CRAN employs a
fully Bayesian approach with Gibbs sampling.
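As a rough illustration of the glmmPQL() call in R, the sketch below fits a random-intercept logistic model to the bacteria data frame supplied with MASS (an assumption for illustration; it is not a data set from this text). It assumes the MASS and nlme libraries are installed.

```r
library(MASS)   # glmmPQL() and the bacteria data

# Logistic-normal model with a random intercept for each subject (ID),
# fitted by penalized quasi-likelihood
fit_pql <- glmmPQL(y ~ trt + I(week > 2), random = ~ 1 | ID,
                   family = binomial, data = bacteria,
                   verbose = FALSE)

nlme::fixef(fit_pql)   # estimated fixed effects
summary(fit_pql)       # includes the random-intercept standard deviation
```

PQL can be biased for binary data with large variance components, so results are worth comparing with a quadrature-based fit when one is available.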
LATENT CLASS: Steve Buyske at Rutgers has prepared a library for
fitting latent class
models with the EM algorithm.
NEG BINOMIAL: The S archive
at Statlib contains a negbin() function for negative binomial
regression.
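The Statlib negbin() function is not reproduced here. As a substitute sketch, the MASS library's glm.nb() function also fits negative binomial regression; the example below applies it in R to the quine data frame supplied with MASS (an assumption for illustration, not a data set from this text).

```r
library(MASS)   # glm.nb() and the quine data

# Poisson fit for comparison
fit_pois <- glm(Days ~ Eth + Sex, family = poisson, data = quine)

# Negative binomial fit; the dispersion parameter theta is estimated
fit_nb <- glm.nb(Days ~ Eth + Sex, data = quine)

fit_nb$theta                # estimated dispersion parameter
AIC(fit_pois); AIC(fit_nb)  # the counts are overdispersed relative to Poisson
```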
HIGHER-ORDER ASYMPTOTICS: Alessandra Brazzale has prepared S functions
for a variety of higher-order
asymptotic analyses, including approximate conditional analysis
for logistic and loglinear models.
Here are some very lightly annotated examples of S-Plus sessions I have
conducted with some of the examples in this text. I have not checked
these in a while, and there is no guarantee that all of them are
correct!
- Chi-squared and loglinear analyses of 2x3 table on gender and
political party affiliation (Table 3.11)
- Loglinear and logit models for 2x2x2 table of death
penalty (Table 2.6)
- Linear-by-linear association model and row effects model and
column effects model for 4x4 table of opinions about birth control
and premarital sex (Table 9.3)
R
R is free software maintained and regularly updated by a wide variety
of volunteers. It is an open-source implementation of the S programming
language, and many S-Plus functions also work in R. For instance, the
discussion above about various functions for categorical data methods
also applies to R. For details, see the R web site. This includes a
link to manuals, such as "An Introduction to R", and to the archives
in the Comprehensive R Archive Network (CRAN).
As noted above, Dr. Laura Thompson has prepared a detailed manual on
the use of R or S to conduct many of the analyses in this text. You
can get a copy of this excellent resource at Laura Thompson R
and S manual for CDA. Thanks very much to Dr. Thompson for
providing this manual.
A useful site for learning R for those already familiar with SAS or
SPSS is R for
SAS and SPSS users, by Robert Muenchen.
Another good source about R functions for various basic types of
categorical data analyses is material prepared by Brett
Presnell. This site has details (for an introductory course on
this topic at the University of Florida) for many of the examples in
my lower-level text "An Introduction to Categorical Data Analysis."
Some of the useful functions for categorical data analysis, explained
in detail at Presnell's website, are:
dbinom() and dpois() for binomial and Poisson probabilities; e.g.,
dbinom(6,10,.5) for outcome 6 in 10 trials with parameter .5.
prop.test() for a test and score CI for a binomial proportion; e.g.,
prop.test(6,10,p=.5), but note that the default uses a continuity
correction; this can be turned off with correct=FALSE.
chisq.test() for chi-squared test
fisher.test() for Fisher's exact test
mantelhaen.test() for the Cochran-Mantel-Haenszel test
glm() for generalized linear models
mcnemar.test() for matched pairs
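The first two items in the list above can be tried directly. This short sketch uses the same numbers as in the text (6 successes in 10 trials, null value .5); the Poisson mean of 1.5 and the object names are assumptions for illustration.

```r
# Binomial probability of outcome 6 in 10 trials with parameter .5
p6 <- dbinom(6, 10, 0.5)

# Poisson probability of outcome 2 when the mean is 1.5 (hypothetical)
dpois(2, lambda = 1.5)

# Score test and score confidence interval, no continuity correction
score <- prop.test(6, 10, p = 0.5, correct = FALSE)
score$statistic   # (6 - 5)^2 / (10 * .5 * .5)
score$conf.int

# Default call, which applies the continuity correction
corrected <- prop.test(6, 10, p = 0.5)
corrected$statistic
```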
MULTINOMIAL MODELS: See also Thomas Yee at Auckland,
New Zealand for VGLM and VGAM packages for ordinal and other multinomial
models (see above description under S-Plus), and Yudi Pawitan at the
Karolinska Institute, Sweden. Also, see the Presnell website
mentioned above for an example of fitting a model for
baseline-category logits. See volume 14, issue 3, of Journal of Statistical Software
for an R package by K. Imai and D. A. van Dyk for Bayesian fitting of
the cumulative probit model. The GNM
add-on package for R, developed by David Firth and Heather Turner at
the Univ. of Warwick, can fit multiplicative models such as Goodman's
RC model for two-way contingency tables and Anderson's stereotype
model for ordinal multinomial responses.
GLMM: Various people have developed functions for specialized methods
such as GLMMs. For instance, Brian Ripley's function glmmPQL
uses penalized quasi-likelihood for fitting GLMMs, and apparently an
MCMC approach for Bayesian fitting of such models has been prepared by
Myles and Clayton (GLMMGibbs).
GENERALIZED LOGLINEAR MODELS: As mentioned in the text appendix, Joseph Lang at the
University of Iowa (e-mail jblang@stat.uiowa.edu) has powerful R
functions for fitting generalized loglinear models and mean response
models by ML. The former class includes many of the standard marginal
models of interest for repeated measurement. At his home page, he
currently has these available in an R function, mph.fit, that can fit
these and other "multinomial-Poisson homogeneous models for
contingency tables" described in a 2004 paper by Lang in the
Annals of Statistics (vol. 32, pp. 340-383).
BRADLEY-TERRY: Prof. David Firth at the University of Warwick has
prepared an R package (BradleyTerry) available at CRAN that is designed to fit
the Bradley-Terry model and versions of it whereby ability scores are
described by a linear predictor (see also volume 12, issue 01 at Journal of Statistical
Software). For an overview of what this package can do, you can
also go to
Dr. Firth's web site.
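The basic Bradley-Terry model is a logit model, so it can also be fitted with glm(). The sketch below does not use the BradleyTerry package; it is a minimal illustration with made-up win counts for three hypothetical players, taking player C as the reference with ability 0.

```r
# Hypothetical results: win1 = times p1 beat p2, win2 = times p2 beat p1
games <- data.frame(
  p1 = c("A", "A", "B"),
  p2 = c("B", "C", "C"),
  win1 = c(7, 8, 6),
  win2 = c(3, 2, 4)
)
players <- c("A", "B", "C")

# Design matrix: +1 in the column of the first player, -1 for the second
X <- matrix(0, nrow(games), length(players),
            dimnames = list(NULL, players))
X[cbind(seq_len(nrow(games)), match(games$p1, players))] <- 1
X[cbind(seq_len(nrow(games)), match(games$p2, players))] <- -1
X <- X[, -length(players), drop = FALSE]   # drop reference player C

# logit P(p1 beats p2) = ability[p1] - ability[p2], no intercept
fit_bt <- glm(cbind(win1, win2) ~ X - 1, family = binomial, data = games)

ability <- c(coef(fit_bt), 0)
names(ability) <- players
ability

# Estimated probability that A beats B
plogis(ability[["A"]] - ability[["B"]])
```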
ITEM RESPONSE THEORY MODELS: Dimitris Rizopoulos from Leuven, Belgium
has prepared a package `ltm' for Item Response Theory analyses. This
package can fit the Rasch model, the two-parameter logistic model,
Birnbaum's three-parameter model, the latent trait model with up to
two latent variables, and Samejima's graded response model. See the
Rwiki page for `ltm'.
MULTIPLICATIVE MODELS: See the GNM add-on package for R described
above under MULTINOMIAL MODELS.
CORRESPONDENCE ANALYSIS: The
site Correspondence
Analysis with R accompanies the 2005 book by Fionn Murtagh on this
topic.
CONFIDENCE INTERVALS FOR ASSOCIATION PARAMETERS IN 2x2 AND 2xc TABLES: For a
binomial proportion and for parameters comparing two binomial
proportions such as the difference of proportions, relative risk, and
odds ratio, a good general-purpose method for constructing confidence
intervals is to invert the score test. Such intervals are not
available in the standard software packages. Here are R functions for
confidence intervals for a proportion, R functions for confidence
intervals comparing two proportions with independent samples, and
R functions for confidence
intervals comparing two proportions with dependent samples. These
sites also contain R functions for some "exact" small-sample intervals
that guarantee at least the nominal coverage probability (such as the
Clopper-Pearson and Blaker confidence intervals for a proportion) and
adjustments of the Wald interval. Most of these were written by my
former graduate student, Yongyi Min. The confidence intervals for
a proportion include the mid-P adaptation of the Clopper-Pearson
interval (written by Anna Gottard, Univ. of Firenze).
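The functions posted at these sites are not reproduced here. As a minimal sketch of two of the intervals mentioned for a single proportion, the code below computes the score interval in closed form and the Clopper-Pearson interval from beta quantiles. The function names score_ci and clopper_pearson_ci are hypothetical.

```r
# Score confidence interval for a binomial proportion: the set of pi0
# for which |p - pi0| / sqrt(pi0 * (1 - pi0) / n) <= z
score_ci <- function(y, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- y / n
  center <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = center - half, upper = center + half)
}

# Clopper-Pearson "exact" interval, from quantiles of beta distributions
clopper_pearson_ci <- function(y, n, conf = 0.95) {
  alpha <- 1 - conf
  lower <- if (y == 0) 0 else qbeta(alpha / 2, y, n - y + 1)
  upper <- if (y == n) 1 else qbeta(1 - alpha / 2, y + 1, n - y)
  c(lower = lower, upper = upper)
}

score_ci(6, 10)
clopper_pearson_ci(6, 10)
```

For a single proportion, these should agree with prop.test(6, 10, correct = FALSE)$conf.int and binom.test(6, 10)$conf.int, respectively.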
Yongyi Min has also prepared some R functions for Bayesian
confidence intervals for 2x2 tables using independent beta priors
for two binomial parameters, for the difference of proportions, odds
ratio, and relative risk. (These are evaluated and compared to score
confidence intervals in a 2005 article in the journal Biometrics by
Agresti and Min.)
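To give a sense of the idea (this is not the posted code), here is a minimal Monte Carlo sketch of an equal-tail posterior interval for the difference of proportions under independent beta priors. The function name bayes_diff_ci, the uniform Beta(1, 1) default priors, and the simulation approach are assumptions for illustration; the posted functions may compute the interval differently.

```r
# With a Beta(a, b) prior and y successes in n trials, the posterior
# is Beta(y + a, n - y + b); the two posteriors are independent
bayes_diff_ci <- function(y1, n1, y2, n2, a = 1, b = 1,
                          conf = 0.95, draws = 1e5) {
  p1 <- rbeta(draws, y1 + a, n1 - y1 + b)
  p2 <- rbeta(draws, y2 + a, n2 - y2 + b)
  alpha <- 1 - conf
  # Equal-tail interval from the simulated posterior of p1 - p2
  quantile(p1 - p2, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(1)                      # for reproducibility of the simulation
ci <- bayes_diff_ci(7, 10, 3, 10)
ci
```

Intervals for the odds ratio or relative risk follow the same pattern, replacing p1 - p2 with the corresponding function of the simulated values.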
Euijung Ryu (a former PhD student of mine who is now at Mayo Clinic)
has prepared R functions for various confidence intervals for the
ordinal measure [P(Y1 > Y2) + (1/2)P(Y1 = Y2)] that is useful for
comparing two multinomial distributions on an ordinal scale. Here is
a pdf file of CIs for ordinal effect measure, including simple methods
as well as score and profile likelihood intervals (which require using
Joe Lang's mph.fit function). Euijung has also prepared R functions for
multiple comparisons of proportions with independent samples using
simultaneous confidence intervals for the difference of proportions or
the odds ratio, based on the studentized-range inversion of score
tests proposed by Agresti, Bini, Bertaccini, and Ryu in the journal
Biometrics, 2008.
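The point estimate of the ordinal measure [P(Y1 > Y2) + (1/2)P(Y1 = Y2)] is simple to compute from two rows of counts. The following sketch shows only the estimate, not the confidence intervals; the function name ordinal_measure and the counts are hypothetical.

```r
# n1, n2: counts for two independent multinomial samples, with
# categories listed in the same increasing ordinal order
ordinal_measure <- function(n1, n2) {
  p1 <- n1 / sum(n1)
  p2 <- n2 / sum(n2)
  joint <- outer(p1, p2)   # joint[i, j] = P(Y1 = i) * P(Y2 = j)
  # Below the diagonal i > j, so Y1 falls in a higher category than Y2
  sum(joint[lower.tri(joint)]) + 0.5 * sum(diag(joint))
}

ordinal_measure(c(1, 2, 3), c(3, 2, 1))   # 26/36, about 0.72
ordinal_measure(c(5, 5, 5), c(5, 5, 5))   # 0.5 for identical distributions
```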
Please quote this site if you use one of these R functions for
confidence intervals for association parameters. We believe these
functions are dependable, but no guarantees or support are available,
so use them at your own risk.
Stata
A listing of the extensive selection of categorical data methods
available in Stata is given in Table 3 of the article by R. A. Oster
in the August 2002 issue of The American Statistician (pp. 243-244);
the main focus of that article is on methods for small-sample exact
analysis. For information about Stata (including its use for
complex methods such as generalized linear mixed models and GEE),
see "A Handbook of Statistical Analyses Using Stata," 3rd ed., by
S. Rabe-Hesketh and B. Everitt, CRC Press, 2003. For examples of
categorical data analyses for many data sets in the first edition of
my text "An Introduction to Categorical Data Analysis", see the
useful site set up by
the UCLA
Statistical Computing Center. For a pdf file on using GEE in Stata,
see GEE pdf
file.
SPSS (version 15)
CONTINGENCY TABLES:
The DESCRIPTIVE STATISTICS option on the ANALYZE menu has a suboption
called CROSSTABS, which provides several methods for contingency
tables. After identifying the row and column variables in CROSSTABS,
clicking on STATISTICS provides a wide variety of options, including
the chi-squared test and measures of association. The output lists
the Pearson statistic, its degrees of freedom, and its P-value
(labeled Asymp. Sig.). If any expected frequencies in a 2x2 table are
less than 5, SPSS also reports the results of Fisher's exact test.
That test can also be requested by clicking on Exact in the CROSSTABS
dialog box and selecting the exact test. SPSS also has an advanced
module for small-sample inference
(called SPSS Exact Tests) that provides exact P-values for various
tests in CROSSTABS and NPAR TESTS procedures. For instance, the Exact
Tests module provides exact tests of independence for r x c
contingency tables with nominal or ordinal classifications. See the
publication "SPSS Exact Tests 15.0 for Windows."
In CROSSTABS, clicking on CELLS provides options for displaying
observed and expected frequencies, as well as the standardized
residuals, labeled as "Adjusted standardized". Clicking on STATISTICS
in CROSSTABS provides options of a wide variety of statistics other
than chi-squared, including gamma and Kendall's tau-b. The output
shows the measures and their standard errors (labeled
Asymp. Std. Error), which you can use to construct confidence
intervals. It also provides a test statistic for testing that the
true measure equals zero, which is the ratio of the estimate to its
standard error. This test uses a simpler standard error that only
applies under independence and is inappropriate for confidence
intervals. One option in the list of statistics, labeled Risk,
provides as output the odds ratio and its confidence interval.
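For readers who want to check such output by hand, the usual large-sample calculation for the odds ratio can be sketched in R. The function name odds_ratio_wald and the counts are hypothetical, and it is an assumption that the interval in the SPSS output is this Wald interval on the log scale.

```r
# Sample odds ratio and Wald confidence interval for a 2x2 table of
# counts: log(OR) +/- z * sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
odds_ratio_wald <- function(tab, conf = 0.95) {
  stopifnot(all(dim(tab) == c(2, 2)), all(tab > 0))
  or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
  se_log <- sqrt(sum(1 / tab))
  z <- qnorm(1 - (1 - conf) / 2)
  c(estimate = or,
    lower = exp(log(or) - z * se_log),
    upper = exp(log(or) + z * se_log))
}

tab <- matrix(c(10, 5, 4, 12), nrow = 2)   # hypothetical counts
odds_ratio_wald(tab)
```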
Suppose you enter the data as cell counts for the various combinations
of the two variables, rather than as responses on the two variables
for individual subjects; for instance, perhaps you call COUNT the
variable that contains these counts. Then select the WEIGHT CASES
option on the DATA menu in the Data Editor window and instruct SPSS
to weight cases by COUNT.
GLMs and LOGISTIC REGRESSION:
To fit generalized linear models, on the ANALYZE menu select the
GENERALIZED LINEAR MODELS option and the GENERALIZED LINEAR MODELS
suboption. Select the Dependent Variable and then the Distribution
and Link Function. Click on the Predictors tab at the top of the
dialog box and then enter quantitative variables as Covariates and
categorical variables as Factors. Click on the Model tab at the top
of the dialog box and enter these variables as main effects, and
construct any interactions that you want in the model. Click on OK to
run the model.
To fit logistic regression models, on the ANALYZE menu select the
REGRESSION option and the BINARY LOGISTIC suboption. In the LOGISTIC
REGRESSION dialog box, identify the binary response (dependent)
variable and the explanatory predictors (covariates). Highlight
variables in the source list and click on a*b to create an
interaction term. Identify the explanatory variables that are
categorical and for which you want dummy variables by clicking on
Categorical and declaring such a covariate to be a Categorical
Covariate in the LOGISTIC REGRESSION: DEFINE CATEGORICAL VARIABLES
dialog box. Highlight the categorical covariate and under Change
Contrast you will see several options for setting up dummy variables.
The Simple contrast constructs them as in this text, in which
the final category is the baseline.
In the LOGISTIC REGRESSION dialog box, click on Method for
stepwise model selection procedures, such as backward elimination.
Click on Save to save predicted probabilities, measures of
influence such as leverage values and DFBETAS, and standardized
residuals. Click on Options to open a dialog box that contains
an option to construct confidence intervals for exponentiated
parameters.
Another way to fit logistic regression models is with the GENERALIZED
LINEAR MODELS option and suboption on the ANALYZE menu. You pick the
binomial distribution and logit link function. It is also possible
there to enter the data as the number of successes out of a certain
number of trials, which is useful when the data are in contingency
table form. One can also fit such models using the LOGLINEAR option
with the LOGIT suboption in the ANALYZE menu. One identifies the
dependent variable, selects categorical predictors as factors, and
selects quantitative predictors as cell covariates. The default fit
is the saturated model for the factors, without including any
covariates. To change this, click on Model and select a Custom model,
entering the predictors and relevant interactions as terms in a
customized (unsaturated) model. Clicking on Options, one can also
display standardized residuals (called adjusted residuals) for model
fits. This approach is well suited for logit models with categorical
predictors, since standard output includes observed and expected
frequencies. When the data file contains the data as cell counts,
such as binomial numbers of successes and failures, one weights each
cell by the cell count using the WEIGHT CASES option in the DATA
menu.
MULTINOMIAL RESPONSES and LOGLINEAR MODELS:
SPSS can also fit logistic models for categorical response variables
having several response categories. On the ANALYZE menu, choose the
REGRESSION option and then the ORDINAL suboption for a cumulative
logit model. Select the MULTINOMIAL LOGISTIC suboption for a
baseline-category logit model. In the latter, click on
Statistics and check Likelihood-ratio tests under Parameters to
obtain results of likelihood-ratio tests for the effects of the
predictors.
For loglinear models, one uses the LOGLINEAR option with GENERAL
suboption in the ANALYZE menu. One enters the factors for the model.
The default is the saturated model, so click on Model and select
a Custom model. Enter the factors as terms in a customized
(unsaturated) model and then select additional interaction effects.
Click on Options to show options for displaying observed and
expected frequencies and adjusted residuals. When the data file
contains the data as cell counts for the various combinations of
factors rather than as responses listed for individual subjects,
weight each cell by the cell count using the WEIGHT CASES option in
the DATA menu.
GLIM
See the first edition of "Categorical Data Analysis" (1990) for several
GLIM examples, as well as the 2005 text by Aitkin, Francis,
and Hinde on "Statistical Modeling in GLIM4" (Oxford) and Jim Lindsey's
1989 text on "The Analysis of Categorical Data Using GLIM"
(Springer-Verlag). See
Statlib for an archive of GLIM macros. Also,
Rory Wolfe has prepared macros for cumulative link models.
StatXact and LogXact
StatXact (Cytel Software,
Cambridge MA) provides exact analysis for categorical data methods and
some nonparametric methods. Among its procedures are small-sample
confidence intervals for differences and ratios of proportions and for
odds ratios, and Fisher's exact test and its generalizations for IxJ
tables. It also can conduct exact tests of conditional independence
and of equality of odds ratios in 2x2xK tables, and exact confidence
intervals for the common odds ratio in several 2x2 tables. StatXact
uses Monte Carlo methods to approximate exact P-values and confidence
intervals when a data set is too large for exact inference to be
computationally feasible. Its companion LogXact performs exact
conditional logistic regression. The President of Cytel Software is
Dr. Cyrus Mehta, who has been one of the most active researchers in
the past 20 years in advancing the development of algorithms for
conducting small-sample inference for categorical data. For a brief
survey of the capability of these packages, see the article by
R. A. Oster in the August 2002 issue of The American Statistician
(pp. 243-244).
Others
SUDAAN provides analyses for
categorical and continuous data from stratified multi-stage cluster
designs. It has a facility (the MULTILOG procedure) for GEE analyses of
marginal models for nominal and ordinal responses. See SUDAAN
GEE.
Robert
Newcombe at the University of Wales in Cardiff provides an Excel
spreadsheet for forming various confidence intervals for a proportion
and for comparing two proportions with independent or with matched
samples. His website also has SPSS and Minitab macros for doing this.
Berger and Boos at North Carolina State University provide the
Berger-Boos test and other small-sample unconditional tests for
2-by-2 tables.
Latent
Gold is the website for the Latent Gold program (marketed by
Statistical Innovations of Belmont, MA) for fitting a wide variety of
latent class models.
Pesarin
and Salmaso give a variety of permutation analyses for categorical
and continuous variables, including some multivariate analyses, using
a SAS macro constructed by Luigi Salmaso at the University of Padova.
For a survey of software for implementing the GEE method, see the
article by Horton and Lipsitz in The American Statistician, 1999,
vol. 53, pp. 160-169.
Copyright © 2008, Alan Agresti, Department of Statistics,
University of Florida.