Semiparametric Bayesian Analysis of Matched Case-Control Studies with
Missing Exposure
We consider Bayesian analysis of matched case-control problems when one of the
covariates is partially missing. The standard approach of conditional
logistic regression does not work efficiently when the exposure
variable is partially missing. The present work develops a likelihood-based
approach that models the distribution of the partially missing exposure.
Within the likelihood context, the standard approach to this problem
is to posit a fully parametric model among the controls for the
partially missing covariate as a function of the covariates in the model
and the variables making up the strata; sometimes the stratum effects are
ignored at this stage. Our approach differs not only in that it is
Bayesian, but, far more importantly, in the manner in which we treat the
stratum effects. In a matched case-control study with no missing
data, the strata are treated completely nonparametrically. Our
approach is a Bayesian version of this common frequentist device: we
assume a Dirichlet process prior with a base measure normal
distribution for the stratum effects and estimate all the parameters in a
Bayesian framework. Since the posterior distributions of the
parameters are of non-standard form, we use Markov chain Monte Carlo (MCMC)
techniques to estimate the parameters. Two matched case-control examples and a simulation
study are considered to illustrate our methods and the computing scheme.
We extend our methods to situations in which the disease has multiple
categories, there is a set of mutually associated exposures, or the
exposures are measured with error.
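To fix ideas, here is a minimal sketch, in Python, of one ingredient of such a scheme: drawing stratum effects from a (truncated) Dirichlet process prior with a normal base measure via stick-breaking. The concentration parameter, base-measure values, and truncation level are illustrative assumptions, not quantities from the talk.

```python
# A minimal sketch (not the authors' implementation) of drawing stratum
# effects from a truncated Dirichlet process prior with a normal base
# measure, one ingredient of the MCMC scheme described above.
# All parameter values below (alpha, mu0, sigma0, truncation level) are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def draw_stratum_effects(n_strata, alpha=1.0, mu0=0.0, sigma0=1.0, trunc=50):
    """Draw stratum effects from a truncated stick-breaking DP(alpha, N(mu0, sigma0^2))."""
    # Stick-breaking weights
    v = rng.beta(1.0, alpha, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    # Atom locations drawn from the normal base measure
    atoms = rng.normal(mu0, sigma0, size=trunc)
    # Each stratum effect is one of the atoms; ties induce clustering of strata
    labels = rng.choice(trunc, size=n_strata, p=w / w.sum())
    return atoms[labels]

effects = draw_stratum_effects(n_strata=30)
print(np.round(effects, 2))
```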
Regression Models for Discrete-Valued Time Dependent Data
Independent random effects in generalized linear models induce an
exchangeable correlation structure, but long sequences of counts or binomial
observations typically show correlations decaying with increasing lag. This
talk introduces models with autocorrelated random effects for a more
appropriate, parameter-driven analysis of discrete-valued time series data.
We present a Monte Carlo EM algorithm with Gibbs sampling to jointly obtain
maximum likelihood estimates of regression parameters and variance
components. Marginal mean, variance and correlation properties of the
conditionally specified models are derived for Poisson, negative binomial and
binary/binomial random components. Estimation of the joint probability of two
or more events is possible and used in predicting future responses. Also, all
methods are flexible enough to allow for multiple gaps or missing
observations in the observed time series. The approach is illustrated with a
time series of 168 monthly counts of polio infections and a long binary time
series.
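The kind of model this describes can be sketched as follows: a Poisson time series driven by an AR(1) latent random effect, whose count autocorrelations decay with lag. The parameter values and the simple simulation below are illustrative assumptions, not the speaker's data or code.

```python
# A minimal simulation sketch (assumed model form, not the speaker's code) of a
# parameter-driven Poisson time series with an AR(1) autocorrelated random
# effect, illustrating count correlations that decay with increasing lag.
import numpy as np

rng = np.random.default_rng(1)
T, beta0, rho, tau = 500, 1.0, 0.8, 0.5   # illustrative values

# Latent AR(1) random effects: alpha_t = rho * alpha_{t-1} + eta_t
alpha = np.zeros(T)
for t in range(1, T):
    alpha[t] = rho * alpha[t - 1] + rng.normal(0.0, tau)

# Conditionally Poisson counts given the latent process
y = rng.poisson(np.exp(beta0 + alpha))

# Sample autocorrelations of the counts decay with lag
def acf(x, lag):
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

print([round(acf(y, k), 3) for k in (1, 2, 5, 10)])
```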
Smoothing Functional Data for Cluster Analysis
Cluster analysis is an important exploratory tool for analyzing many
types of data. In particular, we explore the problem of clustering
functional data, which arise as curves, characteristically observed as
part of a continuous process. We examine the effect of smoothing such
data on dissimilarity estimation and cluster analysis. We prove that a
shrinkage method of smoothing results in a better estimator of the
dissimilarities among a set of noisy curves. Strong empirical evidence
is given that smoothing functional data before clustering results in a
more accurate grouping than clustering the observed data without
smoothing. An example involving yeast gene expression data illustrates
the technique. (This is Ph.D. dissertation research directed by James
Booth and George Casella.)
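The workflow can be sketched as: shrink each noisy curve toward a smooth version, compute pairwise dissimilarities on the smoothed curves, then cluster. The soft-thresholded Fourier smoother below is only one convenient shrinkage smoother, not necessarily the estimator analyzed in the dissertation, and all settings are illustrative.

```python
# A rough sketch of the idea: shrink each noisy curve toward a smooth version
# before computing dissimilarities and clustering. The soft-thresholded
# Fourier smoother used here is only an illustrative shrinkage smoother, not
# necessarily the estimator analyzed in the dissertation.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 100)
# Two groups of true curves, observed with noise
group1 = [np.sin(2 * np.pi * t) + rng.normal(0, 0.5, t.size) for _ in range(10)]
group2 = [np.cos(2 * np.pi * t) + rng.normal(0, 0.5, t.size) for _ in range(10)]
curves = np.array(group1 + group2)

def shrink_smooth(y, thresh=10.0):
    """Soft-threshold Fourier coefficients (a simple shrinkage smoother)."""
    coefs = np.fft.rfft(y)
    mags = np.maximum(np.abs(coefs), 1e-12)
    shrunk = np.where(mags > thresh, coefs * (1 - thresh / mags), 0.0)
    return np.fft.irfft(shrunk, n=y.size)

smoothed = np.array([shrink_smooth(y) for y in curves])

# Cluster smoothed curves via their pairwise L2 dissimilarities
labels = fcluster(linkage(pdist(smoothed), method="average"), t=2, criterion="maxclust")
print(labels)   # ideally separates the two groups
```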
Measurement Error Models in Small Area Estimation
This work looks at the problem of estimation in the small area setup where
the covariates are measured with error. In other words, it considers the role
of measurement error models in small area estimation.
We consider simultaneous estimation of finite population means for several
strata based on two different model structures and assumptions. In each
case, a model-based approach is taken, where the covariates in the
superpopulation model are subject to measurement error. In the
first setup, empirical Bayes (EB) estimators of the strata means are developed
and an asymptotic expression for the mean squared error of the vector of EB
estimators is obtained. In the second setup, we develop both EB
and hierarchical Bayes (HB) estimators of the strata means. In both cases, findings
are supported by appropriate data analyses and are further validated by
simulation studies.
t vs. T2 in Simultaneous Inference
There is probably no study that does not require
multiple hypothesis testing. It is well known that in a general
multi-parameter setting, there may not exist any unique best test.
More importantly, unlike the univariate case, the power of
different test procedures can vary remarkably in multivariate
analysis. In this talk we discuss three methods of combining
dependent univariate tests for the multivariate location problem. A
Monte Carlo study indicates that in many cases the power of the
combination methods is much better than that of Hotelling-type tests.
A relationship is established between Fisher's method of combining
tests and a new class of tests that have best average power for
multivariate linear hypotheses. Furthermore, two step-up
simultaneous tests regarding the number of true null hypotheses
are proposed. It is shown that each procedure controls the
familywise error rate in the strong sense. Applications in
microarray analysis, QTL detection, and composite analysis of
clinical trials are also considered.
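As a rough illustration of the kind of comparison involved, the sketch below contrasts Hotelling's one-sample T² with a Fisher-type combination of coordinatewise t-tests whose null distribution is calibrated by Monte Carlo to respect the dependence. It is not one of the three procedures of the talk, and the dimension, sample size, correlation, and shift are arbitrary choices.

```python
# An illustrative power comparison (not the speaker's procedures): Hotelling's
# one-sample T^2 test versus a combination of coordinatewise t-tests via
# Fisher's statistic, with the combination's null distribution calibrated by
# Monte Carlo to account for dependence.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p, n, rho, shift, nrep = 3, 25, 0.5, 0.4, 2000
Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)

def hotelling_pvalue(x):
    xbar, S = x.mean(axis=0), np.cov(x, rowvar=False)
    t2 = n * xbar @ np.linalg.solve(S, xbar)
    f = (n - p) / (p * (n - 1)) * t2
    return stats.f.sf(f, p, n - p)

def fisher_stat(x):
    pvals = stats.ttest_1samp(x, 0.0).pvalue   # coordinatewise t-tests
    return -2 * np.sum(np.log(pvals))

# Calibrate the Fisher statistic under the null (dependent coordinates)
null_stats = [fisher_stat(rng.multivariate_normal(np.zeros(p), Sigma, n)) for _ in range(nrep)]
crit = np.quantile(null_stats, 0.95)

# Power under a common mean shift
mu = shift * np.ones(p)
rej_t2 = rej_fisher = 0
for _ in range(nrep):
    x = rng.multivariate_normal(mu, Sigma, n)
    rej_t2 += hotelling_pvalue(x) < 0.05
    rej_fisher += fisher_stat(x) > crit
print("Hotelling T^2 power:", rej_t2 / nrep, " Fisher combination power:", rej_fisher / nrep)
```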
A Mixture Representation of the Stationary Distribution
When a Markov chain satisfies a minorization condition, its stationary
distribution can be represented as an infinite mixture. The
distributions in the mixture are associated with the hitting times on
an accessible atom introduced via the splitting construction of
Athreya and Ney (1978). This mixture representation is closely
related to perfect sampling and has applications in Markov chain Monte
Carlo.
(This is joint work with Christian Robert, Universite Paris
Dauphine.)
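For flavor, in the simplest case where the minorization holds on the whole state space with a constant ε in (0, 1), the representation can be written as below; the general result of the talk works with hitting times on the atom created by the split chain, and the notation here is generic.

```latex
% A sketch of the flavor of the result in the simplest (global) case, assuming
% the minorization P(x, .) >= epsilon * nu(.) holds for every x with
% 0 < epsilon < 1; the talk's general version involves hitting times on the
% atom created by the Athreya--Ney split chain.
\[
  P(x, \cdot) \;=\; \epsilon\,\nu(\cdot) + (1-\epsilon)\,R(x,\cdot),
  \qquad R(x,\cdot) := \frac{P(x,\cdot) - \epsilon\,\nu(\cdot)}{1-\epsilon},
\]
\[
  \pi(\cdot) \;=\; \sum_{n=1}^{\infty} \epsilon (1-\epsilon)^{\,n-1}\,
  \bigl(\nu R^{\,n-1}\bigr)(\cdot),
\]
% i.e., a geometric mixture over the time elapsed since the last regeneration,
% with the state at regeneration drawn from nu.
```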
Wavelet Kernel Penalized Estimation for Non-equispaced Design
Regression
The talk considers regression problems with univariate design
points. The design points are irregular and no assumptions on
their distribution are imposed. The regression function is
retrieved by a wavelet based reproducing kernel Hilbert space
(RKHS) technique with the penalty equal to the sum of blockwise
RKHS norms. In order to simplify numerical optimization, the
problem is replaced by an equivalent quadratic minimization
problem with an additional penalty term. The computational
algorithm is described in detail and is applied to both
simulated and real data. Comparison with existing methods
shows that the proposed technique does not
oversmooth the function and is superior in terms of the mean
squared error. It is also demonstrated that under additional
assumptions on design points the method achieves asymptotic
optimality in a wide range of Besov spaces.
This is joint work with U. Amato (CNR, Naples, Italy) and
A. Antoniadis (Laboratoire IMAG-LMC, University Joseph Fourier, France).
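As a generic illustration of RKHS-penalized regression on an irregular design, the sketch below uses a Gaussian kernel with a single ridge penalty; the talk's method instead uses a wavelet-based reproducing kernel with a sum of blockwise RKHS norms, and the bandwidth and penalty values here are arbitrary.

```python
# A generic sketch of RKHS-penalized regression on an irregular design using a
# Gaussian kernel and a single ridge penalty; the talk instead uses a
# wavelet-based reproducing kernel with a sum of blockwise RKHS norms.
# Bandwidth and penalty values are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(4)
n, bandwidth, lam = 80, 0.1, 1e-3

# Irregular (non-equispaced) design points
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.3, n)

def gauss_kernel(a, b, h):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h ** 2))

# Kernel ridge estimate: minimize ||y - f||^2 + lam * n * ||f||_RKHS^2
K = gauss_kernel(x, x, bandwidth)
coef = np.linalg.solve(K + lam * n * np.eye(n), y)

xgrid = np.linspace(0, 1, 200)
fhat = gauss_kernel(xgrid, x, bandwidth) @ coef
print(fhat[:5])
```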
Functional Convex Averaging for Time-Warped Random Curves and
Its Application to Clustering Temporal Gene Expression Data
Data often arise as a sample of curves in science and engineering. When the
dynamics of development, growth or response over time are at issue, subjects
or experimental units may experience events at a different temporal pace. For
functional data where trajectories may be individually time-transformed, it is
usually inadequate to use commonly employed sample statistics such as the
cross-sectional mean. One may then consider subjecting each observed curve to
a time transformation in an attempt to reverse the warping of the time scale,
prior to further statistical analysis. Dynamic time warping, alignment, curve
registration and landmark-based methods have been put forward with the goal of
finding adequate empirical time transformations.
Previous analyses of warping have typically not been based on a model where
individual observed curves are viewed as realizations of a stochastic process.
We propose a functional convex synchronization model, under the premise that
each observed curve is the realization of a stochastic process. Monotonicity
constraints on time evolution provide the motivation for a functional convex
calculus with the goal of obtaining sample statistics such as a functional
mean. Observed random functions in warped time space are represented by a
bivariate random function in synchronized time space, consisting of a
stochastic monotone time transformation function and an unrestricted random
amplitude function. This leads to the definition of a functional convex
average or "longitudinal average", which is in contrast to the conventional
"cross-sectional" average. We derive a functional limit theorem and asymptotic
confidence intervals for functional convex means. The results are illustrated
with a novel time warping transformation. The methods are applied to simulated
data and the Berkeley growth data. This nonparametric time-synchronized
algorithm is also combined with an iterative mean updating technique to find
an overall representation that corresponds to a mode of a sample of gene
expression profiles, viewed as a random sample in function space.
This talk is based on joint work with Dr. Hans-Georg Müller.
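A toy illustration of the underlying problem: averaging peak-shifted curves cross-sectionally flattens their common shape, whereas averaging after even a crude shift alignment (used below only as a stand-in for the functional convex averaging of the talk) preserves it. The curves and shifts are simulated for illustration.

```python
# A toy illustration of why the cross-sectional mean can misrepresent
# time-warped curves: averaging peak-shifted bumps flattens and widens the
# peak, whereas averaging after a simple shift alignment (a crude stand-in for
# the functional convex averaging of the talk) preserves the common shape.
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 201)
shifts = rng.uniform(-0.15, 0.15, size=20)           # individual time shifts
curves = np.array([np.exp(-((t - 0.5 - s) ** 2) / 0.005) for s in shifts])

cross_sectional_mean = curves.mean(axis=0)

# Align each curve so its peak sits at the common location before averaging
aligned = np.array([np.roll(c, np.argmax(cross_sectional_mean) - np.argmax(c))
                    for c in curves])
aligned_mean = aligned.mean(axis=0)

print("peak of cross-sectional mean:", round(cross_sectional_mean.max(), 3))
print("peak of aligned mean:        ", round(aligned_mean.max(), 3))
# The aligned mean retains a peak height near 1; the cross-sectional mean does not.
```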
Inference with Monte Carlo Data: A Paradox of "Knowing Too Much"?
(Or "How to cure our schizophrenia?")
In the past half century or so, physicists, computer scientists,
statisticians and many others have made tremendous advances in
designing efficient Monte Carlo algorithms. In contrast, methods
for Monte Carlo estimation, namely, using the simulated
data generated by these algorithms to estimate quantities of interest,
are almost exclusively based on the most primitive estimation technique,
that is, taking a sample average or simple variations of it
(e.g., importance sampling). So what happened to all these wonderful
estimation methods, such as maximum likelihood
and Bayesian methods? Given that these methods are so powerful for
analyzing real data where the underlying true model is at best partially
known, why are they not used for analyzing simulated data, where the
underlying model is completely known (at least in principle)?
Based on a recent paper by Kong, McCullagh, Meng, Nicolae, and Tan
(Journal of the Royal Statistical Society, 2003, 585-618), this talk
demonstrates that a satisfactory answer to such questions not only satisfies
our philosophical curiosity but, more importantly,
can lead to Monte Carlo estimators with efficiency gains that we are generally
unaware of. In particular, we give a practical example where the new
Monte Carlo estimator converges at the super-fast 1/n rate instead of the
usual 1/sqrt(n) rate, where n is the size of the simulated data.
A General Approach to Mixed Effects Modeling of Residual Variances
in Generalized Linear Mixed Models
We propose a mixed effects approach to structured heteroskedastic error
modeling for generalized linear mixed models in which linked functions of
subject-specific means and residual variances are each specified as separate
linear combinations of fixed and random effects. We focus on the linear
mixed model (LMM) analysis of Gaussian data and the cumulative probit mixed
model (CPMM) analysis of ordinal data. All analyses were based on
Markov chain Monte Carlo methods, with each model based on a Bayesian
hierarchical construction. The deviance information criterion (DIC) was
demonstrated to be useful in correctly choosing between homoskedastic and
heteroskedastic error GLMMs for both traits when data were generated according
to a mixed model specification for both location parameters and residual
variances. Heteroskedastic error LMM and CPMM were fitted, respectively, to
birthweight (BW) and calving ease (CE) data on calves from Italian Piemontese
first parity dams. For both traits, residual variances were modeled as
functions of fixed calf sex and random herd effects. The posterior mean
residual variance for male calves was over 40% greater than that for female
calves for both traits. Also, the posterior means of the coefficient of
variation of herd-specific variance ratios were estimated to be
0.60±0.09 for BW and 0.74±0.14 for CE. For both traits, the
heteroskedastic error LMM and CPMM were chosen over their homoskedastic error
counterparts based on DIC values. The benefits of heavy-tailed
(e.g. Student t) specifications for structured heteroskedastic error
models are also briefly illustrated.
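One generic way to write the dual location/dispersion specification described above is given below; the symbols are illustrative and not the authors' exact notation.

```latex
% A generic way to write the dual location/dispersion specification described
% above (symbols are illustrative, not the authors' exact notation): linked
% functions of the conditional mean and of the residual variance are each
% modeled with their own fixed and random effects.
\[
  g\!\left(\mathrm{E}[y_{ij} \mid \mathbf{u}]\right) \;=\;
    \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\mathbf{u},
  \qquad
  \log \sigma^2_{e,ij} \;=\; \mathbf{w}_{ij}'\boldsymbol{\delta} + \mathbf{s}_{ij}'\mathbf{v},
\]
% where u and v are vectors of random effects (e.g., herd effects) and the
% second equation allows, for example, calf sex (fixed) and herd (random) to
% act on the residual variance.
```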
Semiparametric Bayesian Inference Based on AFT Models
An Accelerated Failure Time (AFT) semiparametric regression model for
censored data is proposed as an alternative to the widely used proportional
hazards survival model. The proposed model turns
out to be flexible and practically meaningful. Features include physical
interpretation of the regression coefficients through the mean response time
instead of the hazard functions. It is shown that the regression model,
obtained as a mixture of parametric families, has a proportional mean
structure. The statistical inference is based on a nonparametric Bayesian
approach that uses a Dirichlet process prior for the mixing distribution.
Consistency of the posterior distribution of the regression parameters in the
Euclidean metric is established under certain conditions. Finite-sample
parameter estimates, along with associated measures of uncertainty, can be
computed by an MCMC method. Simulation studies are presented to provide
empirical validation of the new method. Some real data examples are provided
to show the easy applicability of the proposed method.
(joint work with Subhasis Ghosal, NCSU)
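A standard way to write such a semiparametric AFT specification is sketched below; the notation is generic rather than taken from the paper.

```latex
% A standard way to write the semiparametric AFT regression model discussed
% above (notation is illustrative): the error distribution is left
% unspecified and modeled as a mixture with a Dirichlet process prior on the
% mixing distribution.
\[
  \log T_i \;=\; \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i,
  \qquad
  \varepsilon_i \mid G \;\stackrel{\mathrm{iid}}{\sim}\; \int f_\theta(\cdot)\, G(d\theta),
  \qquad
  G \;\sim\; \mathrm{DP}(\alpha, G_0),
\]
% so that covariates act multiplicatively on the failure time T_i, giving the
% regression coefficients an interpretation through the mean response time.
```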
Functional Mapping: Towards High-Dimensional Biology
Many complex traits inherently undergo marked developmental changes during
ontogeny. Traditional mapping approaches that analyze phenotypic data measured
at a single time point are too simple to capture such high-dimensional
biological processes. We have developed a general framework, called functional
mapping, in which the foundation is established for mapping quantitative trait
loci (QTL) that underlie variation in a complex trait of dynamic feature.
Functional mapping provides a useful quantitative and testable framework for
assessing the interplay between gene action, development, sex, genetic
background and environment.
Objective Bayesian Analysis of Contingency Tables
The statistical analysis of contingency tables is typically carried out with a
hypothesis test. In the Bayesian paradigm, default priors for hypothesis
tests are typically improper and cannot be used. Although proper default
priors are available for testing in contingency tables, we show that for
testing independence they can be greatly improved upon by so-called intrinsic
priors.
We also argue that because there is no realistic situation that corresponds to
the case of conditioning on both margins of a contingency table, the proper
analysis of an a × b contingency table should
condition only on the table total or on just one of the margins. The
posterior probabilities from the intrinsic priors provide reasonable answers
in these cases. Examples using simulated and real data are given.
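For orientation, the sketch below computes a Bayes factor for independence in a two-way table under multinomial sampling (conditioning only on the table total) with simple conjugate Dirichlet priors; the intrinsic priors of the talk replace these defaults, and the counts and prior settings below are hypothetical.

```python
# A sketch of a Bayesian test of independence for an a x b table under
# multinomial sampling (conditioning only on the table total), using simple
# conjugate Dirichlet priors rather than the intrinsic priors developed in the
# talk; all counts and prior settings are illustrative.
import numpy as np
from scipy.special import gammaln

def log_dirichlet_norm(alpha):
    """log of the Dirichlet normalizing constant B(alpha)."""
    return np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))

def log_bf_independence(table, a_cell=1.0, a_row=1.0, a_col=1.0):
    """log Bayes factor of independence vs. the unrestricted multinomial."""
    table = np.asarray(table, dtype=float)
    rows, cols = table.sum(axis=1), table.sum(axis=0)
    # Marginal likelihood under independence: Dirichlet priors on row and
    # column probabilities (the multinomial coefficient cancels in the ratio).
    log_m_ind = (log_dirichlet_norm(a_row + rows) - log_dirichlet_norm(np.full(rows.size, a_row))
                 + log_dirichlet_norm(a_col + cols) - log_dirichlet_norm(np.full(cols.size, a_col)))
    # Marginal likelihood under the full model: Dirichlet prior on all cells.
    log_m_full = (log_dirichlet_norm(a_cell + table.ravel())
                  - log_dirichlet_norm(np.full(table.size, a_cell)))
    return log_m_ind - log_m_full

table = [[22, 8], [10, 20]]          # hypothetical 2 x 2 counts
print("log BF (independence vs. dependence):", round(log_bf_independence(table), 3))
```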
Second Guessing Clinical Trial Designs
NB: This is work done jointly with Dr. Myron Chang
Suppose you, as a medical journal reviewer, read a report of a randomized
clinical trial that fails to mention any sequential monitoring plan. Can you
ask the "What if" question without obtaining the actual data? Surprisingly,
in many instances, the answer is yes. If the trial information accrues as
approximate Brownian motion, then the joint predictive distribution of any
collection of effect size estimates at times before the final analysis depends
only upon the effect size estimate at the final analysis. Hence, you can
superimpose a hypothetical group sequential design upon the non-sequential
design. As a side benefit of this research, reference designs with optimal
or near-optimal expected sample sizes relative to single-stage designs are
presented, so that as a reviewer you would simply consult a table to assess the
predictive probabilities for each stopping time and the conditional mean
sample size. In addition, you can superimpose your own group sequential design
upon an actual group sequential design to second guess what you think was a
poor choice of stopping boundaries. In this case, you need the Z-scores (or
single-degree-of-freedom chi-squares) at each interim look. This new
capability should alter the mindset of those designing clinical trials. Since
journal editors and the public can subject the trial to close scrutiny after
the fact, trial designers will be more motivated to make their designs as
efficient as possible. We shall present two actual examples that were heavily
criticized for staying open too long. In one, it is probable that
participants were not properly protected and a report of a critically important
public health benefit was unreasonably withheld from the public. In the
other, the criticism seems to have been unfounded.
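The mechanics can be sketched as follows: given only the final Z-score and Brownian-motion information accrual, interim Z-scores follow a Brownian bridge, so one can simulate how often a hypothetical O'Brien-Fleming-type boundary would have stopped the trial early. The look times and boundary constant below are illustrative choices, not the reference designs of the talk.

```python
# An illustrative sketch of "second guessing" a non-sequential trial: given
# only the final Z-score and assuming Brownian-motion information accrual, the
# interim Z-scores at chosen information fractions follow a Brownian bridge,
# so one can estimate how often a hypothetical O'Brien-Fleming-type boundary
# would have stopped the trial early. Boundary constant and look times are
# illustrative, not from the talk.
import numpy as np

rng = np.random.default_rng(6)

def early_stop_probability(z_final, looks=(0.25, 0.5, 0.75), c=2.24, nsim=20000):
    """P(hypothetical boundary |Z(t)| >= c/sqrt(t) is crossed before t=1 | Z(1)=z_final)."""
    stops = 0
    for _ in range(nsim):
        b_prev, t_prev, crossed = 0.0, 0.0, False
        for t in looks:
            # Brownian bridge step from (t_prev, b_prev) to (1, z_final)
            mean = b_prev + (t - t_prev) / (1 - t_prev) * (z_final - b_prev)
            var = (t - t_prev) * (1 - t) / (1 - t_prev)
            b = rng.normal(mean, np.sqrt(var))
            if abs(b / np.sqrt(t)) >= c / np.sqrt(t):   # i.e., |B(t)| >= c
                crossed = True
                break
            b_prev, t_prev = b, t
        stops += crossed
    return stops / nsim

print(early_stop_probability(z_final=2.5))
```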
Statistics and Bioethics: A Necessary Integration
A fundamental ethical principle underlying medical research is that the
research be designed and conducted in a scientifically valid way. The
Declaration of Helsinki, an international statement of ethical principles for
medical research, includes as one of its articles, "Medical research
involving human subjects must conform to generally accepted scientific
principles, be based on a thorough knowledge of the scientific literature,
other relevant sources of information, and on adequate laboratory and, where
appropriate, animal experimentation." In addition to the
statistician's role in ensuring the validity of research design and
conduct (and thereby its ethical acceptability), statisticians are well
positioned to identify aspects of study designs that raise specific ethical
concerns, and to develop approaches that avoid such concerns. In this
presentation, I will review a number of ethical issues that have arisen
regarding the design and conduct of clinical trials, and discuss the role of
statisticians in addressing these issues. Particular issues for discussion
include interim data monitoring and early decision-making, use of placebo
controls and design of active control trials, randomized consent designs,
adaptive allocation to treatment arm, and trials in special populations.
Public Perceptions of Public Health: Challenges and Dilemmas
Public health policy-making is often complicated, involving the weighing of
risks and benefits that may not be as precisely defined as one might like,
economic issues, and, when the issue at hand is highly visible, anticipated
public perceptions. Public perceptions have become increasingly important in
an era of rapid and easy access to information (and misinformation) through
the World Wide Web, and can create substantial challenges when the relevant
issues are scientifically complex, the data are less than optimally
definitive, and/or when the stakes are high. For example, an individual with
a serious disease may learn about a promising new treatment that is early in
clinical development, and may naturally wish to gain access to that treatment,
even though little data are as yet available on its clinical effectiveness
and its risks. On the other hand, an individual who receives a treatment that
has been widely studied and has been deemed "safe and effective"
by regulatory authorities, but who then suffers a serious adverse effect of
the treatment that had not been recognized earlier as a potential rare risk,
may believe that the treatment was made available too early, and that
sufficient study should have been required to have identified this adverse
effect before the treatment was approved for use. The growth in public
advocacy groups is another factor in public health policy and communication.
Advocacy groups can be highly effective partners in setting policies and in
communicating their rationale; but they can also create obstacles if what
appears to public health officials to be appropriate policy is viewed
negatively by one or more groups. The complexity of many public health
issues can make it difficult to communicate effectively to the public
regarding the rationale for policies taken. A number of examples of these
challenges and dilemmas will be presented.
A Synthesis of Objective Bayesian and Design-Based Methods for
Finite Population Sampling
In the frequentist or design-based approach to finite population sampling,
prior information is incorporated through the sampling design.
In the Bayesian approach, prior information is incorporated through
a prior distribution. Since the posterior distribution does not depend
on the design, it has been difficult to theoretically reconcile the
two approaches. The Polya posterior is an objective Bayesian approach
to finite population sampling that is appropriate when little
or no prior information is available. We will discuss
generalizations of the Polya posterior which allow one to objectively
incorporate into a Bayesian model some types of information
that are usually encapsulated in the design. Inferences
found by simulating from the resulting posteriors will typically
have good frequentist properties and yield sensible estimates of
variance.
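The basic Polya posterior computation can be sketched as follows: the unobserved units are imputed by a Polya urn seeded with the sample, and each completed population yields one posterior draw of the mean. This is the simple-random-sampling case, not the generalizations discussed in the talk, and the data below are hypothetical.

```python
# A small simulation sketch of the Polya posterior idea (a simplified
# illustration, not the generalizations discussed in the talk): given a simple
# random sample from a finite population, the unobserved units are filled in
# by a Polya urn scheme seeded with the sample, and the resulting completed
# populations give a posterior for the population mean.
import numpy as np

rng = np.random.default_rng(7)
N, n, ndraws = 1000, 50, 2000
sample = rng.lognormal(mean=3.0, sigma=0.6, size=n)   # hypothetical observed sample

def polya_completion_mean(sample, N):
    """One draw of the population mean under the Polya posterior."""
    urn = list(sample)
    for _ in range(N - len(sample)):
        urn.append(urn[rng.integers(len(urn))])   # draw a new unit from the current urn
    return np.mean(urn)

draws = np.array([polya_completion_mean(sample, N) for _ in range(ndraws)])
print("sample mean:", round(sample.mean(), 2))
print("Polya posterior mean and 95% interval:",
      round(draws.mean(), 2), np.round(np.percentile(draws, [2.5, 97.5]), 2))
```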
Statistics and Six Sigma: An Explanation of Six Sigma for
Statisticians
Six Sigma is a widely popular process improvement methodology. In this
seminar, an introduction to Six Sigma will be presented. The professions of
Six Sigma Blackbelt and Master Blackbelt will be described as career options for
graduate-degree statisticians. The statistical methods embedded in the Six
Sigma DMAIC roadmap will be highlighted. In particular, the Gauge R&R
study will be examined. Opportunities for research in the realm of Gauge
R&R will be noted. The impact of measurement error on product
dispositioning will be demonstrated with simulations.
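In the spirit of that last demonstration, the sketch below simulates how gauge (measurement) error misclassifies product against specification limits; the process and gauge standard deviations and the spec limits are illustrative assumptions.

```python
# A simple simulation of how gauge (measurement) error leads to misclassifying
# product against specification limits. The process and gauge standard
# deviations and the spec limits are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(8)
n = 200_000
true_sd, gauge_sd = 1.0, 0.3          # process vs. measurement variation
lsl, usl = -3.0, 3.0                  # lower/upper specification limits

true_value = rng.normal(0.0, true_sd, n)
measured = true_value + rng.normal(0.0, gauge_sd, n)

truly_good = (true_value >= lsl) & (true_value <= usl)
passed = (measured >= lsl) & (measured <= usl)

false_reject = np.mean(truly_good & ~passed)   # good product scrapped
false_accept = np.mean(~truly_good & passed)   # bad product shipped
print(f"false reject rate: {false_reject:.4%}, false accept rate: {false_accept:.4%}")
```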
Profile Confidence Intervals for Contingency Table Parameters
A general method for computing profile confidence intervals for contingency
table parameters is described. The method, which is based on the theory of
multinomial-Poisson homogeneous models, lends itself to a general
computational algorithm and is applicable to a broad class of parameters.
The literature suggests that profile score and profile likelihood confidence
intervals generally have better coverage properties than their Wald
counterparts. Profile intervals have been used on a case-by-case basis for
several different contingency table parameters, e.g., odds ratio, relative
risk, and risk difference. These examples in the literature use a common
computational approach that has two main limitations: it is case-specific
and applicable only to a restrictive class of parameters. The method
proposed in this presentation avoids these limitations. Examples of profile
confidence intervals for a variant of the gamma measure of association, a
mean, and a dispersion measure illustrate the method.
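As a point of reference, the sketch below computes a case-specific profile likelihood interval for the odds ratio of two independent binomials by inverting the likelihood ratio test over a grid, the kind of one-off construction that the general multinomial-Poisson homogeneous-model algorithm is designed to replace; the counts are hypothetical.

```python
# A sketch of a case-specific profile likelihood confidence interval for the
# odds ratio of two independent binomials, the kind of single-parameter
# construction the general multinomial-Poisson homogeneous-model method is
# meant to replace. Data values are hypothetical.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2

x1, n1, x2, n2 = 18, 50, 9, 50        # hypothetical successes / trials

def loglik(p1, p2):
    return (x1 * np.log(p1) + (n1 - x1) * np.log(1 - p1)
            + x2 * np.log(p2) + (n2 - x2) * np.log(1 - p2))

def profile_loglik(psi):
    """Maximize the log-likelihood over p2 with the odds ratio fixed at psi."""
    def neg(p2):
        p1 = psi * p2 / (1 - p2 + psi * p2)   # p1 implied by the odds ratio
        return -loglik(p1, p2)
    return -minimize_scalar(neg, bounds=(1e-6, 1 - 1e-6), method="bounded").fun

mle = loglik(x1 / n1, x2 / n2)
cutoff = chi2.ppf(0.95, df=1) / 2.0

# Invert the likelihood ratio test over a grid of odds-ratio values
grid = np.exp(np.linspace(np.log(0.2), np.log(20.0), 400))
inside = np.array([mle - profile_loglik(psi) <= cutoff for psi in grid])
print("95% profile CI for the odds ratio:",
      round(grid[inside].min(), 2), "to", round(grid[inside].max(), 2))
```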