Dirichlet Process Priors for Bayesian Models with an Application to Bureaucratic Politics.
In this paper, we advocate greater attention to the use of prior specifications in the statistical
analysis of social science data, recommend a specific nonparametric variant based on a mixture of
Dirichlet processes that has not yet been applied in the social sciences, and develop a new Bayesian
stochastic simulation procedure to handle the resulting estimation challenges. We apply the Dirichlet
process prior to a Bayesian hierarchical model for ordered choices to address a difficult question in
bureaucratic politics. These Dirichlet mixtures represent a new paradigm for semi-informed prior specification, reflecting both the information in the observations and researcher intuition, with neither dominating. (This is joint work with G. Casella.)
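As a hedged illustration of the kind of prior involved (not the authors' estimation procedure), the following minimal sketch draws from a truncated stick-breaking representation of a Dirichlet process mixture of normals; the truncation level, concentration parameter, and base measure are illustrative assumptions.

    import numpy as np

    def sample_dp_mixture(n, alpha=1.0, truncation=50, rng=None):
        """Draw n observations from a (truncated) Dirichlet process mixture of normals.

        Stick-breaking construction: weights w_k = v_k * prod_{j<k}(1 - v_j) with
        v_k ~ Beta(1, alpha); component means drawn from a N(0, 3^2) base measure.
        All settings here are illustrative assumptions, not the seminar's model.
        """
        rng = np.random.default_rng(rng)
        v = rng.beta(1.0, alpha, size=truncation)
        w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
        w /= w.sum()                                   # renormalize after truncation
        atoms = rng.normal(0.0, 3.0, size=truncation)  # component means from the base measure
        labels = rng.choice(truncation, size=n, p=w)
        return rng.normal(atoms[labels], 1.0)          # unit within-component variance

    draws = sample_dp_mixture(1000, alpha=2.0, rng=0)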
Seminar page
Beyond Generalized Linear Models.
We introduce joint generalized linear models, hierarchical generalized linear models, and double
hierarchical generalized linear models, obtained by adding new features to the original generalized linear models.
We discuss how these features can be used in the analysis of data.
Seminar page
Optimal Inferences for Proportional Hazards Model with Parametric Covariate Transformations.
The traditional Cox proportional hazards model assumes a log-linear relationship between covariates and
the underlying hazard function. However, in real data, this linear relationship may be invalid. One
example is data from a cancer clinical trial, National Surgical Adjuvant Breast and Bowel Project (NSABP)
study (B-20), in which the relationship between log hazard and progesterone receptor level is nonlinear
when analyzing disease-free survival. We propose a generalized Cox model that allows parametric covariate
transformations to recover linearity. Although the proposed generalization may seem rather simple, the
inferential issues are quite challenging because of the loss of identifiability under the null hypothesis of no
effect of the transformed covariates. Optimal tests against certain alternatives are derived for this null
hypothesis. Rigorous inferences for the parameters and the unspecified baseline hazard function are established
when regularity conditions hold and the transformed covariates have nonzero effects. The estimates and tests
perform well in simulation studies with realistic sample sizes, and the proposed tests are generally more
powerful than the usual partial likelihood ratio test with fixed transformations or the sup partial likelihood
ratio test. The Protocol B-20 data are used to illustrate the model-building procedure and the better fit of
the proposed model compared to the traditional Cox model.
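A minimal sketch of the modeling idea (not the talk's optimal tests): profile a one-parameter power transformation of a covariate against the Cox partial likelihood using lifelines. The data frame, column names, and the transformation family are assumptions for illustration.

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    def power_transform(x, lam):
        """Box-Cox-type transformation; lam = 0 gives the log transform."""
        x = np.asarray(x, dtype=float)
        return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

    def profile_cox_transform(df, covariate, lam_grid,
                              duration_col="time", event_col="event"):
        """Fit a Cox model with the transformed covariate for each lambda on the grid
        and return the lambda maximizing the partial log-likelihood."""
        best = None
        for lam in lam_grid:
            d = df.copy()
            d[covariate] = power_transform(d[covariate], lam)
            cph = CoxPHFitter().fit(d, duration_col=duration_col, event_col=event_col)
            ll = cph.log_likelihood_
            if best is None or ll > best[1]:
                best = (lam, ll, cph)
        return best  # (lambda_hat, partial log-likelihood, fitted model)

    # Illustrative data with a nonlinear covariate effect (column names are assumptions).
    rng = np.random.default_rng(0)
    x = rng.uniform(0.5, 5.0, 400)
    t = rng.exponential(1.0 / np.exp(0.8 * np.log(x)))
    c = rng.exponential(2.0, 400)
    df = pd.DataFrame({"time": np.minimum(t, c), "event": (t <= c).astype(int), "pr_level": x})
    lam_hat, ll_hat, model = profile_cox_transform(df, "pr_level", np.linspace(0.0, 2.0, 21))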
Seminar page
The Parameter Cascade Method and its Applications in Estimating Air Pollution Models, Fitting HIV
Dynamical Models, and Constructing Gene Regulation Networks.
Many statistical models involve three levels of parameters: (i) nuisance parameters are parameters required
to construct models but not of direct interest, (ii) structural parameters are parameters holding our main
concern, and (iii) complexity parameters are parameters controlling the effective degrees of freedom of the
models. The current methods for estimating these models are computationally intensive and impractical for
"naive" users. In this talk, we will introduce a new method, the parameter cascade method, which estimates
the parameters in three nested levels of optimization, defining the nuisance parameters as regularized functions
of the structural parameters, and the structural parameters in turn as functions of the complexity parameters.
This approach has several unique aspects. First, the computation is fast and stable, with the gradients and
Hessian matrices worked out analytically. Second, unconditional variance estimates are obtained, which
incorporate the uncertainty coming from the other parameter estimates. Finally, each level allows for a different
optimization criterion; otherwise, biases in parameter estimation may be induced.
The parameter cascade method will be illustrated by estimating generalized semiparametric additive models for
air pollution data, fitting HIV dynamical models (ordinary differential equations) to clinical trials, and
constructing gene regulation networks from time course microarray data.
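A minimal sketch of the nested-optimization idea on a partially linear model, not on the applications above: the spline coefficients (nuisance) are a penalized closed-form function of the structural coefficient, the structural coefficient is optimized given the smoothing parameter, and the smoothing parameter (complexity) is chosen by generalized cross-validation. The model, basis, and criteria are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(1)
    n = 300
    t = np.sort(rng.uniform(0, 1, n))
    x = rng.normal(size=n)
    y = 1.5 * x + np.sin(2 * np.pi * t) + rng.normal(scale=0.3, size=n)  # y = x*beta + f(t) + noise

    # Simple truncated power basis for f(t); the basis choice is an assumption.
    knots = np.linspace(0.1, 0.9, 12)
    B = np.column_stack([np.ones(n), t] + [np.clip(t - k, 0, None) ** 3 for k in knots])
    P = np.diag([0.0, 0.0] + [1.0] * len(knots))        # penalize only the spline part

    def inner_coefs(beta, lam):
        """Nuisance level: spline coefficients as a regularized function of beta."""
        r = y - beta * x
        return np.linalg.solve(B.T @ B + lam * P, B.T @ r)

    def middle_beta(lam):
        """Structural level: beta minimizing the fitting criterion given lam."""
        obj = lambda b: np.sum((y - b * x - B @ inner_coefs(b, lam)) ** 2)
        return minimize_scalar(obj).x

    def gcv(lam):
        """Complexity level: generalized cross-validation score for lam."""
        beta = middle_beta(lam)
        S = B @ np.linalg.solve(B.T @ B + lam * P, B.T)  # smoother matrix
        fit = beta * x + B @ inner_coefs(beta, lam)
        edf = np.trace(S) + 1.0
        return np.sum((y - fit) ** 2) / (1.0 - edf / n) ** 2

    lams = 10.0 ** np.arange(-4, 3, dtype=float)
    lam_hat = min(lams, key=gcv)
    beta_hat = middle_beta(lam_hat)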
Seminar page
Improving the Efficiency of the Logrank Test Using Auxiliary Covariates.
The logrank test is widely used in many clinical trials for comparing the survival distribution between
two treatments with censored survival data. Under the assumption of proportional hazards, it is optimal
for testing the null hypothesis H0: β = 0, where β
denotes the logarithm of the hazard ratio. In practice, additional auxiliary covariates are collected
together with the survival times and treatment assignment. If the covariates correlate with survival times,
making use of their information will increase the efficiency of the logrank test. In this paper, we apply
semiparametric theory to characterize a class of regular and asymptotically linear estimators for
β when auxiliary covariates are incorporated into the model, and we derive estimators that are more
efficient. The Wald tests induced by these estimators are shown to be more powerful than the logrank test.
Simulation studies and real data from ACTG 175 are used to illustrate the gains in efficiency.
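As a simple illustration of the comparison being made (not the semiparametric augmented estimators themselves), the sketch below computes the unadjusted logrank test and a Wald test for the treatment effect from a Cox model that also includes an auxiliary covariate, using lifelines. The data, column names, and the covariate-adjusted stand-in are assumptions.

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter
    from lifelines.statistics import logrank_test

    # Simulated trial data (all settings are illustrative assumptions).
    rng = np.random.default_rng(2)
    n = 500
    trt = rng.integers(0, 2, n)
    aux = rng.normal(size=n)                      # auxiliary covariate correlated with survival
    t = rng.exponential(1.0 / np.exp(-0.3 * trt + 0.8 * aux))
    c = rng.exponential(2.0, n)
    df = pd.DataFrame({"time": np.minimum(t, c),
                       "event": (t <= c).astype(int),
                       "trt": trt, "aux": aux})

    # Unadjusted logrank test comparing the two treatment arms.
    a, b = df[df.trt == 1], df[df.trt == 0]
    lr = logrank_test(a["time"], b["time"],
                      event_observed_A=a["event"], event_observed_B=b["event"])

    # Wald test for the treatment coefficient from a Cox model with the auxiliary covariate.
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    wald_z = cph.summary.loc["trt", "z"]          # coefficient / standard error
    print(lr.p_value, wald_z, cph.summary.loc["trt", "p"])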
Seminar page
Financial Data and the Hidden Semimartingale Model.
The availability of high frequency data for financial instruments has opened the possibility of accurately
determining volatility in small time periods, such as one day. Recent work on such estimation indicates
that it is necessary to analyze the data with a hidden semimartingale model, typically by the addition of
measurement error. We review the emerging theory on this subject, including two- and multiscale sampling.
We also consider broader error schemes, through Markov kernels and such phenomena as rounding due to
discreteness of prices. Finally, we discuss the possibility of adapting likelihood theory to inference
problems of this type.
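One of the two-scale estimators referenced here can be sketched as follows, under the standard assumption that observed log-prices equal the efficient price plus i.i.d. microstructure noise. The subsampling scale and the simulated data are illustrative assumptions, and the sketch follows the general two-scale recipe rather than any specific result from the talk.

    import numpy as np

    def realized_variance(logp):
        """Sum of squared consecutive log-returns."""
        return np.sum(np.diff(logp) ** 2)

    def two_scale_rv(logp, K=30):
        """Two-scale realized variance: the average of K subsampled realized variances,
        bias-corrected by the full-sample realized variance (noise correction)."""
        n = len(logp) - 1
        sub = [realized_variance(logp[k::K]) for k in range(K)]
        rv_avg = np.mean(sub)
        n_bar = (n - K + 1) / K
        return rv_avg - (n_bar / n) * realized_variance(logp)

    # Simulated day of noisy high-frequency log-prices (illustrative assumptions).
    rng = np.random.default_rng(3)
    n = 23400                                     # one observation per second over 6.5 hours
    true_vol = 0.2 / np.sqrt(252)                 # daily volatility
    efficient = np.cumsum(rng.normal(0, true_vol / np.sqrt(n), n))
    observed = efficient + rng.normal(0, 5e-4, n) # additive microstructure noise
    print(realized_variance(observed), two_scale_rv(observed), true_vol ** 2)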
Seminar page
Higher-Order Accurate Nonparametric Function Estimation.
Higher-order accurate nonparametric estimation via infinite-order kernels will be discussed with special
attention given to polyspectra estimation in Time Series and hazard function estimation in Survival Analysis.
A novel bandwidth selection algorithm and an interesting connection between polyspectra and group
representations will also be addressed.
Seminar page
Variable Selection Using Single Index Models for Motif Discovery.
Information for regulating a gene's transcription is contained in the conserved patterns (motifs)
on the upstream/downstream DNA sequence (promoter region) close to the target gene. By combining
the information contained in both gene expression measurements and genes' promoter sequences, we
propose a novel procedure for identifying functionally active motifs under certain stimuli. A
nonlinear regression model, the single index model, is used to associate a gene's promoter sequence
information with its mRNA expression measurements. Single index models postulate that the response
variable y depends on the predictors X only through a single linear combination, via an unknown link
function f: y = f(X'β, ε), where β is an index vector and ε represents measurement error.
In this talk, we will describe computationally efficient variable selection procedures and criteria
that we developed under a profile likelihood framework for the single index model. We will also
demonstrate the advantages of these methods both theoretically and empirically. Compared with existing
methods, our proposed procedures can greatly improve variable selection sensitivity and specificity.
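A minimal sketch of profile estimation in a single index model (the variable selection criteria from the talk are not reproduced): for a candidate index vector the link is estimated by leave-one-out kernel regression on the index, and the index is then chosen to minimize the resulting profile least-squares criterion. The kernel, bandwidth, identifiability convention, and toy data are assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def loo_kernel_fit(u, y, h):
        """Leave-one-out Nadaraya-Watson estimate of E[y | index = u] at the observed u."""
        d = (u[:, None] - u[None, :]) / h
        w = np.exp(-0.5 * d ** 2)
        np.fill_diagonal(w, 0.0)                  # leave-one-out to avoid self-fitting
        return (w @ y) / np.maximum(w.sum(axis=1), 1e-12)

    def profile_criterion(beta_free, X, y, h):
        """Profile least squares with beta = (1, beta_free) / norm as an identifiability convention."""
        beta = np.concatenate(([1.0], beta_free))
        beta = beta / np.linalg.norm(beta)
        fitted = loo_kernel_fit(X @ beta, y, h)
        return np.mean((y - fitted) ** 2)

    def fit_single_index(X, y, h=0.3):
        p = X.shape[1]
        res = minimize(profile_criterion, np.zeros(p - 1), args=(X, y, h), method="Nelder-Mead")
        beta = np.concatenate(([1.0], res.x))
        return beta / np.linalg.norm(beta)

    # Illustrative data: y = f(X beta) + noise with a sine link (an assumption).
    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 4))
    beta_true = np.array([1.0, 0.5, 0.0, 0.0]) / np.linalg.norm([1.0, 0.5, 0.0, 0.0])
    y = np.sin(X @ beta_true) + 0.1 * rng.normal(size=300)
    print(fit_single_index(X, y), beta_true)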
Seminar page
Bayesian k-nearest neighbour classification.
The k-nearest neighbour procedure uses a training dataset {(y_1, x_1), ..., (y_n, x_n)} to make
predictions on new unlabelled data, where y_i ∈ {C_1, ..., C_G} denotes the class label of the i-th
point and x_i denotes a vector of p predictor variables. The prediction for a new point
(y_{n+1}, x_{n+1}) is reported as the most common class found amongst the k nearest neighbours of
x_{n+1} in the set {x_1, ..., x_n}. The neighbours of a point are defined via a distance metric
ρ(x_{n+1}, x_i), which is commonly taken to be the Euclidean norm. The k-nearest-neighbour
algorithm is a nonparametric procedure. Traditionally, the value of k is chosen by minimizing the
cross-validated misclassification rate. Here we propose some alternative approaches using probability
models and Bayesian perspectives.
A novel perfect sampling algorithm for resolving difficulties linked with
MRF constants is introduced on the way. (This is joint work with G. Celeux,
J.-M. Marin and D.M. Titterington.)
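The classical procedure described above, together with the traditional cross-validated choice of k, can be sketched directly (the Bayesian and perfect-sampling extensions from the talk are not reproduced). The Euclidean metric and toy data are assumptions.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=5):
        """Classify x_new by majority vote among its k nearest training points
        under the Euclidean distance."""
        dists = np.linalg.norm(X_train - x_new, axis=1)
        neighbours = y_train[np.argsort(dists)[:k]]
        return Counter(neighbours).most_common(1)[0][0]

    def cv_misclassification(X, y, k, n_folds=5, rng=None):
        """Cross-validated misclassification rate, the traditional criterion for choosing k."""
        rng = np.random.default_rng(rng)
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, n_folds)
        errors = 0
        for f in folds:
            train = np.setdiff1d(idx, f)
            preds = np.array([knn_predict(X[train], y[train], X[i], k) for i in f])
            errors += np.sum(preds != y[f])
        return errors / len(y)

    # Toy two-class data (an assumption for illustration).
    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.repeat([0, 1], 50)
    best_k = min(range(1, 16, 2), key=lambda k: cv_misclassification(X, y, k, rng=5))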
Seminar page
Gauss-Seidel Estimation of Generalized Linear Mixed Models with Application to Poisson
Modeling of Spatially Varying Disease Rates.
Generalized linear mixed models (GLMMs) provide an elegant framework for the analysis of correlated data.
Due to the non-closed form of the likelihood, GLMMs are often fit by computational procedures like penalized
quasi-likelihood (PQL). Special cases of these models are generalized linear models (GLMs), which are often
fit using algorithms like iterative weighted least squares (IWLS). High computational costs and memory space
constraints often make it difficult to apply these iterative procedures to data sets with a very large
number of cases.
We propose a computationally efficient strategy based on the Gauss-Seidel algorithm that iteratively fits
sub-models of the GLMM to subsetted versions of the data. Additional gains in efficiency are achieved for
Poisson models, commonly used in disease mapping problems, because of their special collapsibility property,
which allows data reduction through summaries. The strategy is applied to investigate the relationship between
ischemic heart disease, socioeconomic status and age/gender category in New South Wales, Australia, based on
outcome data consisting of approximately 33 million records.
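A hedged sketch of the Gauss-Seidel idea for a plain Poisson GLM (the full GLMM machinery and the collapsibility-based data reduction are not reproduced): the coefficient vector is split into blocks, and each block is refit with the other blocks held fixed through an offset. The block structure, data, and convergence rule are assumptions.

    import numpy as np
    import statsmodels.api as sm

    def gauss_seidel_poisson(y, X_blocks, n_iter=20, tol=1e-6):
        """Iteratively refit one coefficient block at a time, holding the others
        fixed via the offset, until the linear predictor stabilizes."""
        coefs = [np.zeros(X.shape[1]) for X in X_blocks]
        eta_old = np.zeros(len(y))
        for _ in range(n_iter):
            for j, Xj in enumerate(X_blocks):
                offset = sum(X @ b for i, (X, b) in enumerate(zip(X_blocks, coefs)) if i != j)
                fit = sm.GLM(y, Xj, family=sm.families.Poisson(), offset=offset).fit()
                coefs[j] = fit.params
            eta = sum(X @ b for X, b in zip(X_blocks, coefs))
            if np.max(np.abs(eta - eta_old)) < tol:
                break
            eta_old = eta
        return coefs

    # Toy data: an intercept block, a block of two covariates, and a block of one covariate.
    rng = np.random.default_rng(6)
    n = 2000
    x1, x2, x3 = rng.normal(size=(3, n))
    mu = np.exp(0.2 + 0.5 * x1 - 0.3 * x2 + 0.4 * x3)
    y = rng.poisson(mu)
    blocks = [np.ones((n, 1)), np.column_stack([x1, x2]), x3.reshape(-1, 1)]
    coef_blocks = gauss_seidel_poisson(y, blocks)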
Seminar page
Directional Regression for Dimension Reduction.
Dimensionality is a major concern in many modern statistical problems. In
regression analysis, dimension reduction means reducing the dimension of the predictors
without loss of information about the regression. Dimension reduction proves to be
particularly useful during the model development and criticism phases, as it usually
does not require any pre-specified parametric model for the regression. We propose a
Directional Regression (DR) method for dimension reduction. This novel method
naturally synthesizes dimension reduction methods based on the first two conditional
moments, such as Sliced Inverse Regression (SIR) and Sliced Average Variance
Estimation (SAVE), and in doing so combines the advantages of these methods. Under
mild conditions, it provides an exhaustive estimate of the Central Dimension Reduction
Subspace (CDRS). We also derive the asymptotic distribution of the Directional
Regression estimator, and from it establish a sequential test procedure to
determine the dimension of the CDRS. Directional Regression is compared with existing
methods via simulation. An application to a handwritten digit recognition problem is also presented.
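A brief sketch of Sliced Inverse Regression, one of the first-moment methods that Directional Regression synthesizes (DR's own candidate matrix is not reproduced here); the slicing scheme and toy data are assumptions.

    import numpy as np

    def sir_directions(X, y, n_slices=10, d=1):
        """Sliced Inverse Regression: standardize X, average it within slices of y,
        eigendecompose the weighted covariance of the slice means, and map the top
        d eigenvectors back to the original predictor scale."""
        n, p = X.shape
        mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
        cov_inv_sqrt = np.linalg.inv(np.linalg.cholesky(cov)).T
        Z = (X - mu) @ cov_inv_sqrt                   # standardized predictors
        order = np.argsort(y)
        slices = np.array_split(order, n_slices)
        M = np.zeros((p, p))
        for s in slices:
            m = Z[s].mean(axis=0)
            M += (len(s) / n) * np.outer(m, m)        # weighted covariance of slice means
        eigvals, eigvecs = np.linalg.eigh(M)
        directions = cov_inv_sqrt @ eigvecs[:, ::-1][:, :d]
        return directions / np.linalg.norm(directions, axis=0)

    # Toy single-direction example (an assumption for illustration).
    rng = np.random.default_rng(7)
    X = rng.normal(size=(500, 5))
    y = (X @ np.array([1.0, -1.0, 0.0, 0.0, 0.0])) ** 3 + 0.5 * rng.normal(size=500)
    print(sir_directions(X, y, d=1).ravel())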
Seminar page
Objective Bayes Variable Selection: Some Methods and Some Theory.
A fully automatic Bayesian procedure for variable selection in the normal regression model
has been developed, which uses the posterior probabilities of the models to drive a
stochastic search. The posterior probabilities are computed using intrinsic priors,
which are default priors as they are derived from the model structure and are free from
tuning parameters. The stochastic search is based on a Metropolis-Hastings algorithm
with a stationary distribution proportional to the model posterior probabilities. The
performance of the search procedure is illustrated on both simulated and real examples,
where it is seen to perform admirably. However, until recently such data-based evaluations
were the only performance evaluations available.
It has long been known that for the pairwise comparison of nested models, a decision based
on the Bayes factor produces a consistent model selector (in the frequentist sense), and we
are now able to extend this result and show that for a wide class of prior distributions,
including intrinsic priors, the corresponding Bayesian procedure for variable selection in
normal regression is consistent in the entire class of normal linear models. The asymptotics
of the Bayes factors for intrinsic priors are equivalent to those of the Schwarz (BIC) criterion,
and allow us to examine what limiting forms of proper prior distributions are suitable for
testing problems. Intrinsic priors are limits of proper prior distributions, and a consequence
of our results is that they avoid Lindley's paradox. The asymptotics further allow us to examine
some selection properties of the intrinsic Bayes rules, where it is seen that simpler models are
clearly preferred. (This is joint work with Javier Girón and Elías Moreno.)
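A hedged sketch of the stochastic search component, using a BIC approximation in place of the intrinsic-prior posterior probabilities (which, as noted above, are asymptotically equivalent): models are binary inclusion vectors, a Metropolis-Hastings move flips one coordinate, and acceptance uses the ratio of approximate posterior model probabilities. The proposal, the BIC stand-in, and the toy data are assumptions.

    import numpy as np

    def bic(y, X, gamma):
        """BIC of the normal linear model using the columns selected by gamma
        (plus an intercept); exp(-BIC/2) stands in for the posterior model probability."""
        n = len(y)
        Xg = np.column_stack([np.ones(n), X[:, gamma.astype(bool)]])
        beta, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        rss = np.sum((y - Xg @ beta) ** 2)
        return n * np.log(rss / n) + Xg.shape[1] * np.log(n)

    def mh_model_search(y, X, n_iter=5000, rng=None):
        """Metropolis-Hastings walk over inclusion vectors: propose flipping one
        coordinate and accept with probability min(1, exp((BIC_old - BIC_new)/2))."""
        rng = np.random.default_rng(rng)
        p = X.shape[1]
        gamma = np.zeros(p, dtype=int)
        current = bic(y, X, gamma)
        visits = {}
        for _ in range(n_iter):
            prop = gamma.copy()
            j = rng.integers(p)
            prop[j] = 1 - prop[j]
            cand = bic(y, X, prop)
            if np.log(rng.uniform()) < (current - cand) / 2.0:
                gamma, current = prop, cand
            key = tuple(gamma)
            visits[key] = visits.get(key, 0) + 1
        return max(visits, key=visits.get)  # most visited model

    # Toy example: only the first two predictors matter (an assumption).
    rng = np.random.default_rng(8)
    X = rng.normal(size=(200, 8))
    y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)
    print(mh_model_search(y, X, rng=8))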
Seminar page
Joint Models for the Association of Longitudinal Binary and Continuous Processes with
Application to a Smoking Cessation Trial.
Joint models for the association of a longitudinal binary and a longitudinal continuous process
are proposed for situations where their association is of direct interest. The models are
parameterized such that the dependence between the two processes is characterized by unconstrained
regression coefficients. Bayesian variable selection techniques are used to parsimoniously model
these coefficients. An MCMC sampling algorithm is developed for sampling from the posterior
distribution, using data augmentation steps to handle missing data. Several technical issues are
addressed to implement the MCMC algorithm efficiently. The models are motivated by, and are used
for, the analysis of a smoking cessation clinical trial in which an important question of interest
was the effect of the (exercise) treatment on the relationship between smoking cessation and weight
gain. (This is joint work with Xuefeng Liu.)
Seminar page
Gene Expression Quantitative Trait Loci Analysis.
Quantitative Trait Loci (QTL) analysis has changed the study of the genetic basis of complex traits.
With the availability of dense genome-wide molecular markers, it is now possible to map many
biologically, medically, or agriculturally important QTL to genomic regions for further study.
However, QTL studies face a major limitation: the genomic regions for many mapped QTL are too large
for the identification of causal genes. Recently, gene expression microarray technology has been used
in combination with molecular markers to help with gene identification and to elucidate the genetic
pathways underlying complex traits. In this talk, we will outline the statistical
problems and discuss challenges for gene expression QTL analysis.
In particular, we will discuss our recent research results on multiple
interval mapping for eQTL and eQTL Viewer.
Seminar page
A Bayesian Approach to False Discovery Rate for Large Scale Simultaneous Inference.
Microarray data and other applications have inspired many recent developments in the area of large
scale inference. For microarray data, the number of tests of differential gene expression ranges
from 1,000 to 100,000. The traditional family-wise type I error rate (FWER) is over-stringent in this
context because of the large number of simultaneous tests. More recently, the false discovery rate (FDR)
was defined as the expected proportion of type I errors among the rejections. Controlling the less
stringent FDR criterion incurs less loss in detection capability than controlling the FWER and hence is
preferable for large-scale multiple testing. From the Bayesian point of view, the posterior versions of
the FDR and of the false nondiscovery rate (FNR) are easier to study. We study Bayesian decision rules
that control the Bayes FDR and FNR. A hierarchical mixture model is developed to estimate the posterior
probabilities of the hypotheses. The posterior distribution can also be used to estimate the false
discovery percentage (FDP), defined as the integrand of the FDR. The model, in conjunction with the
Bayesian decision rules, displays satisfactory performance in simulations and in the analysis of the
Affymetrix HG-U133A spike-in data.
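A minimal sketch of the kind of posterior decision rule involved (the hierarchical mixture model itself is not reproduced): given posterior probabilities that each null hypothesis is true, reject the hypotheses with the smallest posterior null probabilities whose running average stays below the target level, and report that average as the estimated Bayes FDR. The two-group model used to generate posterior probabilities below is an assumption.

    import numpy as np
    from scipy import stats

    def bayes_fdr_rejections(post_null, alpha=0.05):
        """Reject the hypotheses with the smallest posterior null probabilities
        such that the average posterior null probability among the rejections
        (the Bayes FDR) stays at or below alpha."""
        order = np.argsort(post_null)
        running_fdr = np.cumsum(post_null[order]) / np.arange(1, len(post_null) + 1)
        n_reject = int(np.sum(running_fdr <= alpha))
        rejected = order[:n_reject]
        est_fdr = running_fdr[n_reject - 1] if n_reject > 0 else 0.0
        return rejected, est_fdr

    # Illustrative two-group model: z-scores are N(0,1) under the null and N(3,1)
    # for the 10% of non-null genes; the posterior null probabilities below assume
    # the components and the mixing weight are known (an assumption).
    rng = np.random.default_rng(9)
    m, pi0 = 5000, 0.9
    is_null = rng.uniform(size=m) < pi0
    z = np.where(is_null, rng.normal(0, 1, m), rng.normal(3, 1, m))
    f0, f1 = stats.norm.pdf(z, 0, 1), stats.norm.pdf(z, 3, 1)
    post_null = pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)

    rejected, est_fdr = bayes_fdr_rejections(post_null, alpha=0.05)
    true_fdp = np.mean(is_null[rejected]) if len(rejected) else 0.0
    print(len(rejected), est_fdr, true_fdp)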
Seminar page
Methods for Evaluating and Correcting Selection Bias Using Two-stage Weighted Proportional Hazards Models.
In non-randomized biomedical studies using the proportional hazards model, the observed data often constitute
a biased (i.e., unrepresentative) sample of the underlying target population, resulting in biased regression
coefficients. The bias can be corrected by weighting included subjects by the inverse of their respective
selection probabilities, as proposed by Horvitz and Thompson (1952) and extended to the proportional hazards
setting for use in surveys by Binder (1992) and Lin (2000). The weights can be treated as fixed in cases where
they are known (e.g., chosen by the investigator) or based on voluminous data (e.g., a large-scale survey).
However, in many practical applications, the weights are estimated and must be treated as such in order for the
resulting inference to be accurate. We propose a two-stage weighted proportional hazards model in which, at the
first stage, weights are estimated through a logistic regression model fitted to a representative sample from the
target population. At the second stage, a weighted Cox model is fitted to the biased sample. We propose estimators
for the regression parameter and cumulative baseline hazard. Asymptotic properties of the parameter estimators
are derived, accounting for the difference in the variance introduced by the randomness of the weights. The
accuracy of the asymptotic approximations in finite samples is evaluated through simulation. The proposed
estimation methods are applied to kidney transplant data to quantify the true risk of graft failure associated
with expanded criteria donors (ECD). Although parameter estimation is consistent and potentially more efficient
using the proposed weighting method, computation is considerably more intensive than that for the unweighted model.
We therefore propose methods for evaluating bias in the unweighted partial likelihood and Breslow-Aalen estimators.
Asymptotic properties of the proposed test statistics are derived, and the finite-sample significance level and
power are evaluated through simulation. The proposed methods are then applied to data from a national organ
failure registry to evaluate the bias in a post-kidney-transplant survival model.
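A hedged sketch of the two-stage idea (the variance corrections for estimated weights, which are the main methodological contribution here, are not reproduced): first estimate selection probabilities with a logistic regression on a representative sample, then fit an inverse-probability-weighted Cox model with robust standard errors using lifelines. The column names and data-generating mechanism are assumptions.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from lifelines import CoxPHFitter

    # Stage 0: simulate a target population with one covariate and biased selection
    # into the analysis sample (all settings are illustrative assumptions).
    rng = np.random.default_rng(10)
    N = 5000
    x = rng.normal(size=N)
    t = rng.exponential(1.0 / np.exp(0.7 * x))
    c = rng.exponential(2.0, N)
    sel_prob = 1.0 / (1.0 + np.exp(-(-0.5 + 1.2 * x)))   # selection depends on x
    selected = rng.uniform(size=N) < sel_prob
    pop = pd.DataFrame({"time": np.minimum(t, c), "event": (t <= c).astype(int),
                        "x": x, "selected": selected.astype(int)})

    # Stage 1: estimate selection probabilities by logistic regression on the
    # representative sample (here, the full population frame).
    logit = sm.Logit(pop["selected"], sm.add_constant(pop[["x"]])).fit(disp=0)
    pop["w"] = 1.0 / logit.predict(sm.add_constant(pop[["x"]]))

    # Stage 2: weighted Cox model on the biased (selected) sample, with a robust
    # variance as a first approximation to accounting for the weights.
    sample = pop[pop["selected"] == 1][["time", "event", "x", "w"]]
    cph = CoxPHFitter().fit(sample, duration_col="time", event_col="event",
                            weights_col="w", robust=True)
    print(cph.summary[["coef", "se(coef)"]])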
Seminar page
Stein Estimation and Prediction: A Synthesis.
Stein (1956, Proc. 3rd Berkeley Symposium), in his seminal paper, came up
with the surprising discovery that the sample mean is an inadmissible
estimator of the population mean in three or higher dimensions under squared
error loss. The past five decades have witnessed multiple extensions and
variations of Stein's results. Extension of Stein's results to prediction
problems is of more recent origin, beginning with Komaki (2001, Biometrika),
George, Liang and Yu (2006, Annals of Statistics) and Ghosh, Mergel and
Datta (2006). The present article shows how both the estimation and prediction
problems go hand in hand under certain "intrinsic losses," which
include both the Kullback-Leibler and Bhattacharyya-Hellinger divergence
losses. The estimators dominating the sample mean under such losses are
motivated from both the Bayesian and empirical Bayes points of view.
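For concreteness, the classical squared-error result can be illustrated with the positive-part James-Stein estimator, which shrinks the sample mean toward the origin; the simulation settings are assumptions, and the intrinsic-loss estimators from the talk are not reproduced.

    import numpy as np

    def james_stein(x_bar, n, sigma2=1.0):
        """Positive-part James-Stein estimator of a p-dimensional mean (p >= 3),
        shrinking the sample mean toward the origin."""
        p = len(x_bar)
        shrink = 1.0 - (p - 2) * sigma2 / (n * np.sum(x_bar ** 2))
        return max(shrink, 0.0) * x_bar

    # Compare risks under squared error loss by simulation (settings are assumptions).
    rng = np.random.default_rng(11)
    p, n, theta = 10, 5, np.full(10, 0.5)
    risk_mean = risk_js = 0.0
    for _ in range(2000):
        x = rng.normal(theta, 1.0, size=(n, p))
        x_bar = x.mean(axis=0)
        risk_mean += np.sum((x_bar - theta) ** 2)
        risk_js += np.sum((james_stein(x_bar, n) - theta) ** 2)
    print(risk_mean / 2000, risk_js / 2000)   # the James-Stein risk is smaller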
Seminar page
Statistical Challenges in Assessing the Relationship Between Environmental Impacts and Health Outcomes.
Citizens are increasingly interested in understanding how the environment
impacts their health. To address these concerns, a nationwide Environmental
Public Health Tracking program has been created. This, and many other efforts
to relate environmental and health outcomes, depend largely on the synthesis
of existing data sets; little new data are being collected for this purpose.
Generally, the environmental, health, and socio-demographic data needed in
such studies have been collected for different geographic or spatial units.
Further, the unit of interest may be different from the sampling units. Once
a common spatial scale has been established for the analysis, the question as
to how best to model the relationship between environmental impacts and public
health must be addressed. In this paper, these and other statistical challenges
of relating environmental impacts to public health will be discussed. Efforts
to model the relationship between myocardial infarction and air quality in Florida
will illustrate the challenges and potential solutions.
Seminar page
Statistical Methods in Functional Medical Imaging.
In this talk we consider statistical methods in functional
medical imaging. Functional imaging techniques, such as positron
emission tomography, single photon emission computed tomography and
functional magnetic resonance imaging, allow for in vivo imaging as the
human body functions. The unique nature of the high-volume data
generated by these techniques gives rise to interesting statistical
problems. In this talk we will discuss two statistical methodological
developments in functional imaging. In the first, we consider an
application in kinetic imaging of the colon to experimentally understand
the mechanics and penetration of a microbicide lubricant. A novel
application of fitting three-dimensional statistical curves via a
modified principal curve algorithm is illustrated to solve the relevant
problem. We will show how the algorithm was tested and debugged on a
battery of challenging two-dimensional shapes. The second experiment
considers a functional magnetic resonance imaging study of pre-clinical
patients at high risk for Alzheimer's disease. Here a novel Bayesian
multilevel approach is used to consider paradigm-related connectivity
within and between anatomically defined regions of interest. Both
examples will be geared towards a general audience without specialized
background knowledge in medical imaging statistics.
Seminar page