Unconstrained Models for the Covariance Structure of Multivariate Longitudinal Data.
The constraint that a covariance matrix must be positive definite presents difficulties for modeling
its structure. In a series of papers published in 1999 and 2000, Mohsen Pourahmadi proposed a
parameterization of the covariance matrix for univariate longitudinal data in which the parameters
are unconstrained. This unconstrained parameterization is based on the modified Cholesky decomposition
of the inverse of the covariance matrix into a function of a unique unit lower triangular matrix with
no constraints on its non-trivial elements and a unique diagonal matrix with positive diagonal
entries. The positivity constraint is removed by taking logarithms of the diagonal elements.
We extend this idea to multivariate longitudinal data.
We develop a modified Cholesky block decomposition that provides an unconstrained parameterization for
the covariance matrix, and we propose parsimonious models within this parameterization. A Newton-Raphson
algorithm is developed for obtaining maximum likelihood estimators of model parameters, assuming that
the observations are normally distributed. The results, along with penalized likelihood criteria
such as BIC for model selection, are illustrated using a real multivariate longitudinal data set
and a simulated data set.
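The unconstrained parameterization in the univariate case can be sketched numerically as follows (my own illustration in Python with NumPy; the variable names are not the authors'): compute the modified Cholesky factors T and D of a covariance matrix, and note that the strictly lower-triangular entries of T together with log D are free parameters that always map back to a valid covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4.0 * np.eye(4)          # a positive definite covariance matrix

# Modified Cholesky: unit lower triangular T and diagonal D with T @ Sigma @ T.T = D
L = np.linalg.cholesky(Sigma)              # Sigma = L @ L.T
L_unit = L @ np.linalg.inv(np.diag(np.diag(L)))
T = np.linalg.inv(L_unit)                  # unit lower triangular
D = np.diag(L) ** 2                        # positive "innovation" variances

# Unconstrained parameters: strictly lower entries of T, and log(D)
phi = T[np.tril_indices(4, k=-1)]          # no sign or magnitude constraints
log_d = np.log(D)                          # logarithm removes the positivity constraint

# Any (phi, log_d) in R^6 x R^4 maps back to a valid covariance matrix:
T_new = np.eye(4)
T_new[np.tril_indices(4, k=-1)] = phi
T_inv = np.linalg.inv(T_new)
Sigma_back = T_inv @ np.diag(np.exp(log_d)) @ T_inv.T
```

Parsimonious models can then be imposed by regressing the entries of phi and log_d on covariates such as time lag, without ever risking a non-positive-definite fit.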
Regulatory versus Research Studies: A Statistical Perspective.
The pharmaceutical industry is a major source of career opportunities for statisticians in today's market.
A statistician in this industry has the potential to make significant contributions along the pathways of
discovery and development that, when traversed successfully, bring molecular entities to the marketplace.
The objectivity with which statisticians approach a problem can be particularly valuable in an industry
where so much is at stake based on the outcomes of studies. The role of the statistician varies at the
different stages of product development and may differ substantially from the role served on a research
study being carried out in a non-regulatory environment. A brief overview of the drug development process
will be provided in this seminar, with particular emphasis on the statistician's involvement in the process.
Aspects of study conduct, including data collection and management, clinical site monitoring, and data
structures needed to support product registration, will be described. Examples of statistical issues that
are frequently encountered in drug development and ways of addressing them will also be discussed briefly,
including multiplicity, missing data, and covariate adjustment. Career opportunities in the pharmaceutical
and related industries will be discussed, time permitting.
Statistical Methods for Mapping Reciprocal Effects and Application in /Zea mays L./
Reciprocal effects are due to effects of the parents (i.e. maternal and paternal effects), cytoplasmic
effects, and parent-of-origin effects. However, determining the extent to which reciprocal effects
exist, attributing them to specific underlying components, and mapping them requires new analytic
approaches. We develop a statistical analysis to identify and map the contributions of specific nuclear
chromosomal regions to reciprocal effects. These methods are then applied to a case study in /Zea mays L./
Saddlepoint-Based Bootstrap Inference for Nonlinear Regression Models.
We propose a novel method for making small sample inference on the nonlinear parameter in a conditionally
linear nonlinear regression model. A parametric bootstrap method is developed where Monte Carlo simulation
is replaced by saddlepoint approximation. Saddlepoint approximations to the distribution of the estimating
equation whose unique root is the parameter's maximum likelihood estimator (MLE) are obtained, while
substituting conditional MLEs for the remaining (nuisance) parameters. A key result of Daniels (1983)
enables us to relate these approximations to those for the estimator of interest. The approach may also
be viewed as a form of bootstrap score inference, in which saddlepoint approximations for the distribution
of a score test statistic, under a family of bootstrap distributions instead of the asymptotic normal
distribution, are inverted to produce more accurate small sample confidence bounds. The method's performance
relies on a model reparametrization that orthogonalizes the nonlinear parameter with respect to the
nuisance parameters, thereby also validating the substitution of conditional MLEs for the latter.
Confidence intervals produced by the method are shown to have coverage errors of order O(n^{-1/2}),
with an error rate that is reduced by the orthogonalizing parameterization. The methodology is shown to be
applicable also for inference on ratios of regression parameters in ordinary linear models. Simulations from
some celebrated examples show that the proposed method yields confidence intervals with lengths and coverage
probabilities that compare favorably with those from several competing methods.
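The core ingredient, a saddlepoint density approximation, can be illustrated on a textbook case (this is my own generic example, not the authors' construction): approximating the density of the mean of n iid Exponential(1) variables from the cumulant generating function, with the saddlepoint equation K'(s) = x̄ solved numerically, in the same spirit as finding the root of an estimating equation.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gamma

# Cumulant generating function of Exponential(1) and its derivatives: K(s) = -log(1 - s), s < 1
K = lambda s: -np.log(1.0 - s)
K1 = lambda s: 1.0 / (1.0 - s)
K2 = lambda s: 1.0 / (1.0 - s) ** 2

n = 10
def saddlepoint_density(xbar):
    # solve the saddlepoint equation K'(s) = xbar numerically
    s_hat = brentq(lambda s: K1(s) - xbar, -50.0, 1.0 - 1e-10)
    return np.sqrt(n / (2 * np.pi * K2(s_hat))) * np.exp(n * (K(s_hat) - s_hat * xbar))

approx = saddlepoint_density(1.0)
exact = gamma.pdf(1.0, a=n, scale=1.0 / n)   # the mean of n Exp(1) is exactly Gamma(n, scale 1/n)
```

Even at n = 10 the approximation agrees with the exact density to within about one percent, which is the kind of small sample accuracy that motivates replacing Monte Carlo simulation with saddlepoint approximation.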
Statistical Coding in Motor Cortex.
Effective neural motor prostheses require a method for decoding neural activity representing desired movement.
In particular, the accurate reconstruction of a continuous motion signal is necessary for the control of devices
such as computer cursors, robots, or a patient's own paralyzed limbs. In this talk, I will present our real-time
system for such applications that uses statistical Bayesian inference techniques to estimate hand motion from the
firing rates of multiple neurons in a monkey's primary motor cortex. The Bayesian model is formulated in terms of
the product of a likelihood and a prior. The likelihood term models the probability of neural firing rates given
a particular hand motion. The prior term defines a probabilistic model of hand kinematics. Decoding was performed
using a Kalman filter as well as a more sophisticated Switching Kalman filter. Off-line reconstructions of hand
trajectories were relatively accurate and an analysis of these results provides insights into the nature of neural
coding. Furthermore, I will show on-line neural control results in which a monkey exploits the Kalman filter to
move a computer cursor with its brain.
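The likelihood-plus-prior decoding idea can be sketched with a generic linear-Gaussian Kalman filter on simulated data (the matrices, dimensions, and noise levels below are invented for illustration and are not taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
T, dim_x, dim_z = 100, 2, 5                  # time steps, kinematic states, "neurons"
A = np.array([[1.0, 0.1], [0.0, 0.95]])      # prior: linear hand-state dynamics
W = 0.01 * np.eye(dim_x)                     # process noise covariance
H = rng.standard_normal((dim_z, dim_x))      # likelihood: firing rates = H @ state + noise
Q = 0.5 * np.eye(dim_z)                      # observation noise covariance

# simulate hand states and neural firing rates from the model
x = np.zeros((T, dim_x)); z = np.zeros((T, dim_z))
for t in range(1, T):
    x[t] = A @ x[t - 1] + rng.multivariate_normal(np.zeros(dim_x), W)
    z[t] = H @ x[t] + rng.multivariate_normal(np.zeros(dim_z), Q)

# Kalman filter: recursively fuse the dynamic prior with the firing-rate likelihood
xhat, P = np.zeros(dim_x), np.eye(dim_x)
est = np.zeros((T, dim_x))
for t in range(T):
    xpred = A @ xhat                                       # predict from the prior model
    Ppred = A @ P @ A.T + W
    K = Ppred @ H.T @ np.linalg.inv(H @ Ppred @ H.T + Q)   # Kalman gain
    xhat = xpred + K @ (z[t] - H @ xpred)                  # update with observed rates
    P = (np.eye(dim_x) - K @ H) @ Ppred
    est[t] = xhat
```

The switching Kalman filter mentioned in the talk generalizes this by allowing the dynamics and observation models to switch among several regimes over time.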
A Theoretical Comparison of the Data Augmentation, Marginal Augmentation and PX-DA Algorithms.
The data augmentation (DA) algorithm is a widely used MCMC algorithm that is based on a Markov transition
density of the form

p(x|x′) = ∫_Y f_{X|Y}(x|y) f_{Y|X}(y|x′) dy.

The PX-DA algorithm of Liu & Wu (1999, JASA) and the marginal augmentation (MA) algorithm of Meng &
van Dyk (1999, Bka) are alternatives to DA that often converge much faster and are only slightly more
computationally demanding. The Markov transition densities of these alternative algorithms can be written
in the form

p_R(x|x′) = ∫_Y ∫_Y f_{X|Y}(x|y) q(y|y′) f_{Y|X}(y′|x′) dy dy′,

where q is a Markov transition density on Y. We show that, under regularity conditions,
p_R is more efficient than p in the sense that asymptotic variances
in the central limit theorem under p_R are never larger than those under p.
These results are brought to bear on a theoretical comparison of the DA, MA and PX-DA algorithms. As an example,
we compare Albert & Chib's (1993, JASA) DA algorithm for Bayesian probit regression with the alternative
PX-DA algorithm developed by Liu & Wu. (This is joint work with Dobrin Marchev, Baruch College (CUNY) and
Vivekananda Roy, University of Florida).
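For concreteness, here is a minimal sketch of the Albert & Chib (1993) DA sampler for Bayesian probit regression, under a flat prior on the coefficients and with simulated data of my own choosing. The Y-step draws the latent Gaussian variables given the current coefficients; the X-step draws coefficients given the latents.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([0.5, -1.0])
y = (X @ beta_true + rng.standard_normal(n) > 0).astype(int)

XtX_inv = np.linalg.inv(X.T @ X)
chol = np.linalg.cholesky(XtX_inv)
beta = np.zeros(p)
draws = []
for it in range(2000):
    # Y-step: latent z_i ~ N(x_i'beta, 1), truncated to z_i > 0 if y_i = 1, z_i < 0 if y_i = 0
    mu = X @ beta
    lo = np.where(y == 1, -mu, -np.inf)    # bounds for the standardized variable z_i - mu_i
    hi = np.where(y == 1, np.inf, -mu)
    z = mu + truncnorm.rvs(lo, hi, random_state=rng)
    # X-step: beta | z ~ N((X'X)^{-1} X'z, (X'X)^{-1}) under a flat prior
    beta = XtX_inv @ (X.T @ z) + chol @ rng.standard_normal(p)
    if it >= 500:                          # discard burn-in
        draws.append(beta.copy())
post_mean = np.mean(draws, axis=0)
```

The PX-DA variant of Liu & Wu inserts an extra, inexpensive scale move between these two steps, which is the transition-density modification q analyzed in the talk.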
Detecting Differentially-Expressed Time Course Gene Expression Profiles.
Among the large amounts of high-throughput biological data, time course gene expression profiles can reveal
important dynamic features of cell activities. Yet relatively little effort has been devoted to
the key question of detecting differentially expressed time course gene expression profiles. One
reason may be that the experimental designs for time course gene expression data are not consistent
across subjects (e.g., sampling rates vary, and the total number of time points sampled per subject
is often small). We present a statistical method for detecting differential expression in time
course data, which can be applied when there are only a few time points per subject or when the
time grid is irregular. The idea of our method is to integrate the newly developed principal
analysis through conditional expectation method and a nonparametric bootstrap into a hypothesis
testing framework. In doing so, each gene is assigned a p-value pertaining to whether it is
differentially expressed. Simulations and analysis of C. elegans data indicated that the method
performed better than two-way ANOVA in identifying differentially expressed genes in the dauer
developmental process.
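The bootstrap p-value computation can be illustrated generically for a single gene (toy data of my own; this sketch omits the functional principal component step and uses a simple mean-difference statistic):

```python
import numpy as np

rng = np.random.default_rng(3)
# hypothetical expression measurements for one gene in two conditions (toy data)
y_ctrl = rng.normal(0.0, 1.0, 15)
y_trt = rng.normal(0.8, 1.0, 15)

def stat(a, b):
    # a simple summary of differential expression; real use would compare fitted curves
    return abs(a.mean() - b.mean())

obs = stat(y_ctrl, y_trt)
pooled = np.concatenate([y_ctrl, y_trt])    # pooling imposes the null of no difference
B = 2000
null = np.empty(B)
for b in range(B):
    s = rng.choice(pooled, size=pooled.size, replace=True)
    null[b] = stat(s[:15], s[15:])
p_value = (1 + np.sum(null >= obs)) / (B + 1)   # bootstrap p-value for this gene
```

Repeating this per gene yields the gene-level p-values described in the abstract, with the test statistic replaced by one built from the estimated expression curves.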
Detecting Related Genes Using Bayesian Model-Based Clustering.
With microarray technology becoming one of the most popular tools in genetic research, the demand
for clustering methods has increased. In this paper, we propose a method to detect a small
number of related genes based on longitudinally observed gene expressions. For many reasons,
conventional techniques such as k-means or hierarchical clustering cannot provide
satisfactory results. First, these methods do not differentiate between predictors and response
variables. Second, even though a goal of cluster analysis is to select a small number of genes that
can be investigated further in future research, most clustering algorithms tend to provide
large clusters as the end product. To overcome the limits of conventional methods, we use a
Bayesian model-based method that detects clusters using a linear mixed model,
and identifies related genes within those clusters using relevance probabilities.
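Model-based clustering in general assigns each observation a posterior membership probability rather than a hard label, which is what makes a small, well-separated cluster recoverable. A minimal sketch with a two-component Gaussian mixture fit by EM (toy one-dimensional gene scores of my own, not the authors' linear mixed model):

```python
import numpy as np

rng = np.random.default_rng(5)
# hypothetical per-gene scores: a large background cluster plus a small "related" cluster
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 0.5, 20)])

# EM for a two-component Gaussian mixture; r holds membership probabilities
pi = 0.5
mu = np.array([x.min(), x.max()])
sd = np.array([1.0, 1.0])
for _ in range(200):
    d0 = (1 - pi) * np.exp(-0.5 * ((x - mu[0]) / sd[0]) ** 2) / sd[0]
    d1 = pi * np.exp(-0.5 * ((x - mu[1]) / sd[1]) ** 2) / sd[1]
    r = d1 / (d0 + d1)                       # P(gene belongs to the small cluster)
    pi = r.mean()
    mu = np.array([np.sum((1 - r) * x) / np.sum(1 - r), np.sum(r * x) / np.sum(r)])
    sd = np.array([np.sqrt(np.sum((1 - r) * (x - mu[0]) ** 2) / np.sum(1 - r)),
                   np.sqrt(np.sum(r * (x - mu[1]) ** 2) / np.sum(r))])
related = np.where(r > 0.5)[0]               # genes flagged by membership probability
```

The relevance probabilities in the abstract play the same role as r here: they quantify how strongly each gene belongs to the small cluster of interest, instead of forcing every gene into some large cluster.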
Measuring Dietary Intake.
Newspaper articles routinely report the results of epidemiological studies of the
relationship between what we eat and disease outcomes such as heart disease and
various forms of cancer. One of the most-quoted studies is the Nurses' Health Study,
which follows the health outcomes of 100,000 nurses and asks them questions about their
dietary intakes. While there are exceptions, for the most part one can find a
relationship between heart disease and diet, e.g., less fat, more fruits, etc. On
the other hand, it is rare that prospective epidemiological studies of human
populations find links between cancer and dietary intakes. Perhaps the most
controversial of all is the question of the relationship between dietary fat intake
and breast cancer. Countries with higher fat intakes tend to have higher rates of
breast cancer, and yet no epidemiological study has shown such a link. The puzzle of
course is to understand the discrepancy.
Obviously, the etiology of disease may explain why heart disease, with its
intermediate endpoints such as serum cholesterol, has confirmed links to nutrition
while the evidence is mixed with cancer. I will focus instead on a basic question of
study design: how do we measure what we eat? Try this out: how many days per year do
you eat apples? I am going to review the accumulating evidence that suggests that
with complex, subtle disease such as cancer, with no good intermediate endpoints
such as serum cholesterol for heart disease, finding links between disease and
nutrient intakes will be the exception rather than the rule, simply because of the
way diet is measured. I will close with remarks about the Women’s Health Initiative
Dietary Intervention Trial and two new cohort studies, along with my own views of the controversy.
General Semiparametric Analysis of Repeated Measures Data.
This talk considers the general problem where the data for an individual are
repeated measures in the most general sense, with a parametric component and a
nonparametric component. It is easy, although not well-known, to handle the problem
in the case that the nonparametric component of the likelihood function is evaluated
exactly once, e.g., when a baseline variable is modeled nonparametrically. Far more
difficult, and non-intuitive, is the case where the nonparametric component is
evaluated more than once in the likelihood function. Examples include repeated
measures studies, variance component models when the random effect is related to the
predictors, matched case-control studies with a nonparametric component,
fixed-effects models in econometrics, etc. I will present a constructive (i.e.,
computable), semiparametric efficient method for this general problem. The
constructive part is important: like most semiparametric efficient methods, there is
an integral equation lurking in the background, but unlike most such methods, in our
approach the integral equation can be avoided. An example involving caloric intake
and income in China is used to illustrate the methodology, as a means of contrasting
a random effects analysis and a fixed effects analysis.
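The flavor of combining a parametric and a nonparametric component can be seen in a generic Speckman-type partially linear regression sketch (my own illustration of the simple one-evaluation case, not the constructive efficient estimator of the talk):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
t = rng.uniform(0.0, 1.0, n)                         # covariate entering nonparametrically
x = rng.standard_normal(n) + np.sin(2 * np.pi * t)   # parametric predictor, correlated with t
y = 1.5 * x + np.cos(2 * np.pi * t) + 0.3 * rng.standard_normal(n)  # true beta = 1.5

def smooth(t, v, h=0.08):
    # Nadaraya-Watson kernel smoother of v on t (estimates the nonparametric component)
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    return (w @ v) / w.sum(axis=1)

# Partial out the smooth-in-t component from both y and x, then regress
# residual on residual to estimate the parametric coefficient.
y_res = y - smooth(t, y)
x_res = x - smooth(t, x)
beta_hat = (x_res @ y_res) / (x_res @ x_res)
```

The hard case described in the talk arises when the nonparametric component is evaluated repeatedly within each subject's likelihood contribution, where this simple partialling-out device no longer applies and an integral equation would ordinarily appear.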