DNA Microarrays and Statistical Challenges
Microarray technology is a breakthrough in genetic science: it can profile
gene expression patterns at a genome-wide scale in a single experiment. An
individual microarray experiment can yield information on the amount of RNA
transcribed from each of tens of thousands of genes, and a single study can
involve anywhere from one to hundreds of such experiments. This talk gives a
brief introduction to microarrays and describes some of the statistical
problems that arise in the area. Most of our examples relate to the use of
microarrays in cancer research and in the investigation of the yeast cell
cycle.
Sorting Periodically-Expressed Genes Using Microarray Data
The publicly available datasets from yeast are excellent starting
points to evaluate and develop statistical tools for handling
information from DNA microarray experiments. In this paper we
compare two statistical methods, Fourier analysis and singular
value decomposition (SVD), for analyzing gene expression ratios in
yeast cells as they progress through the cell cycle. We find that
analysis of array data using Fourier analysis and using SVD can be
carried out simply, without extensive data manipulations described
in previous papers. We propose a circular correlation method for comparing
different procedures for ordering cyclically expressed genes around a unit
circle. The results of applying this method reflect the close relationship
between the Fourier and SVD methods.
In addition, we develop a stochastic search algorithm that can be trained on
a set of genes known to be cell cycle-regulated, and then applied as a
systematic and data-driven method to classify genes into specific stages of
the cell cycle.
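As a minimal sketch of the SVD step (illustrative code only, using simulated
profiles rather than the yeast arrays): centre each gene's expression profile,
take the SVD, and order genes by the phase angle of their projections onto the
two leading singular vectors.

import numpy as np

# Toy expression matrix: 500 "genes" observed at 18 time points over two cycles.
rng = np.random.default_rng(0)
t = np.arange(18)
true_phase = rng.uniform(0, 2 * np.pi, 500)
X = np.cos(2 * np.pi * t / 9 - true_phase[:, None]) \
    + 0.3 * rng.standard_normal((500, t.size))

X = X - X.mean(axis=1, keepdims=True)          # centre each gene's profile
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The two leading singular vectors capture the dominant periodic signal;
# the angle of each gene's projection places it on the unit circle.
angles = np.arctan2(U[:, 1] * s[1], U[:, 0] * s[0])
order = np.argsort(angles)                     # genes sorted by estimated phase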
Optimal Design for Nonlinear Regression Models
Under antiviral treatment, HIV dynamics can be represented by
nonlinear models. Optimal design for such models is concerned with a
choice of sampling times so as to achieve high efficiency in parameter
estimation. When data are analyzed separately for each patient,
fixed-effects nonlinear regression models are appropriate. Analytic
results will be presented regarding locally D- and c-optimal designs
for a particular model, where the c-optimal design is for estimating a decay rate. When data from different subjects are analyzed simultaneously, a nonlinear
mixed-effects model is often used. After a case study in analysis of HIV
dynamic data is presented, optimal design issues of such models will also
be discussed.
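For concreteness, a minimal sketch of a locally D-optimal search for a
fixed-effects nonlinear regression: the biexponential viral-decay curve below
is a common HIV dynamic model and an assumption of this sketch, not
necessarily the particular model analyzed in the talk.

import numpy as np
from itertools import combinations

# Assumed mean function (biexponential decay): V(t) = p1*exp(-d1*t) + p2*exp(-d2*t).
def jacobian(t, p1, d1, p2, d2):
    # Rows: sampling times; columns: partial derivatives w.r.t. (p1, d1, p2, d2).
    return np.column_stack([np.exp(-d1 * t), -p1 * t * np.exp(-d1 * t),
                            np.exp(-d2 * t), -p2 * t * np.exp(-d2 * t)])

theta0 = (1e5, 0.5, 1e3, 0.04)          # a locally optimal design needs a prior guess
candidates = np.linspace(1, 28, 20)     # candidate sampling days
best_times, best_logdet = None, -np.inf
for times in combinations(candidates, 4):     # brute-force search over 4-point designs
    J = jacobian(np.array(times), *theta0)
    logdet = np.linalg.slogdet(J.T @ J)[1]    # log-determinant of the information matrix
    if logdet > best_logdet:
        best_times, best_logdet = times, logdet
print("locally D-optimal sampling times (days):", np.round(best_times, 1))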
Sequential Monte Carlo and Dirichlet Mixtures for Extracting Protein Alignment Models
Construction of a multiple sequence alignment of a protein family can give
biologists good structural and functional insight into the sequences of that
family. Multiple alignments of proteins are used to find the functional
patterns characterizing protein families, to detect homology between new
sequences and existing families, to predict the secondary and tertiary
structure of proteins, and to understand the biological roles of a target
protein. Multiple sequence alignment (MSA) can be viewed as computation of a
posterior mean
$E(\bTheta\:|\:S^{(1)},\:...,\:S^{(n)})$, where $\bTheta$ is a position
specific profile matrix and $S^{(1)},\:...,\:S^{(n)}$ are the sequences to
be aligned. $P(S^{(i)}\:|\:\bTheta)$ can be described by a Hidden Markov
Model, with the ``hidden'' state being an unknown alignment path $A^{(i)}$
which generates $S^{(i)}$ from the (unknown) $\bTheta$. I will describe a
novel query-centric Bayesian algorithm that applies the Sequential Monte
Carlo (SMC) framework to align the sequences and create the position
specific profile matrix $\bTheta$, regarding the unknown profile and the
paths aligning each of the sequences to the profile as missing data and
imputing them sequentially. After information from all sequences is
incorporated into the final profile, the sequences are realigned to the
profile, and a Gibbs sampler is introduced to improve the multiple
alignment. The use of SMC and the Gibbs sampler allows the procedure to
avoid getting stuck in a local mode. As the base algorithm for aligning new
sequences to the profile, we use an extension of Hidden Markov Model-based
pairwise sequence alignment, which will be briefly described. To capture the
information contained in a column of aligned amino acids, our method uses a
Dirichlet Mixture prior. I will show examples of our MSA method's
performance and compare it with other methods.
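To fix ideas, here is a generic sequential-imputation recursion of the kind
used in such SMC schemes (the notation below is generic and may differ from
the talk's exact recursion): alignment paths are imputed one sequence at a
time from their predictive distributions, and the accumulated weights are
used to average the profile estimates,
\[
A^{(i)} \sim P\bigl(A^{(i)}\:|\:A^{(1)},\:...,\:A^{(i-1)},\:S^{(1)},\:...,\:S^{(i)}\bigr),
\qquad
w = \prod_{i=1}^{n} P\bigl(S^{(i)}\:|\:A^{(1)},\:...,\:A^{(i-1)},\:S^{(1)},\:...,\:S^{(i-1)}\bigr),
\]
\[
\widehat{E}\bigl(\bTheta\:|\:S^{(1)},\:...,\:S^{(n)}\bigr)
= \frac{\sum_{j} w^{(j)}\,E\bigl(\bTheta\:|\:S^{(1)},\:...,\:S^{(n)},\,A^{(1,j)},\:...,\:A^{(n,j)}\bigr)}{\sum_{j} w^{(j)}},
\]
where $j$ indexes the SMC samples (imputed sets of alignment paths).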
Mixtures-of-Experts of Generalized Linear Time Series
We consider a novel class of non-linear models based on mixtures of
local generalized linear time series. In our construction, at any
given time, we have a certain number of generalized linear models
(GLM), denoted experts, where the vector of covariates may include
functions of lags of the dependent variable. Additionally, we have a
latent variable, whose distribution depends on the same covariates as
the experts, that determines which GLM is observed. This structure is
highly flexible, as shown by Jiang and Tanner in a series of papers on
mixtures of GLMs with independent observations. For
parameter estimation, we show that maximum likelihood (ML) provides
consistent and asymptotically normal estimators under certain
regularity conditions. We perform some Monte Carlo simulations to
study the properties of the ML estimators for finite samples.
Finally, we apply the proposed models to study some real examples of
time series in Marketing and Finance.
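In generic notation (not necessarily that of the authors), the conditional
density at time $t$ is a covariate-dependent mixture: a multinomial-logit
gating function selects which GLM expert generates the observation,
\[
p(y_t\:|\:x_t) \;=\; \sum_{j=1}^{m}
\frac{\exp\{v_j^{\top}x_t\}}{\sum_{k=1}^{m}\exp\{v_k^{\top}x_t\}}\;
\pi\bigl(y_t;\,h(\beta_j^{\top}x_t),\,\phi_j\bigr),
\]
where $x_t$ may contain lags of $y_t$, $\pi(\cdot\,;\mu,\phi)$ is an
exponential-family (GLM) density with mean $\mu$ and dispersion $\phi$, and
$h$ is the inverse link function.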
Flexible Versus Parsimonious Regression Models: Asymptotic Relative
Efficiency of Chi-Square Tests of Association with Unequal Degrees of
Freedom
In many research applications, interest centers on assessing
evidence of an association between a quantitative explanatory variable and a
discrete or continuous response variable in the presence of covariates. A
common strategy for assessing statistical significance is to specify a
parametric model for the unknown regression function and then to test
whether the term or terms pertaining to the variable of interest can be
deleted. For maximizing statistical power, a parsimonious and accurate
representation of the effect of the explanatory variable of interest is
desirable. If a choice is to be made between parsimony and accuracy, how
should one proceed? To help address this question, an explicit closed-form
expression is derived for the asymptotic relative efficiency of two
chi-square test statistics with different degrees of freedom corresponding
to two different strategies for modeling the explanatory variable of
interest. This asymptotic result permits numeric evaluation and can be used
to develop guidelines for selecting a model in light of uncertainty about
the correct functional form. The numeric results shed light on the question
of whether adding terms to a regression model to reduce the extent of model
misspecification will increase or decrease statistical efficiency.
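As a purely numerical illustration of the trade-off (the numbers below are
hypothetical and are not taken from the paper's closed-form result): compare
the power of a 1-df test that captures only part of the noncentrality with a
4-df test that captures all of it.

from scipy.stats import chi2, ncx2

def power(df, noncentrality, alpha=0.05):
    # Power of a level-alpha chi-square test with the given df and noncentrality.
    critical = chi2.ppf(1 - alpha, df)
    return ncx2.sf(critical, df, noncentrality)

lam = 10.0                    # hypothetical total noncentrality
print(power(1, 0.8 * lam))    # parsimonious but mildly misspecified model
print(power(4, lam))          # flexible, correct model paying for extra df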
Multivariate Extremes, Max-Stable Process Estimation and Dynamic Financial Modeling
Studies have shown that time series data from finance, insurance, the
environment, and other fields are fat-tailed and exhibit clustering when
extremal events occur. In an effort to characterize such extremal processes,
max-stable processes or min-stable processes have been proposed
since the 1980s and some probabilistic properties have been obtained.
However, applications are very limited due to the lack of efficient
statistical estimation methods. Recently, the author has shown some
probabilistic properties of the processes and proposed a series of
estimation procedures to estimate the underlying max-stable processes,
i.e., multivariate maxima of moving maxima processes. In this talk, I will
present some basic properties of, and estimation procedures for, multivariate
extremal processes, and illustrate how to model financial data as moving
maxima processes. Examples use GE, Citibank, and Pfizer stock data.
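A minimal simulation sketch of a multivariate maxima of moving maxima
(M4)-type process, with generic notation and arbitrary coefficients, included
only to make the construction concrete (it is not the estimation procedure of
the talk):

import numpy as np

rng = np.random.default_rng(1)
n, D, L, K = 1000, 2, 2, 3      # series length, dimensions, signal processes, lag window

# Nonnegative coefficients a[l, k, d], normalized to sum to one for each component d.
a = rng.random((L, K, D))
a /= a.sum(axis=(0, 1), keepdims=True)

# Independent unit-Frechet noise: Z = -1/log(U) for U ~ Uniform(0, 1).
Z = -1.0 / np.log(rng.random((L, n + K)))

# Y[i, d] = max over (l, k) of a[l, k, d] * Z[l, i + K - 1 - k]: moving maxima of maxima.
Y = np.empty((n, D))
for i in range(n):
    Y[i] = np.max(a * Z[:, i:i + K][:, ::-1, None], axis=(0, 1))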
Analysis of Longitudinal Binary Data with Non-Ignorable Missing Values
Many longitudinal dementia studies of elderly individuals suffer from a
significant amount of missing data, most of which are due to the death of
study participants. It is generally believed that these data missing through
death are non-ignorable for likelihood-based inference, and inference based
on data from surviving individuals only may lead to biased results. I will
present three approaches to dealing with these missing data in dementia
studies. The first adopts the selection model framework, in which the
probability of missingness is explicitly modeled to depend on the missing
outcomes. The second approach models both the probability of disease and the
probability of missingness using shared random effect parameters. Lastly, we
set up an illness-death stochastic model to simultaneously estimate disease
incidence and mortality rates. Data from a longitudinal dementia study will
be used to illustrate all three approaches.
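Schematically, and in generic notation, the first two approaches differ in
how the joint distribution of the outcome vector $y$ and the missingness
indicators $r$ is factorized:
\[
\text{selection model: } f(y, r\:|\:x) = f(y\:|\:x)\,P(r\:|\:y, x),
\qquad
\text{shared random effects: } f(y, r\:|\:x) = \int f(y\:|\:b, x)\,P(r\:|\:b, x)\,f(b)\,db.
\]
In the first, missingness may depend explicitly on the (possibly unobserved)
outcomes; in the second, a common latent effect $b$ induces the dependence
between outcomes and missingness.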
Overdetermined Estimating Equations with Applications to Panel Data
Panel data has important advantages over purely cross-sectional or
time-series data in studying many economic problems, because it contains
information about both the intertemporal dynamics and the individuality of
the entities being investigated. A commonly used class of models for
panel studies identifies the parameters of interest through an
overdetermined system of estimating equations. Two important problems
that arise in such models are the following: (1) It may not be clear
{\it{a priori}} whether certain estimating equations are valid. (2) Some
of the estimating equations may only ``weakly'' identify the parameters of
interest, providing little information about these parameters and making
inference based on conventional asymptotic theory misleading. A procedure
based on empirical likelihood for choosing among possible estimators and
selecting variables in this setting is developed. The advantages of the
procedure over other approaches in the econometric literature are
demonstrated through theoretical analysis and simulation studies. Related
results on empirical likelihood, the generalized method of moments, and
generalized estimating equations are also presented.
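For readers unfamiliar with the framework, a generic statement of the setup
(notation not taken from the paper): with a $q$-dimensional estimating
function $g$ and a parameter $\theta$ of dimension $p < q$, the empirical
likelihood profiles multinomial weights subject to the moment restrictions,
\[
L(\theta) \;=\; \max\Bigl\{\,\prod_{i=1}^{n} p_i \;:\; p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1,\ \sum_{i=1}^{n} p_i\,g(X_i,\theta) = 0 \Bigr\},
\]
and comparisons of such profiled likelihoods provide the basis for choosing
among estimating equations and selecting variables.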
A Bayesian Framework for Analyzing Exposure Data from the Iowa
Radon Lung Cancer Study
A Bayesian approach is developed for modeling the association between lung
cancer risk and residential radon exposure measured with error. Markov chain
Monte Carlo (MCMC) methods are used to fit the model to data from the Iowa
Radon Lung Cancer Study. Fast and efficient C++ libraries were written to
facilitate the use of MCMC methods with the large Iowa dataset. The proposed
methodology introduces a measurement model for the radon process and uses
the modeled radon concentrations as the exposure variable in the risk model.
Reducing the measurement error in the exposure variable can lead to improved
estimates of the relationship between lung cancer risk and radon exposure.
The disease and radon processes are modeled jointly so that both the measured
radon concentrations and the risk information contribute to the estimated
true radon exposures. This joint modeling approach has the potential to
improve both the estimated true exposures and the measured association
between exposure and disease risk. The Bayesian risk model is applied to
data from the Iowa Radon Lung Cancer Study.
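A generic version of this type of joint specification (not the exact Iowa
model) couples a risk model, a measurement model, and an exposure model:
\[
\operatorname{logit} P(Y_i = 1\:|\:X_i, Z_i) = \alpha + \beta X_i + \gamma^{\top} Z_i,
\qquad
W_i\:|\:X_i \sim N(X_i, \sigma_w^{2}),
\qquad
X_i \sim N(\mu_x, \sigma_x^{2}),
\]
where $X_i$ is the true (unobserved) radon exposure, $W_i$ the measured
concentration, and $Z_i$ other covariates. MCMC updates the $X_i$ jointly
with the risk parameters, so that both the measurements and the disease
outcomes inform the estimated exposures.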
Multicategory Support Vector Machines
The Support Vector Machine (SVM) has become a popular choice of
classification tool in practice. Despite its attractive theoretical
properties and its empirical success in solving binary problems,
generalization of the SVM to more than two classes has not
been straightforward. Oftentimes multicategory problems have been
treated as a series of binary problems in the SVM paradigm.
However, solutions to a series of binary problems may not be optimal
for the original multicategory problem.
We propose multicategory SVMs, which extend the binary SVM to the
multicategory case, and encompass the binary SVM as a special case.
The proposed method deals with the equal misclassification cost and
the unequal cost case in a unified way.
It is shown that the multicategory SVM implements the optimal
classification rule for appropriately chosen tuning parameters
as the sample size gets large.
The effectiveness of the method is demonstrated through
simulation studies and real applications to cancer
classification problems using gene expression data.
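One way such an extension can be written in the equal-cost case (a common
formulation in the literature, given here only as a sketch): with $k$
classes, estimate $f = (f_1,\ldots,f_k)$ under a sum-to-zero constraint by
minimizing a multicategory hinge loss plus a roughness penalty,
\[
\min_{f}\; \frac{1}{n}\sum_{i=1}^{n}\sum_{j \ne y_i}\Bigl(f_j(x_i) + \tfrac{1}{k-1}\Bigr)_{+}
\;+\; \lambda \sum_{j=1}^{k}\|h_j\|_{\mathcal{H}_K}^{2},
\qquad \text{subject to } \sum_{j=1}^{k} f_j(x) = 0 \ \text{for all } x,
\]
where $f_j = b_j + h_j$ with $h_j$ in a reproducing kernel Hilbert space and
the classification rule is $\arg\max_j f_j(x)$; for $k = 2$ this reduces to
the usual binary SVM.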
Design and Inference for Gaussian Random Fields
Gaussian random fields (GRFs) can be used to model many
physical processes in space. In this talk we present two kinds of
results for GRFs: spatial sampling design and covariance parameter
estimation. We study spatial sampling design for prediction of
stationary isotropic GRFs with estimated parameters of the covariance
function. The key issue is how to incorporate the parameter
uncertainty into the design criteria. Several possible design
criteria are discussed. An annealing algorithm is used to search for
optimal designs of small sample size and a two-step algorithm is
proposed for moderately large sample sizes. Simulation results are
presented for the Mat\'ern class of covariance functions. The
inference issue we consider is the asymptotic properties of estimates
of parameters of fractional Brownian motion. We give the fixed-domain
asymptotic distributions of both least squares and maximum likelihood
estimates, which differ from the more standard increasing-domain asymptotic
results. We discuss why these results should still apply when the process is
not fractional Brownian motion but instead a GRF with covariance function in
the Mat\'ern class.
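As a minimal illustration of the prediction-oriented design problem only
(treating the covariance parameters as known, i.e., ignoring exactly the
parameter uncertainty that the talk addresses), a candidate design can be
scored by its worst-case simple-kriging variance under a Mat\'ern covariance:

import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv

def matern(h, sigma2=1.0, rho=0.3, nu=1.5):
    # Matern covariance as a function of distance h (equals sigma2 at h = 0).
    h = np.asarray(h, dtype=float)
    cov = np.full(h.shape, sigma2)
    pos = h > 0
    u = np.sqrt(2 * nu) * h[pos] / rho
    cov[pos] = sigma2 * (2 ** (1 - nu) / gamma(nu)) * u ** nu * kv(nu, u)
    return cov

def max_kriging_variance(design, grid, sigma2=1.0):
    # Worst-case simple-kriging variance over the prediction grid.
    K = matern(cdist(design, design), sigma2=sigma2)
    k = matern(cdist(grid, design), sigma2=sigma2)
    w = np.linalg.solve(K, k.T)                     # kriging weights
    return np.max(sigma2 - np.einsum('ij,ji->i', k, w))

rng = np.random.default_rng(2)
grid = np.array([(x, y) for x in np.linspace(0, 1, 21) for y in np.linspace(0, 1, 21)])
design = rng.random((8, 2))                         # one candidate 8-point design
print(max_kriging_variance(design, grid))           # criterion to be minimized over designs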
Some Statistical and Computational Challenges of Population Genetics After The Human Genome Project
Genomic study has entered a new era with the completion of the first draft
of the human genome. The human genome harbors millions of single nucleotide
polymorphism sites (SNPs), whose identification and characterization is an
area of intense research and competition. SNPs are and will be used in many
biomedical and population studies, and it is fundamentally important to
understand their statistical properties. Coalescent theory, a branch
of theoretical population genetics, offers excellent tools for
studying SNPs. I will review basic components of the theory and
recent advances that are relevant for studying SNPs. I will
discuss the problem of ascertainment bias, which arises from the use of a
small sample to detect SNPs, and present a couple of examples of the
application of coalescent theory. Many analyses of genomic data require
considerable computational resources; I will discuss a solution for building
a powerful yet inexpensive computing farm based on Java technology.
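As a toy calculation under the standard neutral coalescent (an illustration
only, not a result from the talk): SNP discovery in a small panel
under-samples rare variants, distorting the site frequency spectrum.

import numpy as np
from math import comb

n, d = 50, 4                        # full sample size; small SNP-discovery panel
counts = np.arange(1, n)            # derived-allele count in the full sample

sfs = 1.0 / counts                  # neutral expectation: proportional to 1/i
sfs /= sfs.sum()

# Probability a site with count i is polymorphic in a random subpanel of size d.
p_detect = np.array([1 - (comb(n - int(i), d) + comb(int(i), d)) / comb(n, d)
                     for i in counts])

ascertained = sfs * p_detect
ascertained /= ascertained.sum()    # spectrum among discovered SNPs
# Comparing sfs with ascertained shows the deficit of rare variants after ascertainment.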
Penalty Choices and Consistent Covariate Selection in
Semiparametric Models
We suggest a model selection approach
for estimation in semiparametric regression models and investigate
the compatibility of the
following optimality aspects:
consistent covariate selection of the parametric component,
asymptotic normality of the {\it selected} estimator of the
parametric part and adaptive estimation of the nonparametric component.
We show that these goals cannot be attained simultaneously by a direct
extension of standard parametric or nonparametric model selection methods.
We introduce a new type of penalization, tailored to semiparametric models,
and present one- and two-stage estimation procedures, discussing their
respective merits. Examples of models to which this methodology applies
include partially linear regression, generalized semilinear regression
models, and semiparametric hazard function regression models. We illustrate
our method and present a simulation study for the partially linear
regression model.
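To fix notation in the simplest example, a partially linear model can be
estimated by a penalized criterion of the generic form below; the talk
concerns the choice of the covariate-selection penalty
$\operatorname{pen}_{\mu}(\beta)$, whose new form is not reproduced here:
\[
(\hat\beta, \hat f) \;=\; \arg\min_{\beta,\,f}\;
\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta - f(t_i)\bigr)^{2}
\;+\; \lambda \int \{f''(t)\}^{2}\,dt
\;+\; \operatorname{pen}_{\mu}(\beta).
\]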
What do we MEAN by the MEDIAN?
This talk is a survey of multivariate extensions of the univariate
median. Particular
emphasis is placed on multivariate generalizations that have the
property of affine-equivariance, the benefits of which are explained. A new
multivariate median is discussed which has some desirable properties, the
most important of which is that it is easy to compute in common dimensions.
Modelling Dependence in Longitudinal Data
Modelling the dependence structure in longitudinal data is often not
given as much attention as modelling the mean structure.
However, carefully
modelling the dependence can be important
in a variety of situations.
Clearly, if the goal is prediction, such modelling is very important.
In addition, when there is missing data that is thought to be
missing at random (MAR) or non-ignorable,
incorrectly modelling the dependence can result in biased
estimates of fixed effects. Bias can also be a problem when the
random effects covariance matrix is mis-modelled in generalized linear
mixed models.
In general, correctly modelling the dependence will increase the efficiency of
estimators for fixed effects. We will propose several approaches to model
dependence.
In particular, we will discuss parameterizations of, and prior
distributions for, the covariance
matrix in the context of shrinking toward a structure and in developing
parametric models with covariates
(heterogeneous covariance structures).
The results of simulations to assess these approaches will be reported and
computationally efficient approaches to compute estimators and fit
models
will be proposed.
Some of these approaches apply to modelling both the 'marginal' covariance
matrix (for continuous normal data) and the random effects covariance matrix
(for continuous and discrete data).
Data collected from a series of depression trials will be used to
illustrate these models.
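One widely used family of parameterizations of this kind (shown for
concreteness; the talk's proposals may differ) is the modified Cholesky
decomposition, in which the covariance matrix is re-expressed through
autoregressive coefficients and innovation variances that can be modelled
with covariates:
\[
T \Sigma T^{\top} = D,
\qquad
y_{it} = \mu_{it} + \sum_{j < t} \phi_{itj}\,(y_{ij} - \mu_{ij}) + \varepsilon_{it},
\quad \varepsilon_{it} \sim N(0, \sigma_{it}^{2}),
\]
with, for example, $\phi_{itj} = z_{itj}^{\top}\gamma$ and
$\log \sigma_{it}^{2} = w_{it}^{\top}\lambda$; shrinking toward a structure
then corresponds to placing priors on $(\gamma, \lambda)$ centred at that
structure.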
Analysis of Correlated Survival Data With Dependent
Censoring
In biomedical studies, patient survival time may be
censored by many events. Some censoring events, such as patient
voluntary withdrawal or changing treatment, may be related to
potential failure. This is called dependent censoring in survival
analysis. However, due to the identifiability problem, these events
are often assumed to be independent of failure, even though this false
assumption can lead to biased inference. It is also common that
subjects in the study are naturally clustered, for example, they are
patients from multiple medical centers, or they are family members, or
they are litter mates. Subjects in the same cluster share some common
factors. Therefore, their survival outcomes are likely to be
correlated rather than independent of each other. In these cases, we have
correlated survival data. In my talk, I will consider these two
problems (dependent censoring and correlated data) simultaneously. I
developed a test to check if dependent censoring is present. I also
developed a model to analyze correlated survival data with dependent
censoring. The EM algorithm is used to fit the model. In the E-steps,
integrals without a closed form are evaluated by Markov chain Monte Carlo;
a Metropolis-Hastings algorithm is used to construct a Markov chain with the
desired stationary distribution. Simulation studies and an analysis of a
real data set of kidney disease patients are provided.
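As a sketch of the computational ingredient only (a generic random-walk
Metropolis-Hastings sampler, not the talk's specific model or sampler), the
E-step integrals can be approximated by averaging over draws such as these:

import numpy as np

def mh_sample(log_density, x0, n_draws=2000, step=0.5, rng=None):
    # Generic random-walk Metropolis-Hastings for a univariate target,
    # e.g. a cluster-level random effect's conditional density in an E-step.
    rng = rng or np.random.default_rng()
    x, lp = x0, log_density(x0)
    draws = np.empty(n_draws)
    for i in range(n_draws):
        proposal = x + step * rng.standard_normal()
        lp_prop = log_density(proposal)
        if np.log(rng.random()) < lp_prop - lp:     # accept with the MH probability
            x, lp = proposal, lp_prop
        draws[i] = x
    return draws

# Purely illustrative unnormalized log-density for a random effect b.
draws = mh_sample(lambda b: -0.5 * b ** 2 + 3 * b - 0.8 * np.exp(b), x0=0.0)
print(draws[500:].mean())                           # Monte Carlo estimate used in the E-step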
A New/Old Approach to Statistical Modelling
In the 1880s and 1890s Francis Galton and Karl Pearson argued about the
foundations of Statistics and the best way forward. Pearson won and, as often
happens, later history was written from the victors' point of view. This talk
explores a possible alternative history, had Galton won. The focus is
particularly on statistical modelling.
Stepwise Intersection-Union Tests with Applications in
Biostatistics
The stepwise intersection-union test method allows multiple
hypotheses to be tested with some hypotheses possibly rejected
and others not rejected. This method of multiple testing is
particularly simple because each hypothesis is tested with an
alpha level test, but the familywise error rate of the method is
also alpha. In this talk, the stepwise intersection-union test
method will be explained, and its use will be illustrated in
several biostatistical applications. These applications include
testing primary, secondary, and tertiary outcomes; determining a
minimum effective dose; and estimating the onset and duration of
a treatment effect.
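A rough sketch of the flavour of such a procedure in the minimum effective
dose setting (a schematic, not necessarily the exact method of the talk):
doses are tested from highest to lowest, each at level alpha, stopping at the
first non-rejection, and the familywise error rate remains alpha.

import numpy as np
from scipy.stats import ttest_ind

def stepdown_med(placebo, dose_groups, alpha=0.05):
    # Test doses from highest to lowest, each with a level-alpha one-sided test;
    # stop at the first failure; report the lowest dose declared effective.
    med = None
    for dose in sorted(dose_groups, reverse=True):
        p = ttest_ind(dose_groups[dose], placebo, alternative='greater').pvalue
        if p >= alpha:
            break
        med = dose
    return med

rng = np.random.default_rng(3)
placebo = rng.normal(0.0, 1.0, 30)
doses = {d: rng.normal(0.2 * d, 1.0, 30) for d in (1, 2, 3, 4)}
print(stepdown_med(placebo, doses))                 # estimated minimum effective dose (or None)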
Quantifying Environmental Risk via Low-Dose Extrapolation
Statistical models and methods are discussed for quantifying the
risk associated with exposure to environmental hazards. Attention
is directed at problems in dose-response/concentration-response
modeling when low-dose risk estimation is a primary goal. Possible
low-dose measures include observed-effect levels, effective
doses/effective concentrations, added risk/extra risk measures,
and benchmark doses.
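For reference, two of these low-dose measures are commonly defined, with
$R(d)$ the probability of an adverse response at dose $d$, as
\[
\text{added risk: } R_A(d) = R(d) - R(0),
\qquad
\text{extra risk: } R_E(d) = \frac{R(d) - R(0)}{1 - R(0)},
\]
and a benchmark dose is the dose at which the chosen risk measure attains a
prespecified benchmark response, e.g.\ $R_E(\mathrm{BMD}) = 0.10$.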