Bayesian nonlinear regression and classification in "Large p, Small n" problems
We propose nonlinear Bayesian models for problems in which sample sizes are
substantially smaller than the number of available predictor variables. Novel
models for regression and classification are needed to handle these challenging
problems. We consider Bayesian regression and classification methods based
on kernel reproducing Hilbert spaces. We consider the Gaussian likelihood,
logistic likelihood as well as likelihoods related to the Support Vector
Machine (SVM) models. We develop MCMC based computation methods and exploit it
to perform posterior and predictive inferences. The regression problem is motivated
by calibration problems in near-infrared (NIR) spectroscopy. We consider the
nonlinear regression setting in which many predictor variables arise from sampling an essentially continuous curve at equally spaced points and there may be multiple predictands. Precise classification of tumors is critical for cancer diagnosis
and treatment. Diagnostic pathology has traditionally relied on macro
and microscopic histology and tumor morphology as the basis for tumor
classification. In recent years,
there has been a move towards the use of cDNA microarrays for
tumor classification. These high-throughput assays provide relative mRNA
expression measurements simultaneously for thousands of genes. We will use our
nonlinear models to perform classification based on gene expression patterns.
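As a rough illustration of the Gaussian-likelihood case, the sketch below computes a closed-form kernel (RKHS) regression fit; this conjugate calculation stands in for the MCMC-based inference described above, and the squared-exponential kernel, its length scale, the noise variance, and the toy "large p, small n" data are assumptions made only for illustration.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def kernel_posterior_mean(X, y, X_new, noise_var=0.1, length_scale=1.0):
    """Posterior predictive mean under a Gaussian likelihood:
    f(X_new) = K(X_new, X) (K(X, X) + noise_var * I)^{-1} y."""
    K = rbf_kernel(X, X, length_scale)
    alpha = np.linalg.solve(K + noise_var * np.eye(len(y)), y)
    return rbf_kernel(X_new, X, length_scale) @ alpha

# Toy "large p, small n" data: n = 20 spectra sampled at p = 100 wavelengths.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
print(kernel_posterior_mean(X, y, X[:5]))
```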
The 95% Upper Confidence Bound for the Mean of Lognormal Data
The lognormal distribution is the most commonly used probability
density model for environmental contaminant data. In environmental exposure
assessment, the average concentration of the contaminant at a site is of
direct interest, but typically the 95% upper confidence bound (UCL) is used
as a conservative estimate of average concentration for subsequent risk
calculations. In this talk, ten different estimators used to compute the UCL will be presented and discussed. Results from a Monte Carlo simulation, as well as comparisons using soil concentrations of arsenic from a Florida site, will be discussed.
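The ten estimators compared in the talk are not listed in this abstract; as a hedged illustration, the sketch below computes two of the simplest UCLs often used in this setting, a Student-t UCL and a percentile-bootstrap UCL, on simulated lognormal concentrations (the data and parameters are hypothetical).

```python
import numpy as np
from scipy import stats

def t_ucl(x, conf=0.95):
    """Student-t upper confidence bound for the mean on the original scale."""
    n = len(x)
    return x.mean() + stats.t.ppf(conf, n - 1) * x.std(ddof=1) / np.sqrt(n)

def bootstrap_ucl(x, conf=0.95, n_boot=5000, seed=0):
    """Percentile-bootstrap upper confidence bound for the mean."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=(n_boot, len(x)), replace=True).mean(axis=1)
    return np.quantile(means, conf)

# Simulated lognormal contaminant concentrations (hypothetical example).
rng = np.random.default_rng(1)
x = rng.lognormal(mean=1.0, sigma=1.5, size=30)
print(t_ucl(x), bootstrap_ucl(x))
```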
Generating a random signed permutation with random reversals
Signed permutations form a group known as the hyperoctahedral group. We
determine the mixing time for a certain random walk on the
hyperoctahedral group that is generated by random reversals.
Specifically, we show that $O(n \log n)$ steps are both necessary and
sufficient for total variation distance to uniformity to become small.
This research was motivated by an effort in mathematical biology to
model the evolutionary changes in gene order.
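A minimal simulation of the random walk described above might look like the sketch below; the convention that a reversal flips the signs of the reversed segment, and the uniform choice of the two endpoints, are assumptions about the step distribution rather than details taken from the talk.

```python
import numpy as np

def random_reversal_walk(n, n_steps, seed=0):
    """Start from the identity signed permutation and apply random reversals:
    pick positions i <= j uniformly, reverse that segment and flip its signs."""
    rng = np.random.default_rng(seed)
    perm = np.arange(1, n + 1)                 # identity signed permutation
    for _ in range(n_steps):
        i, j = sorted(rng.integers(0, n, size=2))
        perm[i:j + 1] = -perm[i:j + 1][::-1]   # reverse and negate the segment
    return perm

# Roughly n log n steps, the order shown to be necessary and sufficient for mixing.
n = 20
print(random_reversal_walk(n, int(n * np.log(n))))
```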
Recent Advances in Design of Experiments
Design of experiments seeks efficient ways to collect useful information. In the past decade, we have witnessed the information technology revolution. Many classical methods in design of experiments, which originated from agricultural problems, may not be appropriate for the information era. This talk attempts to explore some important issues in design of experiments that deserve immediate attention for future research. Topics to be discussed include supersaturated designs, uniform designs, computer experiments, optimal foldover plans, dispersion analysis, and multi-response surface methodology.
For each topic, the problem will be introduced,
some initial results will be presented,
and future research problems will be suggested.
Statistical Work Experiences in a Major Pharmaceutical Company
This talk will highlight the different phases in pharmaceutical drug discovery and development and the potential impact a statistician can have in collaboration with other scientific colleagues. The talk will outline the statistical techniques used most regularly in pre-clinical, early-phase, and late-phase human trials. Internship opportunities for students will also be discussed.
The In-and-Out-of-Sample (IOS) Likelihood Ratio Test
for Model Misspecification
A new test of model misspecification is proposed, based on the ratio
of in-sample and out-of-sample likelihoods. The test is broadly
applicable, and in simple problems approximates well known,
intuitive methods. Using jackknife influence curve approximations,
it is shown that the test statistic can be viewed asymptotically as
a multiplicative contrast between two estimates of the information
matrix that are equal under correct model specification. This
approximation is used to show that the statistic is asymptotically
normally distributed, though it is suggested that p-values be
computed using the parametric bootstrap. The resulting methodology
is demonstrated with a variety of examples and simulations involving
both discrete and continuous data. This is joint work with Dennis
Boos of North Carolina State University.
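As a rough sketch of the in-sample versus out-of-sample likelihood contrast, the code below compares the full-data log-likelihood with a leave-one-out log-likelihood for a simple normal model; the precise scaling of the IOS statistic and its parametric-bootstrap calibration are not reproduced here, so this is only an illustration of the idea.

```python
import numpy as np
from scipy import stats

def in_out_log_ratio(x):
    """Log-ratio of in-sample to leave-one-out ("out-of-sample") likelihoods
    for a normal model with unknown mean and variance, fitted by ML."""
    mu, sd = x.mean(), x.std()                      # full-data MLEs
    log_in = stats.norm.logpdf(x, mu, sd).sum()
    log_out = 0.0
    for i in range(len(x)):
        xi = np.delete(x, i)                        # refit without observation i
        log_out += stats.norm.logpdf(x[i], xi.mean(), xi.std())
    return log_in - log_out

rng = np.random.default_rng(2)
print(in_out_log_ratio(rng.normal(size=50)))        # calibrate by bootstrap in practice
```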
Analysis of Recurrent Event Time Data with Dependent Termination Time
We consider modeling and Bayesian analysis
for recurrent events data when the termination time
for each subject may depend on the history of the recurrent events.
We propose a fully specified stochastic model for the joint
distribution of the recurrent events and the termination time. For this
model, we
provide a natural motivation, derive several novel properties and
develop a Bayesian analysis based on a Markov chain Monte Carlo algorithm.
Comparisons are made to existing models and methods for recurrent events and panel count data. We demonstrate the usefulness of our new models and methodologies through the reanalysis of a dataset from a clinical trial.
Extended Likelihood Inference Applied to a New Class of Models
- A Two-Part Talk
Talk I
Random-effect models require an extension of Fisher likelihood. Extended
likelihood (Pawitan) or, equivalently, h-likelihood (Lee & Nelder), provides a basis for likelihood inference applicable to random-effect models. The model
class, called hierarchical generalized linear models (HGLMs), is derived from
generalized linear models (GLMs). It supports (1) joint modelling of mean
and dispersion; (2) GLM errors for the response; (3) random effects in the
linear predictor for the mean, with distributions following any conjugate
distribution of a GLM distribution; (4) structured dispersion components
depending on covariates. Fitting of fixed and random effects, given
dispersion components, reduces to fitting an augmented GLM, while fitting
dispersion components, given fixed and random effects, uses an adjusted
profile h-likelihood and reduces to a second interlinked GLM, which
generalizes REML to all the GLM distributions. A single algorithm can fit all members of the class and does not require
either the multiple quadrature necessary for methods using marginal
likelihood, or prior distributions as used in Bayesian methods. Model
checking also generalizes from GLMs and allows the visual checking of all
aspects of the model. Software in the form of Genstat procedures will be
used for illustrative examples.
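To make the augmented-GLM step concrete, the sketch below carries out the fixed- and random-effect fit, given the dispersion components, for the simplest normal-normal case, where the augmented weighted least-squares fit coincides with Henderson's mixed-model equations; the adjusted profile h-likelihood step that updates the dispersions is omitted, and the toy data and dispersion values are assumptions.

```python
import numpy as np

def augmented_fit(y, X, Z, phi, lam):
    """Joint fixed/random-effect estimates for a normal-normal HGLM
    (y = X b + Z u + e, u ~ N(0, lam I), e ~ N(0, phi I)), given the
    dispersion components, via weighted least squares on the augmented model."""
    n, p = X.shape
    q = Z.shape[1]
    T = np.block([[X, Z], [np.zeros((q, p)), np.eye(q)]])   # augmented design
    y_aug = np.concatenate([y, np.zeros(q)])                 # augmented response
    w = np.concatenate([np.full(n, 1.0 / phi), np.full(q, 1.0 / lam)])
    delta = np.linalg.solve(T.T @ (w[:, None] * T), T.T @ (w * y_aug))
    return delta[:p], delta[p:]                              # fixed, random effects

# Toy one-way layout: 5 groups of 4 observations, with assumed dispersions.
rng = np.random.default_rng(3)
Z = np.kron(np.eye(5), np.ones((4, 1)))
X = np.ones((20, 1))
y = X @ [1.0] + Z @ rng.normal(0.0, 1.0, 5) + rng.normal(0.0, 0.5, 20)
print(augmented_fit(y, X, Z, phi=0.25, lam=1.0))
```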
Talk II
The model class of Talk I is extended to cover correlated data expressed
by random effects in the model, thus allowing fitting of spatial and temporal
models with GLM errors. Correlations can be expressed by transformations of
white noise, by structured covariance matrices, or by structured precision
matrices. An important subclass can be expressed in terms of dispersion
components only, allowing the generalization of the analysis of balanced
designs with normal errors to the GLM class of distributions.
Finally the class can be extended to double HGLMs, which allow random effects in the dispersion model as well as in the mean. Analysis is still by means
of interlinked extended GLMs and GLMs. This leads, among other things, to a
potentially large expansion of classes of models used in finance, the
properties of which have still to be investigated.
Issues in Analysis of Unbalanced Mixed Model Data
A major transition has occurred in recent years in statistical methods for the analysis of linear mixed model data, from analysis of variance (ANOVA) to likelihood-based methods. Prior to the early 1990's, most applications used
ANOVA because computer software was either not available or not easy to use
for likelihood-based methods. ANOVA is based on ordinary least squares
computations, with adaptations for mixed models. Computer programs for such
methodology were plagued with technical problems of estimability, weighting,
and handling missing data. Likelihood-based methods mainly utilize a
combination of residual maximum likelihood (REML) estimation of covariance
parameters and generalized least squares (GLS) estimation of mean
parameters. Software for REML/GLS methods became readily available early in
the 1990's, but the methodology still is not universally embraced. Although
many of the computational inadequacies have been overcome, conceptual
problems remain. This talk will identify certain problems with ANOVA, and
describe which remain and which are resolved with REML/GLS.
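For reference, once the covariance matrix V is fixed, the GLS step mentioned above has the familiar closed form beta = (X'V^{-1}X)^{-1}X'V^{-1}y; the sketch below evaluates it for a hypothetical unbalanced layout, while the REML estimation of the parameters in V is not shown.

```python
import numpy as np

def gls(X, y, V):
    """Generalized least squares: (X' V^{-1} X)^{-1} X' V^{-1} y."""
    Vi = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)

# Hypothetical unbalanced two-group layout with an exchangeable covariance.
rng = np.random.default_rng(4)
n = 7
X = np.column_stack([np.ones(n), [0, 0, 0, 1, 1, 1, 1]])
V = 0.5 * np.ones((n, n)) + 0.5 * np.eye(n)
y = X @ [2.0, 1.0] + rng.multivariate_normal(np.zeros(n), V)
print(gls(X, y, V))
```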
"Semiparametric" Approaches for
Inference in Joint Models for Longitudinal and Time-to-Event Data
A common objective in longitudinal studies is to characterize the
relationship between a longitudinal response process and a
time-to-event. Considerable recent interest has focused on so-called
joint models, where models for the event time distribution (typically
proportional hazards) and longitudinal data are taken to depend on a
common set of latent random effects, which are usually assumed to
follow a multivariate normal distribution. A natural concern is
sensitivity to violation of this assumption. I will review the
rationale for and development of joint models and discuss two modeling
and inference approaches that require no or only mild assumptions on
the random effects distribution. In this sense, the models and
methods are "semiparametric." The methods will be demonstrated by
application to data from an HIV clinical trial.
Challis Lecture: As Time Goes By -- An Introduction to Methods for Analysis of
Longitudinal Data
Studies in which data are collected over time repeatedly on each of a
number of individuals are ubiquitous in health sciences research.
Over the past few decades, fundamental advances in the development of
statistical methods to analyze such longitudinal data have been made.
However, although these methods are widely accepted by research
statisticians, they are less well-known among biomedical researchers,
who thus may be reluctant to embrace them. In this talk, I will
discuss the types of scientific questions that may be of interest in
longitudinal studies, the rationale for the need for specialized
methods for longitudinal data analysis, and the basic elements of
popular longitudinal data methods and their advantages over
cross-sectional and ad hoc approaches. There will be a few equations, but lots of pictures and examples drawn from
collaborations in cardiology, pharmacology, HIV research, and other
substantive fields.
Non-parametric Alternatives For Mapping Quantitative Trait Loci:
Some Statistical Comparisons And Applications To Alcohol-related
Phenotypes in COGA
Unlike qualitative or binary traits which can be characterized
completely by allele frequencies and genotypic penetrances,
quantitative traits require an additional level of modeling:
the probability distribution of the underlying trait. Hence,
likelihood based methods like variance components, which
require assumptions like multivariate normality of trait values
within a family, may yield misleading linkage inferences when
underlying model assumptions are violated. The Haseman-Elston
regression method (1972) and its extensions do not assume any
specific probability distribution for the trait values, but
assume a linear relationship between the squared sib-pair trait
differences (or mean-corrected cross products of sib-pair trait values)
and the estimated identity-by-descent scores at a marker locus. Since it
is often difficult to test the validity of these assumptions, it is of
interest to explore non-parametric alternatives. Ghosh and Majumder (2000)
have therefore proposed that it may be strategically more judicious to
empirically estimate the nature of dependence of the two above-mentioned
variables using non-parametric diagnostics like rank correlation or
kernel-smoothing regression. In this study, we extend our earlier methodologies to multipoint mapping
and compare their performances to the linear regression procedures of
Elston et al. (2000). We find that while the non-parametric regression method
is marginally less powerful than the linear regression methods in the absence
of dominance, it performs increasingly better as dominance increases. The
non-parametric method also outperforms the linear regression procedures with
increasing deviation of the distribution of trait values from normality. We have used the non-parametric regression method to analyze data on
Beta 2 EEG waves and externalizing scores collected in the Collaborative
Study on the Genetics of Alcoholism (COGA) project. We have obtained
statistically significant signals of linkage on Chromosomes 1, 4, 5 and 15.
We have also investigated the presence of epistatic interactions between
regions exhibiting significant linkage. Evidence of epistasis was found between regions on Chromosomes 1 and 4 and regions on Chromosome 15.
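As a hedged illustration of the kernel-smoothing alternative to Haseman-Elston regression mentioned above, the sketch below runs a Nadaraya-Watson regression of squared sib-pair trait differences on marker identity-by-descent scores; the Gaussian kernel, bandwidth, and simulated data are assumptions, not the authors' exact procedure.

```python
import numpy as np

def nadaraya_watson(x, y, x_grid, bandwidth=0.1):
    """Kernel-smoothing (Nadaraya-Watson) estimate of E[y | x] on a grid."""
    w = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ y) / w.sum(axis=1)

# Hypothetical sib-pair data: IBD score at a marker vs squared trait difference;
# under linkage, the squared difference tends to decline as IBD sharing increases.
rng = np.random.default_rng(5)
ibd = rng.choice([0.0, 0.5, 1.0], size=200)
sq_diff = 2.0 - 1.5 * ibd + rng.exponential(1.0, size=200)
print(nadaraya_watson(ibd, sq_diff, np.linspace(0.0, 1.0, 5)))
```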