Canonical Correlation Analysis
A general theoretical foundation is developed for canonical
correlation analysis. The formulation rests on the classical isometric mapping
between the Hilbert space spanned by a second order stochastic process and the
reproducing kernel Hilbert space (RKHS) generated by its covariance kernel. By
taking this approach it becomes possible to develop a notion of canonical
correlation that extends the classical multivariate concept to infinite
dimensional settings. Several infinite dimensional versions of canonical
correlation analysis that have appeared in the literature are shown to be
special cases of this general theory. A method for computing the canonical
correlations is described and illustrated.
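As background for the extension, recall the classical multivariate definition (standard notation, with $\Sigma_{XX}$, $\Sigma_{YY}$, and $\Sigma_{XY}$ the covariance and cross-covariance matrices of random vectors $X$ and $Y$): the first canonical correlation is
$$\rho_1 = \max_{a,\,b}\ \mathrm{corr}\big(a^{T}X,\, b^{T}Y\big) = \max_{a,\,b}\ \frac{a^{T}\Sigma_{XY}\,b}{\sqrt{a^{T}\Sigma_{XX}\,a}\ \sqrt{b^{T}\Sigma_{YY}\,b}},$$
with higher-order canonical correlations defined by the same maximization subject to uncorrelatedness with the earlier canonical variates; the theory developed here generalizes this maximization beyond finite-dimensional vectors.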
A Bayesian Semi-Parametric Model for Random Effects
Meta-Analysis
In meta-analysis there is an increasing trend to explicitly
acknowledge the presence of study variability through random effects
models. That is, one assumes that for each study, there is a
study-specific effect and one is observing an estimate of this
latent variable. In a random effects model, one assumes that these
study-specific effects come from some distribution, and one can
estimate the parameters of this distribution, as well as the
study-specific effects themselves. This distribution is most often
modelled through a parametric family, usually a family of normal
distributions. The advantage of using a normal distribution is that
the mean parameter plays an important role, and much of the focus is
on determining whether or not this mean is 0. For example, it may
be easier to justify funding further studies if it is determined
that this mean is not 0. Typically, this normality assumption is
made for the sake of convenience, rather than from some theoretical
justification, and may not actually hold. We present a Bayesian
model in which the distribution of the study-specific effects is
modelled through a certain class of nonparametric priors. These
priors can be designed to concentrate most of their mass around the
family of normal distributions, but still allow for any other
distribution. The priors involve a univariate parameter that plays
the role of the mean parameter in the normal model. This work arose
from a problem in cardiology, which I will discuss and use to
illustrate the methodology.
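In generic notation (a sketch of the standard setup, not necessarily the speaker's exact formulation): for study $i$ with reported effect estimate $y_i$, latent study-specific effect $\theta_i$, and random effects distribution $G$,
$$y_i \mid \theta_i \sim N(\theta_i, \sigma_i^2), \qquad \theta_i \mid G \overset{iid}{\sim} G,$$
where the parametric analysis takes $G = N(\mu, \tau^2)$ and focuses on whether $\mu = 0$, while the semiparametric version instead places a nonparametric prior on $G$ that concentrates most of its mass near the normal family yet retains a univariate parameter playing the role of $\mu$.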
This is joint work with Deborah Burr.
Analysis of Non-Stationary Time Series:
An Overview of the SLEX Methods
Many biological and physical signals are non-stationary in nature.
For example, brain waves recorded during an epileptic seizure have
waveforms whose amplitude (variance) and oscillatory behavior
(spectral distribution) change over time. This talk will address
some of the interesting statistical problems in signal processing, namely,
(i) dimension reduction for massive multivariate non-stationary signals and
(ii) feature extraction and selection
for classification and discrimination.
I will present an overview of a coherent and unified body of
statistical methods that are based on the SLEX (smooth localized
complex exponentials) library. The SLEX are time-localized Fourier
waveforms with multi-resolution support. The SLEX library provides a
systematic and efficient way of extracting transient spectral and
cross-spectral features. In addition, the SLEX methods are able to
handle massive data sets because they utilize computationally
efficient algorithms. As a matter of practical importance,
the SLEX methods give results that are easy to understand because
they are time-dependent analogues of the classical Fourier methods
for stationary signals. Finally, under the SLEX models, we develop
theoretical results of consistency for spectral estimation and
classification.
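To fix ideas, the sketch below computes a crude time-localized spectral summary by hard-blocking a series and taking a periodogram in each block; this is only an illustration of time-localized Fourier analysis, not the actual SLEX construction (which uses smooth, orthogonal, multi-resolution localized waveforms), and all names here are my own.

```python
import numpy as np

def blocked_periodograms(x, block_len):
    """Crude time-localized spectral summary: split the series into
    non-overlapping blocks and compute a periodogram within each block.
    (Illustration only; SLEX waveforms are smooth, orthogonal and
    multi-resolution rather than hard, fixed-length blocks.)"""
    n_blocks = len(x) // block_len
    freqs = np.fft.rfftfreq(block_len)          # frequencies in cycles per sample
    specs = []
    for b in range(n_blocks):
        seg = x[b * block_len:(b + 1) * block_len]
        seg = seg - seg.mean()                  # remove the local level
        specs.append(np.abs(np.fft.rfft(seg)) ** 2 / block_len)
    return freqs, np.array(specs)               # rows: time blocks, columns: frequencies

# Example: a signal whose dominant frequency changes halfway through
t = np.arange(2048)
x = np.where(t < 1024, np.sin(2 * np.pi * 0.05 * t), np.sin(2 * np.pi * 0.20 * t))
freqs, specs = blocked_periodograms(x, block_len=256)
print(freqs[specs.argmax(axis=1)])              # shifts from about 0.05 to about 0.20
```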
The SLEX methods will be illustrated using biological and physical
data sets, namely, brain waves, fMRI time series, a speech signal
and seismic waves recorded during earthquake and explosion events.
This talk will conclude with open and challenging problems in signal
analysis and the potential of SLEX in addressing these.
Models for Space-time Processes and Their Application to Air
Pollution
My talk consists of two main topics: (i) numerical model
evaluation in air quality and (ii) space-time covariance functions on
spheres.
(i) We have developed a new method of evaluating numerical models in air
quality using monitoring data. For numerical models in air quality, it is
important to evaluate the numerical model output in the space-time context
and especially to check how the numerical model follows the dynamics of
the real process. We suggest that by comparing certain space-time
correlations from observations with those from numerical model output, we
can achieve this goal. I will demonstrate how our method is applied to a
numerical model called CMAQ for sulfate levels over North America.
(ii) For space-time processes on global or large scales, it is critical to
use models that respect the Earth's spherical shape. However, there has
been almost no research in this regard. We have developed a new class of
space-time covariance functions on the sphere crossed with time from a sum
of independent processes, where each process is obtained by applying a
first-order differential operator to a fully symmetric process on the sphere
crossed with time. The resulting covariance functions can produce various
types of space-time interactions and give different covariance structures
along different latitudes. Our approach yields explicit expressions for
the covariance functions and can also be applied to other spatial domains
such as flat surfaces. I will show the results of fitting our new covariance
functions to observed sulfate levels.
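Schematically, and in generic notation of my own (latitude $L$, longitude $\ell$, coefficients $a_i$, $b_i$ that may depend on location), the construction builds the process as
$$X(s,t) = \sum_{i}\Big(a_i(s)\,\frac{\partial}{\partial L} + b_i(s)\,\frac{\partial}{\partial \ell}\Big) Z_i(s,t),$$
where the $Z_i$ are independent, fully symmetric processes on the sphere crossed with time, so the covariance of $X$ follows explicitly by differentiating the covariances of the $Z_i$; letting the coefficients vary with latitude is one way such a model can produce covariance structures that differ along different latitudes.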
Finally, I will describe how we build a space-time model that combines
numerical model output and observations to build a space-time map of air
pollution levels. The information on space-time covariance structure
obtained from the numerical model evaluation procedure is useful for building
the space-time model. Moreover, the space-time covariance functions on
spheres play a critical role here because of the large spatial domain.
Bootstrap and Smoothing: Confidence Intervals for Population
Quantiles
The seminal work of Efron (1979) laid the foundation for a new branch of modern
statistical analysis, namely the bootstrap. Since then, bootstrap and smoothing
methods have become important and practical tools in contemporary statistical
analysis. The bootstrap provides a systematic, resampling-based way to estimate
the standard errors of estimators, while smoothing concerns the use of kernel
functions to smooth density estimates. In this talk, I shall first provide a
brief introduction to the concepts of bootstrap and smoothing and their roles
in statistical estimation and analysis. Then, I shall discuss an important
research area involving both, namely the construction of confidence intervals
for population quantiles.
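As a concrete (and deliberately basic) illustration of the two ideas, not the method to be presented in the talk: a percentile bootstrap confidence interval for a population quantile, with an optional smoothed-bootstrap variant that adds Gaussian kernel noise to each resampled value. The bandwidth choice is arbitrary and purely illustrative.

```python
import numpy as np

def bootstrap_quantile_ci(x, q=0.5, n_boot=2000, alpha=0.05, bandwidth=None, seed=0):
    """Percentile bootstrap CI for the q-th population quantile.
    If `bandwidth` is given, a smoothed bootstrap is used: each resampled
    observation is jittered with N(0, bandwidth^2) kernel noise."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(x, size=n, replace=True)
        if bandwidth is not None:
            resample = resample + rng.normal(0.0, bandwidth, size=n)
        stats[b] = np.quantile(resample, q)
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Example: 95% interval for the median of a skewed sample
data = np.random.default_rng(1).exponential(scale=2.0, size=200)
print(bootstrap_quantile_ci(data, q=0.5))                 # plain bootstrap
print(bootstrap_quantile_ci(data, q=0.5, bandwidth=0.2))  # smoothed bootstrap
```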
A Fully Nonparametric Test for One-Way Layouts
We consider the use of a vector of sign statistics as the basis of a
nonparametric test for equality of distributions in one-way layouts. An
important feature of this test is its ability to detect a broad range of
alternatives, including scale and shape differences. In this scenario,
the data consist of several independent measurements on each treatment or
subject. We will present finite sample and asymptotic distribution theory
for our test statistics and discuss a follow-up multiple comparisons
procedure based on a mixture model approximation. Finally, the methods
are illustrated using two examples from the literature.
Principal Components Analysis for High Dimensional Data
Suppose we have i.i.d. observations from a multivariate Gaussian
distribution with mean $\mu$ and covariance matrix $\Sigma$. We consider
the problem of estimating the leading eigenvectors of $\Sigma$ when the
dimension $p$ of the observation vectors increases with the sample
size $n$. We work under the setup where the covariance matrix is a finite
rank perturbation of the identity. We show that even though ordinary principal
components analysis may fail to yield a consistent estimator of the
eigenvectors, if the data can be sparsely represented in some known basis,
then a scheme based on first selecting a set of significant coordinates
and then applying PCA to the submatrix of the sample covariance matrix
corresponding to the selected coordinates gives better estimates. Under
suitable sparsity restrictions, we show that the risk of the proposed
estimator has the optimal rate of convergence when measured in a
squared-error type loss. We also state some results about the behavior of
the eigenvalues and eigenvectors of the sample covariance matrix when
$p/n$ converges to a positive constant.
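A minimal sketch of the select-then-PCA scheme described above; the variance-based selection rule, the spiked-model example, and all names are my own illustrative choices rather than the exact procedure from the talk.

```python
import numpy as np

def select_then_pca(X, n_select, n_components=1):
    """Step 1: keep the n_select coordinates with the largest sample variances
    (a simple stand-in for 'selecting significant coordinates').
    Step 2: run ordinary PCA on the corresponding submatrix of the sample
    covariance matrix, then embed the eigenvectors back into R^p."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    variances = Xc.var(axis=0)
    keep = np.argsort(variances)[-n_select:]       # indices of selected coordinates
    S_sub = np.cov(Xc[:, keep], rowvar=False)      # submatrix of sample covariance
    eigvals, eigvecs = np.linalg.eigh(S_sub)
    lead = eigvecs[:, -n_components:][:, ::-1]     # leading eigenvectors of the submatrix
    V = np.zeros((p, n_components))
    V[keep, :] = lead                              # zero-pad back to full dimension
    return V

# Example: p = 500, n = 100, a single spike supported on 10 coordinates
rng = np.random.default_rng(0)
n, p = 100, 500
v = np.zeros(p); v[:10] = 1 / np.sqrt(10)
X = rng.normal(size=(n, 1)) * 3.0 @ v[None, :] + rng.normal(size=(n, p))
v_hat = select_then_pca(X, n_select=20)[:, 0]
print(abs(v @ v_hat))   # close to 1 when the selected coordinates cover the support
```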
Regularization and Variable Selection via the Elastic Net
In the practice of statistical modeling, it is often desirable to have an
accurate predictive model with a sparse representation.
The lasso is a promising model building technique,
performing continuous shrinkage and variable selection simultaneously.
Although the lasso has shown success in many situations, it may
produce unsatisfactory results in some scenarios: (1) the number of
predictors (greatly) exceeds the number of observations;
(2) the predictors are highly correlated and form "groups".
A typical example is the gene selection problem in microarray analysis.
We propose the elastic net, a new regularization and variable
selection method. Real world data and a simulation study show that
the elastic net often outperforms the lasso, while enjoying a
similar sparsity of representation. In addition, the elastic net
encourages a grouping effect, where strongly correlated predictors
tend to be in or out of the model together. The elastic net is
particularly useful when the number of predictors is much bigger
than the number of samples. We have implemented an algorithm
called LARS-EN for efficiently computing the entire elastic net
regularization path, much like the LARS algorithm does for the
lasso. In this talk, I will also describe some interesting
applications of the elastic net in other statistical areas such as
sparse principal component analysis and margin-based kernel classifiers.
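For reference, the naive form of the elastic net criterion combines the lasso ($L_1$) and ridge ($L_2$) penalties (standard notation; tuning parameters $\lambda_1, \lambda_2 \ge 0$):
$$\hat{\beta} = \arg\min_{\beta}\ \|y - X\beta\|_2^{2} + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^{2},$$
so setting $\lambda_2 = 0$ recovers the lasso and $\lambda_1 = 0$ recovers ridge regression; the $L_2$ term is what induces the grouping effect among strongly correlated predictors.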
This is joint work with Trevor Hastie.
Model-Based Estimates of the Finite Population Mean for Two-Stage
Cluster Samples with Unit Nonresponse
We proposed new model-based methods for unit nonresponse in two-stage cluster
samples. Specifically, we showed that the usual random-effects model estimator
of the population mean (RE) is biased unless nonresponse is completely at
random, which requires the often unrealistic assumption that the response
rates are unrelated to cluster characteristics. This fact motivates
modifications of RE that allow the cluster means to depend on the response
rates in the clusters. Two approaches were considered, one that includes the
observed response rate as a cluster-level covariate (RERR), and one based on
a nonignorable probit model for response (NI1). We also considered another
nonignorable model estimate of the mean (NI2) that removes the bias of RERR
and NI1 when there is association between response and the survey outcome
within the clusters. To incorporate covariates, we discussed a semiparametric
approach, which addresses the curse of dimensionality problem.
Uniform Confidence Sets for Nonparametric Regression
with Application to Cosmology
In this talk, I will discuss recent work on constructing
confidence sets for an unknown function in nonparametric
regression problems. The goal is to construct a set -- usually a
ball in some space or, alternatively, bands -- that provides
(asymptotically) uniform coverage for the whole function.
Inferences can then be generated by searching this set,
possibly with added constraints from available side information.
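One common formalization (generic notation, not necessarily the construction used here) is a confidence ball around an estimator $\hat{f}_n$,
$$\mathcal{C}_n = \big\{ f : \|f - \hat{f}_n\| \le r_{n,\alpha} \big\}, \qquad \liminf_{n \to \infty}\ \inf_{f \in \mathcal{F}} P_f\big(f \in \mathcal{C}_n\big) \ge 1 - \alpha,$$
with the radius $r_{n,\alpha}$ calibrated so that the coverage guarantee holds uniformly over the function class $\mathcal{F}$.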
One approach I'll describe extends results by Beran and Dumbgen
(1998) to wavelet bases and weighted L2 loss. I will demonstrate
these methods through an analysis of the Cosmic Microwave
Background spectrum, which is used by cosmologists to understand
the physics of the early universe.
While it is possible to construct estimators in nonparametric
regression that adapt to the unknown smoothness of the function,
constructing adaptive confidence sets is not always possible. I'll
discuss this issue and describe an approach to nonparametric
inference, which I call confidence catalogs, in which the end
product is a mapping from assumptions to confidence sets.
This is joint work with Larry Wasserman.
Statistical Modelling of Data on the Connection Between Nonsteroidal
Anti-inflammatory Drugs and Cancer Risk
The purported connection between NSAID use and cancer is currently
receiving major attention in the medical community. Evidence is
accumulating that long-term use of NSAIDs such as aspirin greatly
decreases the risk of many cancers. I will present data from
several studies on the relative risk of colon cancer for NSAID users
vs. non-users, and consider meta-analysis techniques for combining
the estimates of relative risk. Bayesian models are used because
the ease with which they can be fit allows great flexibility. I
will present tools for model selection in this context, apply them
to this dataset, and show how they can be used more generally.
Theory and Inference for Parametric and Semiparametric
Methods in Missing Data Problems
Missing data is very common in various experimental settings,
including clinical trials, surveys and environmental studies. When
covariates and/or the response are missing at random (MAR) or
nonignorably missing, it is quite common to
specify a parametric distribution for the missing covariates as
well as a parametric distribution for the missing data mechanism.
However, since the true distributions are of course unknown, it
would be more desirable to express the class of possible covariate
distributions and/or missing data mechanisms through a class of
semiparametric/nonparametric models in order to obtain more robust
estimators of the regression coefficients of interest. In this
talk, we first consider a class of semiparametric models for the
covariate distribution and/or missing data mechanism for ignorable
and nonignorable missing covariate and/or response data for
general classes of regression models including generalized linear
models (GLMs) and generalized linear mixed models (GLMMs). The
semiparametric model consists of a generalized additive model
(GAM) for the covariate distribution and/or missing data
mechanism. Penalized regression splines are used to express the GAMs as a
generalized linear mixed effects model, in which the variance of the
corresponding random effects provides an intuitive index for choosing between
the semiparametric and parametric model. Maximum likelihood estimates are then
obtained via the EM algorithm.
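As a generic sketch of that device (a truncated-polynomial basis and my own notation, not necessarily the basis used in the talk), a smooth term $f(x)$ in the GAM can be written as
$$f(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} u_k\,(x - \kappa_k)_{+}, \qquad u_k \overset{iid}{\sim} N(0, \sigma_u^{2}),$$
so the spline coefficients enter as random effects of a mixed model; as $\sigma_u^{2} \to 0$ the fit collapses to the parametric (here linear) component, which is why the random-effect variance acts as an index for choosing between the parametric and semiparametric specifications.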
The semiparametric method proposed above can provide reasonably robust
estimates for moderate departures from the true
underlying covariate distribution. However, for severe departures
from the true underlying model, the proposed semiparametric
methods may not perform satisfactorily, and therefore, a class of
fully nonparametric models is proposed for the covariate
distribution with MAR covariates in generalized linear models. The
estimates of the regression coefficients are obtained by
maximizing a pseudo-likelihood function over a sieve space, and
are shown to be consistent, asymptotically normal and semiparametrically
efficient.

Another topic in this talk is to
examine theoretical connections as well as small and large sample
properties of three commonly used model-based methodologies
dealing with missing data, Multiple Imputation (MI), Maximum
Likelihood (ML) and Fully Bayesian (FB) methods. In this work, we
derive small sample and asymptotic expressions of the estimates
and standard errors for these three methods, investigate the small
and large sample properties of the estimates, and fully examine
how these estimates are related for the three approaches in the
linear regression model when the responses or covariates are
missing at random (MAR). We show that when the responses are MAR
in the linear model, the estimates of the regression coefficients
using these three methods are asymptotically equivalent to the
complete case (CC) estimates under very general conditions. With
MAR continuous covariates in the linear model, we derive the
imputation distribution under proper MI, the iterative formula of
the estimates and closed form expressions for the standard errors
under the ML method via the EM algorithm, as well as closed form
full conditional distributions for Gibbs sampling under the FB
framework. Simulations are conducted to demonstrate and compare
the methodologies. A real dataset from a melanoma cancer clinical
trial is analyzed using the proposed methods, and a liver cancer
clinical trial is analyzed using the CC, MI, ML, and FB methods.
A Computationally Quick Bootstrap Procedure for Semiparametric
Models
We introduce a computationally quick bootstrap procedure. Like the
weighted bootstrap, our procedure can be used to generate random draws
that approximate the joint sampling distribution of the parametric and
nonparametric maximum likelihood estimators in a wide variety of
semiparametric models, including several useful biased sampling and
survival analysis models. But the dimension of the maximization problem
for each bootstrapped likelihood is smaller. The procedure can be stated
quite simply. First obtain a valid random draw for the parametric
component of the model, which in many cases can be done at very low
computational cost. Then take the draw for the nonparametric component to
be the maximizer of the weighted bootstrap likelihood with the
parametric component fixed at the parametric draw. This avoids the
computationally costly maximization of the weighted bootstrap likelihoods over
the parametric component that the standard weighted bootstrap requires to
produce its parametric draws. We illustrate the computational savings this
provides in constructing confidence bands for simulated vaccine efficacy
trials.
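A skeleton of the two-step recipe just described; `draw_parametric` and `profile_nonparametric` are hypothetical user-supplied routines (the abstract does not prescribe them), and the exponential bootstrap weights are simply one common choice for a weighted bootstrap.

```python
import numpy as np

def quick_semiparametric_bootstrap(data, draw_parametric, profile_nonparametric,
                                   n_boot=200, seed=0):
    """Two-step bootstrap draws for a semiparametric model (sketch).

    draw_parametric(data, rng): hypothetical routine returning one valid random
        draw of the parametric component (step 1), obtained cheaply, e.g. from
        an asymptotic approximation.
    profile_nonparametric(data, weights, theta): hypothetical routine maximizing
        the weighted bootstrap likelihood over the nonparametric component with
        the parametric component held fixed at theta (step 2).
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    draws = []
    for _ in range(n_boot):
        theta_b = draw_parametric(data, rng)             # step 1: parametric draw
        w = rng.exponential(1.0, size=n)                 # one choice of bootstrap weights
        w /= w.mean()
        eta_b = profile_nonparametric(data, w, theta_b)  # step 2: nonparametric draw
        draws.append((theta_b, eta_b))
    return draws
```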
Gene Selection Using Support Vector Machines with Nonconvex
Penalty
With the development of DNA microarray technology, scientists can now measure
the expression levels of thousands of genes simultaneously in a single
experiment. The fundamental problem of gene selection in cancer studies is to
identify which groups of genes are differentially expressed in normal and
cancerous cells; solving it leads to a better understanding of genetic
signatures in cancer and to improved cancer treatment strategies. Though gene
selection and cancer classification are two closely related problems, many
existing approaches handle them separately by selecting genes prior to
constructing the classifier rule. Our motivation is to provide a unified
procedure for simultaneous gene selection and cancer classification and
achieve high accuracy in both aspects.
The "high dimension low sample size" structure of microarray data demands
more flexible and powerful statistical tools for analysis. In this talk we
introduce a novel type of regularization in support vector machines (SVM)
to identify important genes for cancer classification. A special nonconvex
penalty, called the smoothly clipped absolute deviation penalty, is imposed
on the hinge loss function in the SVM. By systematically thresholding small
estimates to zero, the new procedure eliminates redundant genes
automatically and yields a compact and accurate classifier.
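In symbols, the type of criterion described (standard SCAD notation; the talk's exact formulation may differ) chooses the classifier by minimizing penalized hinge loss,
$$\min_{b,\,w}\ \sum_{i=1}^{n}\big[\,1 - y_i\,(b + w^{T}x_i)\,\big]_{+} \;+\; \sum_{j=1}^{p} p_{\lambda}\big(|w_j|\big),$$
where the SCAD penalty $p_{\lambda}$ is defined through its derivative $p_{\lambda}'(\theta) = \lambda\big\{ I(\theta \le \lambda) + \tfrac{(a\lambda - \theta)_{+}}{(a-1)\lambda}\, I(\theta > \lambda) \big\}$ for some $a > 2$; unlike the $L_1$ penalty, it levels off for large coefficients, so small estimates are thresholded to zero while large ones are left essentially unpenalized.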