Randy Eubank, Texas A&M University

Canonical Correlation Analysis

A general theoretical foundation is developed for canonical correlation analysis. The formulation rests on the classical isometric mapping between the Hilbert space spanned by a second-order stochastic process and the reproducing kernel Hilbert space (RKHS) generated by its covariance kernel. This approach makes it possible to develop a notion of canonical correlation that extends the classical multivariate concept to infinite-dimensional settings. Several infinite-dimensional versions of canonical correlation analysis that have appeared in the literature are shown to be special cases of this general theory. A method for computing the canonical correlations is described and illustrated.
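As a point of reference, the finite-dimensional case that this theory generalizes can be computed directly: the canonical correlations of two random vectors are the singular values of the whitened cross-covariance matrix. A minimal numpy sketch of the sample version, assuming full-rank covariance matrices:

```python
import numpy as np

def canonical_correlations(X, Y):
    """Sample canonical correlations of two multivariate samples:
    singular values of the whitened cross-covariance matrix."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

    def inv_sqrt(S):
        # inverse symmetric square root (assumes S is full rank)
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(M, compute_uv=False)  # sorted, in [0, 1]
```

In the infinite-dimensional setting of the talk, this whitening step is precisely what can fail (the inverse square root of a covariance operator is unbounded), which is one motivation for the RKHS formulation.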

Hani Doss, The Ohio State University

A Bayesian Semi-Parametric Model for Random Effects Meta-Analysis

In meta-analysis there is an increasing trend to explicitly acknowledge the presence of study variability through random effects models. That is, one assumes that for each study, there is a study-specific effect and one is observing an estimate of this latent variable. In a random effects model, one assumes that these study-specific effects come from some distribution, and one can estimate the parameters of this distribution, as well as the study-specific effects themselves. This distribution is most often modelled through a parametric family, usually a family of normal distributions. The advantage of using a normal distribution is that the mean parameter plays an important role, and much of the focus is on determining whether or not this mean is 0. For example, it may be easier to justify funding further studies if it is determined that this mean is not 0. Typically, this normality assumption is made for the sake of convenience, rather than from some theoretical justification, and may not actually hold. We present a Bayesian model in which the distribution of the study-specific effects is modelled through a certain class of nonparametric priors. These priors can be designed to concentrate most of their mass around the family of normal distributions, but still allow for any other distribution. The priors involve a univariate parameter that plays the role of the mean parameter in the normal model. This work arose from a problem in cardiology, which I will discuss and use to illustrate the methodology.

This is joint work with Deborah Burr.
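For context, the parametric-normal baseline that the nonparametric prior relaxes is often fit by the classic DerSimonian-Laird moment estimator. A minimal sketch of that frequentist baseline (not the talk's Bayesian procedure), taking study effect estimates and their standard errors as input:

```python
import numpy as np

def dersimonian_laird(est, se):
    """DerSimonian-Laird random-effects meta-analysis: a moment
    estimate of the between-study variance tau^2, then an inverse-
    variance weighted mean of the study-specific effect estimates."""
    w = 1.0 / se**2
    mu_fe = np.sum(w * est) / np.sum(w)          # fixed-effects mean
    Q = np.sum(w * (est - mu_fe) ** 2)           # heterogeneity statistic
    k = len(est)
    tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1.0 / (se**2 + tau2)                  # random-effects weights
    mu = np.sum(w_re * est) / np.sum(w_re)
    se_mu = np.sqrt(1.0 / np.sum(w_re))
    return mu, se_mu, tau2
```

The interval mu ± 1.96·se_mu is then used to judge whether the overall mean effect differs from 0, which is the role the univariate parameter of the nonparametric prior plays in the talk's model.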

Hernando Ombao, University of Illinois at Urbana-Champaign

Analysis of Non-Stationary Time Series: An Overview of the SLEX Methods

Many biological and physical signals are non-stationary in nature. For example, brain waves recorded during an epileptic seizure have waveforms whose amplitude (variance) and oscillatory behavior (spectral distribution) change over time. This talk will address some of the interesting statistical problems in analyzing such signals, namely, (i) dimension reduction for massive multivariate non-stationary signals and (ii) feature extraction and selection for classification and discrimination.

I will present an overview of a coherent and unified body of statistical methods that are based on the SLEX (smooth localized complex exponentials) library. The SLEX are time-localized Fourier waveforms with multi-resolution support. The SLEX library provides a systematic and efficient way of extracting transient spectral and cross-spectral features. In addition, the SLEX methods are able to handle massive data sets because they utilize computationally efficient algorithms. As a matter of practical importance, the SLEX methods give results that are easy to understand because they are time-dependent analogues of the classical Fourier methods for stationary signals. Finally, under the SLEX models, we develop theoretical results of consistency for spectral estimation and classification.

The SLEX methods will be illustrated using biological and physical data sets, namely, brain waves, fMRI time series, a speech signal and seismic waves recorded during earthquake and explosion events. This talk will conclude with open and challenging problems in signal analysis and the potential of SLEX in addressing these.
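To give a concrete flavor of time-localized spectral analysis, the sketch below computes periodograms over non-overlapping blocks of a signal, so the spectrum is allowed to change from block to block. This is only a crude stand-in for the SLEX transform, which uses smooth, orthogonal windows with multi-resolution support rather than raw blocks:

```python
import numpy as np

def block_spectra(x, block_len):
    """Crude time-localized spectral estimate: split the signal into
    non-overlapping blocks and compute a periodogram in each block.
    Rows index time blocks; columns index frequency bins."""
    n_blocks = len(x) // block_len
    blocks = np.reshape(x[:n_blocks * block_len], (n_blocks, block_len))
    return np.abs(np.fft.rfft(blocks, axis=1)) ** 2 / block_len
```

For a signal whose dominant frequency changes midway, the peak of the periodogram moves to a higher frequency bin in the later blocks, which is the kind of transient spectral feature the SLEX library is designed to extract systematically.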

Mikyoung Jun, University of Chicago

Models for Space-time Processes and Their Application to Air Pollution

My talk consists of two main topics: (i) numerical model evaluation in air quality and (ii) space-time covariance functions on spheres.

(i) We have developed a new method of evaluating numerical models in air quality using monitoring data. For numerical models in air quality, it is important to evaluate the numerical model output in the space-time context and especially to check how the numerical model follows the dynamics of the real process. We suggest that by comparing certain space-time correlations from observations with those from numerical model output, we can achieve this goal. I will demonstrate how our method is applied to a numerical model called CMAQ for sulfate levels over North America.

(ii) For space-time processes on global or large scales, it is critical to use models that respect the Earth's spherical shape. However, there has been almost no research in this regard. We have developed a new class of space-time covariance functions on the sphere crossed with time, constructed as a sum of independent processes, where each process is obtained by applying a first-order differential operator to a fully symmetric process on the sphere crossed with time. The resulting covariance functions can produce various types of space-time interactions and give different covariance structures along different latitudes. Our approach yields explicit expressions for the covariance functions and can also be applied to other spatial domains such as flat surfaces. I will show the result of fitting our new covariance functions to observed sulfate levels.

Finally, I will describe how we build a space-time model that combines numerical model output and observations to produce a space-time map of air pollution levels. The information on the space-time covariance structure obtained from the numerical model evaluation procedure is useful for building this model. Moreover, the space-time covariance functions on spheres play a critical role here because of the large spatial domain.

Yvonne H.S. Ho, The University of Hong Kong

Bootstrap and Smoothing: Confidence Intervals for Population Quantiles

The seminal work of Efron (1979) laid the foundation for a new branch of modern statistical analysis, namely the bootstrap. Since then, bootstrap and smoothing methods have become important practical tools in contemporary statistical analysis. The bootstrap provides a systematic, resampling-based way to estimate the standard errors of estimators, while smoothing concerns the use of kernel functions to smooth density estimators. In this talk, I shall first provide a brief introduction to the concepts of bootstrap and smoothing and their roles in statistical estimation and analysis. Then, I shall discuss an important research area combining the two, namely the construction of confidence intervals for population quantiles.

Ryan Elmore, The Australian National University

A Fully Nonparametric Test for One-Way Layouts

We consider the use of a vector of sign statistics as the basis of a nonparametric test for equality of distributions in one-way layouts. An important feature of this test is its ability to detect a broad range of alternatives, including scale and shape differences. In this scenario, the data consist of several independent measurements on each treatment or subject. We will present finite sample and asymptotic distribution theory for our test statistics and discuss a follow-up multiple comparisons procedure based on a mixture model approximation. Finally, the methods are illustrated using two examples from the literature.
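One simple construction in the spirit of a vector of sign statistics (illustrative only, not the talk's exact statistic) compares the within-subject order statistics of the replicated measurements across two treatments; comparing order statistics rather than means lets the vector react to scale and shape differences, not just location shifts:

```python
import numpy as np

def order_statistic_signs(x, y):
    """x, y: (n_subjects, m_replicates) arrays for two treatments.
    For each order statistic k, average sign(x_(k) - y_(k)) over all
    cross-treatment subject pairs.  Under equality of distributions
    each entry of the returned vector has mean zero."""
    xs = np.sort(x, axis=1)                     # per-subject order statistics
    ys = np.sort(y, axis=1)
    diffs = xs[:, None, :] - ys[None, :, :]     # all subject pairs
    return np.sign(diffs).mean(axis=(0, 1))
```

With equal means but a larger scale in the first treatment, the minimum-based entry goes negative and the maximum-based entry goes positive, a pattern a mean-based test would miss entirely.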

Debashis Paul, Stanford University

Principal Components Analysis for High Dimensional Data

Suppose we have i.i.d. observations from a multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$. We consider the problem of estimating the leading eigenvectors of $\Sigma$ when the dimension $p$ of the observation vectors increases with the sample size $n$. We work under the setup where the covariance matrix is a finite rank perturbation of identity. We show that even though ordinary principal components analysis may fail to yield a consistent estimator of the eigenvectors, if the data can be sparsely represented in some known basis, then a scheme based on first selecting a set of significant coordinates and then applying PCA to the submatrix of the sample covariance matrix corresponding to the selected coordinates gives better estimates. Under suitable sparsity restrictions, we show that the risk of the proposed estimator has the optimal rate of convergence when measured in a squared-error type loss. We also state some results about the behavior of the eigenvalues and eigenvectors of the sample covariance matrix when $p/n$ converges to a positive constant.
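The select-then-PCA scheme can be sketched as follows, using the simplest possible selection rule (keep the coordinates with the largest sample variances; the actual thresholding rule in the work may differ):

```python
import numpy as np

def sparse_pca_leading(X, n_coords):
    """Leading-eigenvector estimate for high-dimensional data:
    (1) keep the n_coords coordinates with largest sample variance,
    (2) run PCA on the corresponding submatrix of the sample
        covariance matrix,
    (3) embed the resulting eigenvector back into R^p with zeros
        in the unselected coordinates."""
    S = np.cov(X, rowvar=False)
    keep = np.argsort(np.diag(S))[-n_coords:]
    w, V = np.linalg.eigh(S[np.ix_(keep, keep)])
    v = np.zeros(X.shape[1])
    v[keep] = V[:, -1]      # eigenvector of the largest eigenvalue
    return v
```

In a spiked model with a sparse leading eigenvector and p comparable to n, restricting PCA to the selected coordinates removes most of the noise accumulation over the p - n_coords irrelevant directions.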

Hui Zou, Stanford University

Regularization and Variable Selection via the Elastic Net

In the practice of statistical modeling, it is often desirable to have an accurate predictive model with a sparse representation. The lasso is a promising model building technique, performing continuous shrinkage and variable selection simultaneously. Although the lasso has shown success in many situations, it may produce unsatisfactory results in some scenarios: (1) the number of predictors (greatly) exceeds the number of observations; (2) the predictors are highly correlated and form "groups". A typical example is the gene selection problem in microarray analysis.
We propose the elastic net, a new regularization and variable selection method. Real-world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors is much larger than the number of samples. We have implemented an algorithm called LARS-EN for efficiently computing the entire elastic net regularization path, much like the LARS algorithm does for the lasso. In this talk, I will also describe some interesting applications of the elastic net in other statistical areas, such as sparse principal component analysis and margin-based kernel classifiers.
This is joint work with Trevor Hastie.
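The elastic net criterion itself is simple to state and, for fixed penalties, can be solved by coordinate descent; the sketch below fits the naive elastic net this way (LARS-EN, by contrast, computes the entire regularization path):

```python
import numpy as np

def elastic_net(X, y, lam1, lam2, n_iter=500):
    """Coordinate descent for the (naive) elastic net criterion
    0.5*||y - X b||^2 + lam1*||b||_1 + 0.5*lam2*||b||^2."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # partial residual
            rho = X[:, j] @ r
            # soft-threshold (lasso part), then shrink (ridge part)
            b[j] = np.sign(rho) * max(abs(rho) - lam1, 0.0) / (col_sq[j] + lam2)
    return b
```

With two nearly identical predictors, the ridge term pulls their coefficients together (the grouping effect), while the lasso term zeroes out irrelevant predictors, which is exactly the combination of behaviors described above.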

Ying Yuan, University of Michigan

Model-Based Estimates of the Finite Population Mean for Two-Stage Cluster Samples with Unit Nonresponse

We proposed new model-based methods for unit nonresponse in two-stage cluster samples. Specifically, we showed that the usual random-effects model estimator of the population mean (RE) is biased unless nonresponse is missing completely at random, which makes the often unrealistic assumption that the response rates are unrelated to cluster characteristics. This fact motivates modifications of RE that allow the cluster means to depend on the response rates in the clusters. Two approaches were considered, one that includes the observed response rate as a cluster-level covariate (RERR), and one based on a nonignorable probit model for response (NI1). We also considered another nonignorable model estimate of the mean (NI2) that removes the bias of RERR and NI1 when there is association between response and the survey outcome within the clusters. To incorporate covariates, we discussed a semiparametric approach, which addresses the curse of dimensionality problem.

Christopher Genovese, Carnegie Mellon University

Uniform Confidence Sets for Nonparametric Regression with Application to Cosmology

In this talk, I will discuss recent work on constructing confidence sets for an unknown function in nonparametric regression problems. The goal is to construct a set -- usually a ball in some space or, alternatively, bands -- that provides (asymptotically) uniform coverage for the whole function. Inferences can then be generated by searching this set, possibly with added constraints from available side information.

One approach I'll describe extends results by Beran and Dumbgen (1998) to wavelet bases and weighted L2 loss. I will demonstrate these methods through an analysis of the Cosmic Microwave Background spectrum, which is used by cosmologists to understand the physics of the early universe.

While it is possible to construct estimators in nonparametric regression that adapt to the unknown smoothness of the function, constructing adaptive confidence sets is not always possible. I'll discuss this issue and describe an approach to nonparametric inference, which I call confidence catalogs, in which the end product is a mapping from assumptions to confidence sets.

This is joint work with Larry Wasserman.

Deborah Burr, The Ohio State University

Statistical Modelling of Data on the Connection Between Nonsteroidal Anti-inflammatory Drugs and Cancer Risk

The purported connection between NSAID use and cancer is currently receiving major attention in the medical community. Evidence is accumulating that long-term use of NSAIDs such as aspirin greatly decreases the risk of many cancers. I will present data from several studies on the relative risk of colon cancer for NSAID users vs. non-users, and consider meta-analysis techniques for combining the estimates of relative risk. Bayesian models are used because the ease with which they can be fit allows great flexibility. I will present tools for model selection in this context, apply them to this dataset, and show how they can be used more generally.

Qingxia Chen, University of North Carolina at Chapel Hill

Theory and Inference for Parametric and Semiparametric Methods in Missing Data Problems

Missing data are very common in experimental settings, including clinical trials, surveys and environmental studies. When covariates and/or the response are missing at random (MAR) or nonignorably missing, it is quite common to specify a parametric distribution for the missing covariates as well as a parametric distribution for the missing data mechanism. However, since the true distributions are unknown, it is more desirable to express the class of possible covariate distributions and/or missing data mechanisms through a class of semiparametric/nonparametric models in order to obtain more robust estimators of the regression coefficients of interest. In this talk, we first consider a class of semiparametric models for the covariate distribution and/or missing data mechanism for ignorable and nonignorable missing covariate and/or response data in general classes of regression models, including generalized linear models (GLMs) and generalized linear mixed models (GLMMs). The semiparametric model consists of a generalized additive model (GAM) for the covariate distribution and/or missing data mechanism. Penalized regression splines are used to express the GAMs as a generalized linear mixed effects model, in which the variance of the corresponding random effects provides an intuitive index for choosing between the semiparametric and parametric models. Maximum likelihood estimates are then obtained via the EM algorithm. The proposed semiparametric method provides reasonably robust estimates under moderate departures from the true underlying covariate distribution. However, for severe departures from the true underlying model the semiparametric methods may not perform satisfactorily, and therefore a class of fully nonparametric models is proposed for the covariate distribution with MAR covariates in generalized linear models.
The estimates of the regression coefficients are obtained by maximizing a pseudo-likelihood function over a sieve space, and are shown to be consistent, asymptotically normal and semiparametrically most efficient. Another topic in this talk is the examination of theoretical connections as well as small- and large-sample properties of three commonly used model-based methodologies for missing data: multiple imputation (MI), maximum likelihood (ML) and fully Bayesian (FB) methods. We derive small-sample and asymptotic expressions of the estimates and standard errors for these three methods, investigate the small- and large-sample properties of the estimates, and examine how these estimates are related across the three approaches in the linear regression model when the responses or covariates are missing at random (MAR). We show that when the responses are MAR in the linear model, the estimates of the regression coefficients using these three methods are asymptotically equivalent to the complete case (CC) estimates under very general conditions. With MAR continuous covariates in the linear model, we derive the imputation distribution under proper MI, the iterative formula of the estimates and closed-form expressions for the standard errors under the ML method via the EM algorithm, as well as closed-form full conditional distributions for Gibbs sampling under the FB framework. Simulations are conducted to demonstrate and compare the methodologies. A real dataset from a melanoma cancer clinical trial is analyzed using the proposed methods, and a liver cancer clinical trial is analyzed using the CC, MI, ML and FB methods.

John Dixon, Florida State University

A Computationally Quick Bootstrap Procedure for Semiparametric Models

We introduce a computationally quick bootstrap procedure. Like the weighted bootstrap, our procedure can be used to generate random draws that approximate the joint sampling distribution of the parametric and nonparametric maximum likelihood estimators in a wide variety of semiparametric models, including several useful biased sampling and survival analysis models. But the dimension of the maximization problem for each bootstrapped likelihood is smaller. The procedure can be stated quite simply. First, obtain a valid random draw for the parametric component of the model, which in many cases can be done at very low computational cost. Then take the draw for the nonparametric component to be the maximizer of the weighted bootstrap likelihood with the parametric component fixed at the parametric draw. This avoids the costly computation of the parametric maximizers of the weighted bootstrap likelihoods that the weighted bootstrap requires for its parametric draws. We illustrate the computational savings this provides in constructing confidence bands for simulated vaccine efficacy trials.

Cheolwoo Park, University of Florida

Gene Selection Using Support Vector Machines with Nonconvex Penalty

With the development of DNA microarray technology, scientists can now measure the expression levels of thousands of genes simultaneously in a single experiment. The fundamental problem of gene selection in cancer studies is to identify which groups of genes are differentially expressed in normal and cancerous cells; solving it leads to a better understanding of genetic signatures in cancer and to improvements in cancer treatment strategies. Though gene selection and cancer classification are two closely related problems, many existing approaches handle them separately by selecting genes prior to constructing the classification rule. Our motivation is to provide a unified procedure for simultaneous gene selection and cancer classification that achieves high accuracy in both aspects. The "high dimension, low sample size" structure of microarray data demands more flexible and powerful statistical tools for analysis. In this talk we introduce a novel type of regularization in support vector machines (SVMs) to identify important genes for cancer classification. A special nonconvex penalty, called the smoothly clipped absolute deviation (SCAD) penalty, is imposed on the hinge loss function in the SVM. By systematically thresholding small estimates to zero, the new procedure eliminates redundant genes automatically and yields a compact and accurate classifier.
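As a rough illustration of penalized sparse classification, the sketch below fits a linear SVM by subgradient descent on the hinge loss with an L1 penalty and then hard-thresholds small coefficients. This is only a crude stand-in for the SCAD penalty, which likewise sets small coefficients to zero but, unlike L1, leaves large coefficients nearly unshrunk:

```python
import numpy as np

def sparse_linear_svm(X, y, lam=0.1, lr=0.01, n_iter=500, thresh=0.01):
    """Linear SVM via subgradient descent on the average hinge loss
    plus an L1 penalty, followed by hard-thresholding of small
    coefficients (a crude stand-in for a nonconvex SCAD penalty)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        margin = y * (X @ w)
        # subgradient of mean hinge loss over margin-violating points,
        # plus the subgradient of the L1 penalty
        grad = -(X * y[:, None])[margin < 1].sum(axis=0) / n + lam * np.sign(w)
        w -= lr * grad
    w[np.abs(w) < thresh] = 0.0   # eliminate near-zero coefficients
    return w
```

In a toy setting where only the first two of many features drive the class label, the fitted weight vector keeps those two features and zeroes out almost all of the irrelevant ones, mirroring the automatic gene elimination described above.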