Andre Khuri, University of Florida

Response Surface Methods for Multiresponse Experiments

The purpose of this seminar is to provide an overview of so-called multiresponse surface methodology, an area of the design and analysis of experiments concerned with multiresponse experiments. In such experiments, several response variables can be measured, or observed, at each setting of a group of control variables. Quite often the responses are correlated, so it is more appropriate to use multivariate techniques to analyze data on such responses. The seminar will review the main topics in this area.
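As a minimal illustration of the multiresponse setting (not part of the seminar), the sketch below fits a second-order response surface to two simulated, correlated responses measured at the same settings of two control variables; all variable names and data are hypothetical.

```python
# Minimal sketch: fit a second-order response surface to each of two
# correlated responses measured at the same settings of two control variables.
# The data below are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 30
x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)                  # control variables
noise = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], n)   # correlated errors
y1 = 5 + 2 * x1 - x2 + 1.5 * x1 * x2 - x1 ** 2 + noise[:, 0]           # response 1
y2 = 3 - x1 + 2 * x2 + 0.5 * x1 * x2 - x2 ** 2 + noise[:, 1]           # response 2

# Second-order model matrix: 1, x1, x2, x1*x2, x1^2, x2^2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])
Y = np.column_stack([y1, y2])

# Multivariate least squares: one coefficient column per response
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ B
print("coefficients (columns = responses):\n", np.round(B, 2))
print("residual correlation between responses:",
      np.round(np.corrcoef(resid.T)[0, 1], 2))
```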
Seminar page

George Casella, University of Florida

Objective Bayes Variable Selection (or What I Did on My Spanish Vacation)

A novel fully automatic Bayesian procedure for variable selection in normal regression models is proposed, along with computational strategies for model posterior evaluation. A stochastic search algorithm is given, based on the Metropolis-Hastings Algorithm, that has a stationary distribution proportional to the model posterior probabilities. The procedure is illustrated on both simulated and real examples.
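A hedged sketch of the kind of stochastic search described above: a Metropolis-Hastings walk over variable-inclusion vectors whose stationary distribution is proportional to a model score. The BIC-based score and flat model prior used here are stand-in assumptions, not the talk's objective Bayes posterior.

```python
# Hedged sketch: Metropolis-Hastings search over variable-inclusion vectors.
# The score uses a BIC approximation to the model posterior (flat model prior)
# as a stand-in; the seminar's fully automatic objective prior is not reproduced.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

def log_score(gamma):
    """Approximate log model posterior via BIC of the OLS fit."""
    k = gamma.sum()
    if k == 0:
        rss = np.sum((y - y.mean()) ** 2)
    else:
        Xg = X[:, gamma.astype(bool)]
        beta, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        rss = np.sum((y - Xg @ beta) ** 2)
    return -0.5 * (n * np.log(rss / n) + k * np.log(n))

gamma = np.zeros(p, dtype=int)
current = log_score(gamma)
visits = {}
for _ in range(5000):
    prop = gamma.copy()
    j = rng.integers(p)
    prop[j] = 1 - prop[j]                      # flip one inclusion indicator
    new = log_score(prop)
    if np.log(rng.uniform()) < new - current:  # MH acceptance (symmetric proposal)
        gamma, current = prop, new
    key = tuple(gamma)
    visits[key] = visits.get(key, 0) + 1

best = max(visits, key=visits.get)
print("most visited model:", best, "with", visits[best], "visits")
```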
Seminar page

Haipeng Shen, The Wharton School, University of Pennsylvania

Statistical Analysis of a Telephone Call Center: A Queueing Science Perspective

A call center is a service network in which agents provide telephone-based services. Customers that seek such services may be delayed in tele-queues, which are invisible to them. The talk summarizes an analysis of a unique record of call center operations. The data comprise a complete operational history of a small banking call center, call by call, over a full year. Taking the perspective of queueing theory, we decompose the service process into three fundamental components: arrivals, waiting times, and service durations. Each component involves different basic mathematical structures and requires a different style of statistical analysis. Some of the key results will be sketched, along with descriptions of the varied techniques required. In conclusion we survey how the characteristics deduced from the statistical analyses form the building blocks for theoretically interesting and practically useful mathematical models for call center operations. This reports on joint work with Larry Brown, Linda Zhao and Noah Gans from Wharton, and Avishai Mandelbaum, Anat Sakov and Sergey Zeltyn from Technion (Israel).
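A sketch of the three-way decomposition on simulated call records; the variable names (arrival_time, wait, service) and the distributions used are illustrative assumptions, not the study's data.

```python
# Sketch of the arrivals / waiting-times / service-durations decomposition
# on simulated call records; field names and distributions are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 2000
arrival_time = np.sort(rng.uniform(0, 8 * 3600, n))    # seconds within an 8-hour day
wait = rng.exponential(30, n)                          # tele-queue waiting times
service = rng.lognormal(mean=5.0, sigma=0.7, size=n)   # service durations

# 1. Arrivals: within a short interval, inter-arrival times should be roughly
#    exponential if arrivals behave like an (inhomogeneous) Poisson process.
interarrivals = np.diff(arrival_time[:300])
print("rough exponential check on inter-arrivals (KS p-value):",
      round(stats.kstest(interarrivals, "expon",
                         args=(0, interarrivals.mean())).pvalue, 3))

# 2. Waiting times: empirical summary of the tele-queue delay.
print("median / 90th percentile wait (s):",
      np.percentile(wait, [50, 90]).round(1))

# 3. Service durations: a lognormal shape is used here only as an illustration.
logdur = np.log(service)
print("normality of log service times (Shapiro p-value):",
      round(stats.shapiro(logdur[:500]).pvalue, 3))
```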
Seminar page

Ji Zhu, Stanford University

Microarray Classification: Support Vector Machine, Kernel Logistic Regression and Import Vector Machine

In the first part of the talk, I will talk about microarray classification. Classification of patient samples is an important aspect of cancer diagnosis and treatment. We propose a simple model, penalized logistic regression (PLR), for the microarray cancer diagnosis problem. A fast algorithm for solving PLR is described. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods used in the literature, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. In the second part of the talk, I will talk about the support vector machine, kernel logistic regression and import vector machine. The support vector machine (SVM) is known for its good performance in binary classification, but its extension to multi-class classification is still an on-going research issue. In this talk, we propose a new approach for classification, called the import vector machine (IVM), which is built on kernel logistic regression (KLR). We show that the IVM not only performs as well as the SVM in binary classification, but also can naturally be generalized to the multi-class case. Furthermore, the IVM provides an estimate of the underlying probability. Similar to the ``support points'' of the SVM, the IVM model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the SVM. This gives the IVM a computational advantage over the SVM, especially when the size of the training data set is large.
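A hedged sketch of penalized logistic regression combined with recursive feature elimination, using scikit-learn on simulated data in place of microarray expression; the regularization strength and the number of retained genes are arbitrary illustrative choices.

```python
# Hedged sketch: L2-penalized logistic regression with recursive feature
# elimination (RFE), using simulated data in place of microarray expression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, p, informative = 60, 200, 5
X = rng.normal(size=(n, p))
w = np.zeros(p)
w[:informative] = 2.0                                   # only the first 5 "genes" matter
y = (X @ w + rng.normal(size=n) > 0).astype(int)

plr = LogisticRegression(penalty="l2", C=0.1, max_iter=5000)  # ridge-penalized logistic regression
rfe = RFE(estimator=plr, n_features_to_select=10, step=0.1)   # drop a fraction of genes per pass
rfe.fit(X, y)

selected = np.flatnonzero(rfe.support_)
print("genes retained by PLR + RFE:", selected)
print("cross-validated accuracy:",
      cross_val_score(plr, X[:, selected], y, cv=5).mean().round(2))
```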
Seminar page

Rick Cleary, Bentley College

An Overview of Benford's Law with Applications to Auditing

Benford's law proposes a distribution of digits, most notably first digits, in measurements that span many orders of magnitude. Auditors have begun using Benford's law as part of fraud detection schemes in a variety of settings. It is well known, however, that Benford's law does not apply under certain conditions, such as when the data are all of the same order of magnitude. In this presentation we give an overview of Benford's law and some ways to use it as a teaching tool. We discuss some diagnostic procedures for deciding when Benford's law should apply, and the use of these diagnostics in practice by auditors.
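A small sketch of the Benford first-digit distribution and a chi-square goodness-of-fit check of the kind an auditor might run; the simulated "amounts" are an assumption chosen to span several orders of magnitude.

```python
# Sketch: expected Benford first-digit frequencies and a chi-square check.
# Lognormal "amounts" only approximately follow Benford's law; the check is
# illustrative, not a calibrated audit procedure.
import numpy as np
from scipy import stats

benford = np.log10(1 + 1 / np.arange(1, 10))      # P(first digit = d), d = 1..9

rng = np.random.default_rng(4)
amounts = rng.lognormal(mean=6, sigma=2, size=1000)

# First significant digit of each amount.
first_digit = (amounts / 10 ** np.floor(np.log10(amounts))).astype(int)
observed = np.bincount(first_digit, minlength=10)[1:10]

chi2, p = stats.chisquare(observed, f_exp=benford * observed.sum())
print("observed proportions:", np.round(observed / observed.sum(), 3))
print("Benford proportions: ", np.round(benford, 3))
print("chi-square p-value:  ", round(p, 3))
```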
Seminar page

Yun Ju Sung, University of Minnesota

Misspecification Error in Missing Data Models

When a statistical model is incorrect, the MLE is inconsistent, converging to the minimizer $\theta^*$ of the Kullback-Leibler information. Any difference between the density $f_{\theta^*}$ and the true density $g$ is error due to model misspecification. We propose a Monte Carlo method to find $\theta^*$ when there are missing data and the observed data likelihood does not have a closed form. The motivating example is a set of models for mutation accumulation data from statistical genetics. We prove consistency and asymptotic normality of the Monte Carlo estimate of $\theta^*$. The method involves generating two samples, the first for observed data from the true density and the second for missing data from an importance sampling density. The entire second sample is used with each member of the first sample. We show that this results in an asymptotic variance for the estimate smaller than that obtained by using the first sample only once. If nature, instead of a computer, generates the first sample, then our estimate is a Monte Carlo approximation to the MLE. In that case its asymptotic variance reflects both the sampling variability of the first sample and the Monte Carlo variability of the second sample.
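A hedged sketch of the two-sample Monte Carlo idea on a toy missing-data model (X missing, Y observed); the specific model, the importance density, and the sample sizes are illustrative assumptions, not those of the talk.

```python
# Hedged sketch of the two-sample Monte Carlo idea on a toy missing-data
# model: complete data (X, Y) with X missing, X ~ N(theta, 1), Y | X ~ N(X, 1).
# Sample 1: observed y_i from a (possibly different) true density g.
# Sample 2: missing x_j from an importance density h, reused for every y_i.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(5)
y = rng.lognormal(mean=0.5, sigma=0.5, size=500)   # true density g: not in the model
x_imp = rng.normal(loc=1.0, scale=2.0, size=2000)  # sample 2 from importance density h
h_dens = stats.norm.pdf(x_imp, loc=1.0, scale=2.0)

def mc_neg_loglik(theta):
    # f_theta(y_i) = E_h[ phi(y_i - X) * phi(X - theta) / h(X) ], approximated
    # with the whole importance sample for each y_i.
    w = stats.norm.pdf(x_imp, loc=theta, scale=1.0) / h_dens
    fy = stats.norm.pdf(y[:, None], loc=x_imp[None, :], scale=1.0) @ w / len(x_imp)
    return -np.sum(np.log(fy))

res = optimize.minimize_scalar(mc_neg_loglik, bounds=(-5, 5), method="bounded")
print("Monte Carlo estimate of theta*: %.3f" % res.x)
print("E_g[Y] (the KL minimizer for this location model): %.3f" % y.mean())
```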
Seminar page

Surajit Ray, Penn State University

Distance-Based Model Selection with Application to the Analysis of Gene Expression Data

Multivariate mixture models provide well known and widely used methods for density estimation, model-based clustering, and explanations for the data generation process. However, the problem of choosing the number of components of a mixture model in a statistically meaningful way is still a subject of considerable research. I introduce several rules for selecting a finite mixture model, and hence estimating the number of components, using quadratic distance functions. In one approach, the goal is to find the minimal number of components that are needed to adequately describe the true distribution, where the decision is based on a nonparametric confidence set for the true distribution. Two alternative approaches for estimating the number of components are based on density concordance and risk analysis. In the density concordance approach the resulting concordance curves have properties which determine the amount of variability (in the empirical density) explained by the model. In the risk analysis approach to model selection, I demonstrate how the distance can be decomposed into two parts pertaining to (1) the lack of fit of the model and (2) the cost of parameter estimation. Applications of my methods to the analysis of gene expression data will be presented during the talk.
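A hedged sketch, not the talk's quadratic-distance rules: it compares Gaussian mixture fits with different numbers of components through an integrated squared (quadratic) distance to a kernel density estimate, on simulated one-dimensional data.

```python
# Hedged sketch (not the talk's method): choose the number of normal mixture
# components by the integrated squared distance between the fitted mixture
# density and a kernel density estimate, on simulated 1-D data.
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.8, 200)])  # two groups
grid = np.linspace(x.min() - 1, x.max() + 1, 400)
kde = stats.gaussian_kde(x)(grid)                 # nonparametric reference density

for k in range(1, 5):
    gm = GaussianMixture(n_components=k, random_state=0).fit(x[:, None])
    dens = np.exp(gm.score_samples(grid[:, None]))        # fitted mixture density
    l2 = np.trapz((dens - kde) ** 2, grid)                # quadratic (L2) distance
    print(f"k = {k}: L2 distance to KDE = {l2:.4f}, BIC = {gm.bic(x[:, None]):.1f}")
```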
Seminar page

Trevor Park, Cornell University

Rotation of Principal Components: A Penalized Likelihood Approach

Principal component analysis remains a standard tool for exploratory multivariate analysis, recently finding renewed use in exploration of functional data. To facilitate interpretation of individual components, investigators in many applied sciences sometimes choose to perform a rotation on selected groups of components --- usually a subspace-preserving orthogonal transformation of the component direction vectors that brings them into closer alignment with an easily interpretable orthogonal basis for the space. Until recently, principal component rotation has not received much attention from statisticians, perhaps because it apparently lacks a formal statistical foundation. This talk will introduce a new framework for rotation via maximizing a penalized profile likelihood based on the multivariate Gaussian case. Likelihood provides an appropriate quantification of component ill-definedness, while rotation criteria like those used in factor analysis can serve as penalty functions for encouraging component interpretability. A single penalty parameter smoothly controls the degree of rotation, from the original principal components to components fully aligned with the axes, with ill-defined components having the greatest susceptibility to rotation. The connection between likelihood and approximate confidence regions provides a new way to measure the degree to which rotated components are consistent with the data. Although the problem of maximizing the penalized likelihood is generally not analytically tractable, numerical solutions can be computed efficiently to any level of accuracy with the assistance of some recently introduced algorithms designed for orthogonality-constrained optimization.
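A hedged sketch of the classical rotation step only: varimax rotation of a selected group of principal component loadings, the kind of factor-analysis criterion the talk treats as a penalty; the penalized profile likelihood itself is not reproduced, and the simulated data are hypothetical.

```python
# Hedged sketch: varimax rotation of a selected group of principal component
# loadings. The talk frames such rotation criteria as penalties in a penalized
# profile likelihood; this is only the classical rotation step.
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Classical varimax: orthogonal rotation encouraging 'simple' loadings."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0))))
        R = u @ vt
        if s.sum() < d * (1 + tol):
            break
        d = s.sum()
    return loadings @ R

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 6))
X[:, :3] += rng.normal(size=(200, 1)) * 2        # a shared factor among variables 0-2
X -= X.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:2].T                             # loadings of the first two PCs
print("original loadings:\n", np.round(components, 2))
print("varimax-rotated loadings:\n", np.round(varimax(components), 2))
```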
Seminar page

Rajeshwari Natarajan, Southern Methodist University

Inverse Gaussian and Gaussian Analogies

The Inverse Gaussian (IG) distribution is potentially more useful in practice than the better known Gaussian distribution. The IG distribution goes back to Schrödinger (1915), Smoluchowski (1915), Wald (1947) and Tweedie (1947), whereas the normal distribution can be traced to De Moivre (1738), about a century before Gauss popularized it. The two-parameter IG distribution arose out of the analysis of Brownian motion; it is ideally suited for modelling non-negative, positively skewed data and is now used for analyzing data from fields as diverse as ecology and the internet. The distribution is intriguingly similar to the normal distribution in many respects, and the inference methods associated with it use well-known normal-theory entities such as t, chi-square, and F tests. In this talk, we discuss analogies between the Inverse Gaussian and Gaussian distributions, along with some emerging results.
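A small simulation sketch of one such analogy: for an IG(mu, lambda) sample, lambda times the sum of (1/x_i - 1/xbar) follows a chi-square distribution with n - 1 degrees of freedom, paralleling the normal-theory result for the sample variance. The parameter values below are arbitrary.

```python
# Sketch of one Gaussian analogy for the inverse Gaussian distribution:
# lambda * sum(1/x_i - 1/xbar) ~ chi-square with n - 1 degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
mu, lam, n, reps = 2.0, 4.0, 25, 2000

chi_stats = []
for _ in range(reps):
    x = rng.wald(mu, lam, size=n)                 # inverse Gaussian sample
    chi_stats.append(lam * np.sum(1 / x - 1 / x.mean()))

# Compare the simulated statistic with the chi-square(n - 1) reference.
print("simulated mean / theoretical mean:",
      round(np.mean(chi_stats), 2), "/", n - 1)
print("KS p-value against chi2(n-1):",
      round(stats.kstest(chi_stats, "chi2", args=(n - 1,)).pvalue, 3))
```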
Seminar page

David Donoho, Stanford University

Hessian Eigenmaps for Nonlinear Dimensionality Reduction

Suppose I look at many pictures of a face gesturing. I know that underlying each of those pictures there is a set of face muscles with a set of 'parameters' controlling the extension of the muscle. Can I learn, simply from looking at lots of such pictures and with no other assistance, the parameters underlying such pictures? I'll discuss two articles in the journal 'Science' that introduced methods for analysing databases of articulated images... e.g. many pictures of a face gesturing or many pictures of a hand gesturing or pictures of a vehicle from many different positions. The methods, ISOMAP and Local Linear Embedding, claimed to find the hidden `rule' (parametrization) lying behind the image database. The methods form part of the large body of techniques for discovering the structure of data lying on manifolds in high-dimensional space. I'll discuss what I view to be weaknesses in our understanding based on the original articles, along with subsequent research that has clarified these issues. We now have mathematical theory showing that certain classes of image databases (e.g. gesturing cartoon faces) can be analysed perfectly by an improvement of ISOMAP and LLE which we call the Hessian Eigenmap, which I will explain and apply. Joint work with Carrie Grimes.
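A hedged sketch using the Hessian variant of locally linear embedding as implemented in scikit-learn, applied to a synthetic S-curve manifold rather than an image database; the neighbor count and the rank-correlation check are illustrative choices.

```python
# Hedged sketch: the Hessian eigenmap as implemented in scikit-learn
# (LocallyLinearEmbedding with method="hessian"), applied to a synthetic
# manifold rather than an image database.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_s_curve(n_samples=1000, random_state=0)   # 3-D points on a 2-D manifold

# Hessian LLE requires n_neighbors > n_components * (n_components + 3) / 2.
hlle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="hessian")
embedding = hlle.fit_transform(X)

# One recovered coordinate should track the manifold's arc-length parameter t;
# a high rank correlation is a simple sanity check.
corr = max(abs(spearmanr(embedding[:, j], t)[0]) for j in range(2))
print("best rank correlation with the true parameter: %.2f" % corr)
```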
Seminar page

Steven Roberts, Stanford University

Particulate Air Pollution Mortality Time Series Studies

In recent years, there has been much interest in the health effects of particulate air pollution. Many time series studies have shown that increases in particulate air pollution are associated with increases in the expected mortality rate. In this talk, I will review how a typical particulate air pollution mortality time series study is conducted. I will then focus on two potential problems of such studies: mortality displacement and the combining of particulate air pollution data from multiple monitors.
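A hedged sketch of the core regression in such studies: a Poisson model for daily death counts with a particulate term and simple seasonal adjustment, fitted to simulated data; the harmonic trend stands in for the smooth confounder adjustments used in practice.

```python
# Hedged sketch of the core model in such studies: Poisson regression of daily
# death counts on particulate matter (PM), adjusting for seasonal and long-term
# trends. Data are simulated; the harmonic terms stand in for smooth adjustments.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
days = np.arange(365 * 3)
season = np.cos(2 * np.pi * days / 365)
pm = 20 + 8 * season + rng.normal(0, 5, days.size)          # daily PM level
log_mu = np.log(30) + 0.1 * season + 0.0008 * pm            # true mortality rate
deaths = rng.poisson(np.exp(log_mu))

X = sm.add_constant(np.column_stack(
    [pm, season, np.sin(2 * np.pi * days / 365), days / 365]))
fit = sm.GLM(deaths, X, family=sm.families.Poisson()).fit()

# Estimated percentage increase in expected mortality per 10-unit PM increase.
beta_pm = fit.params[1]
print("estimated effect: %.2f%% per 10 ug/m^3 (true %.2f%%)"
      % (100 * (np.exp(10 * beta_pm) - 1), 100 * (np.exp(10 * 0.0008) - 1)))
```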
Seminar page

Haihong Li, Frontier Science and Technology Research Foundation

Local quasi-likelihood method for generalized random curve models for longitudinal data

We consider a class of generalized random curve models for continuous or discrete longitudinal data. The subject-specific smooth curve is decomposed into two components, a population (fixed) curve and a subject-specific (random) curve. The local quasi-likelihood method is developed to fit the proposed models. Our modeling approach allows us to estimate not only the population curve but also the individual curves. We establish asymptotic results for the resulting estimators, from which inference procedures are derived. The proposed models and methods are applied to a longitudinal binary data set from an AIDS clinical study. We also conduct a simulation study of the finite sample properties of the proposed estimators.
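A hedged sketch of one ingredient only: a kernel-weighted (local) logistic fit of the population curve for a simulated binary longitudinal outcome; the subject-specific random curves and the full local quasi-likelihood machinery are not reproduced, and the bandwidth is an arbitrary choice.

```python
# Hedged sketch: local (kernel-weighted) logistic regression estimate of the
# population curve for a binary longitudinal outcome. The subject-specific
# random curves of the talk's model are not reproduced here.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(10)
n_subj, n_obs = 80, 10
t = np.tile(np.linspace(0, 1, n_obs), n_subj)             # observation times
subj_effect = np.repeat(rng.normal(0, 0.7, n_subj), n_obs)

def true_curve(s):
    return np.sin(2 * np.pi * s)                           # population curve, logit scale

y = rng.binomial(1, 1 / (1 + np.exp(-(true_curve(t) + subj_effect))))

def local_logit(t0, bandwidth=0.15):
    """Local linear logistic fit at time t0 with Gaussian kernel weights."""
    w = np.exp(-0.5 * ((t - t0) / bandwidth) ** 2)
    X = (t - t0)[:, None]                                  # local linear term
    fit = LogisticRegression(C=1e6).fit(X, y, sample_weight=w)
    return fit.intercept_[0]                               # logit of the curve at t0

grid = np.linspace(0.1, 0.9, 5)
est = [local_logit(t0) for t0 in grid]
print("time:      ", np.round(grid, 2))
print("estimated: ", np.round(est, 2))
# The marginal curve is slightly attenuated by the random subject effects.
print("true logit:", np.round(true_curve(grid), 2))
```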
Seminar page

Eric Chicken, Florida State University

Block-dependent thresholding in wavelet regression

Nonparametric regression via wavelets is usually implemented under the assumptions of dyadic sample size, equally spaced and fixed sample points, and i.i.d. normal errors. By applying linear transformations to the data and block thresholding to the discrete wavelet transform of the data, one can still achieve optimal rates of convergence, fast computational time, and spatial adaptivity for functions lying in Hölder spaces, even for data that do not satisfy these three assumptions. The thresholds are dependent on the varying levels of noise in each block of wavelet coefficients, rather than on a single estimate of the noise as is usually done. This block-dependent method is compared against term-by-term wavelet methods with noise-dependent thresholding via theoretical asymptotic convergence rates as well as by simulations and comparisons on a well-known data set.
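A hedged sketch of block shrinkage with block-specific noise estimates, using PyWavelets on simulated heteroscedastic data; the James-Stein-type shrinkage rule and tuning constants are simplifying assumptions, not the talk's estimator.

```python
# Hedged sketch: block (James-Stein type) shrinkage of wavelet coefficients,
# with the noise level for each block estimated locally from the finest-level
# coefficients. A simplified stand-in for block-dependent thresholding.
import numpy as np
import pywt

rng = np.random.default_rng(11)
n = 1024
x = np.linspace(0, 1, n)
signal = np.sin(8 * np.pi * x) * (x < 0.5) + np.sign(np.sin(3 * np.pi * x)) * (x >= 0.5)
sigma = 0.2 + 0.6 * x                                 # noise level varies along the axis
y = signal + sigma * rng.normal(size=n)

coeffs = pywt.wavedec(y, "sym8", level=5)
finest = coeffs[-1]                                   # mostly noise; used for local noise estimates
block_len, lam = 16, 4.5
for lev in range(1, len(coeffs)):                     # leave the coarse approximation alone
    d = coeffs[lev]
    ratio = len(finest) / len(d)
    for start in range(0, len(d), block_len):
        block = d[start:start + block_len].copy()
        lo, hi = int(start * ratio), int((start + len(block)) * ratio)
        s2 = (np.median(np.abs(finest[lo:hi])) / 0.6745) ** 2   # block-specific noise estimate
        shrink = max(0.0, 1.0 - lam * len(block) * s2 / max(np.sum(block ** 2), 1e-12))
        d[start:start + len(block)] = shrink * block

estimate = pywt.waverec(coeffs, "sym8")[:n]
print("RMSE noisy vs. denoised: %.3f -> %.3f"
      % (np.sqrt(np.mean((y - signal) ** 2)), np.sqrt(np.mean((estimate - signal) ** 2))))
```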
Seminar page

Ying Qing Chen, University of California at Berkeley

Beyond the Proportional Hazards Models

Survival data have been studied extensively in the statistical literature, and a variety of regression models have been developed. Among these models, the most successful is the Cox proportional hazards model. In this talk, we will first discuss the potential limitations of this model and other available models in practical data analysis. We will then present a series of works on alternative modelling strategies and model selection to cope with these limitations. Data from an actual randomized clinical trial will be analyzed throughout the talk to demonstrate the new methodologies.
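A hedged sketch of one standard diagnostic relevant to these limitations: comparing log(-log S(t)) curves from Kaplan-Meier estimates in two groups, which should be roughly parallel when hazards are proportional. The simulated data and evaluation times are illustrative.

```python
# Hedged sketch: a standard graphical check of the proportional hazards
# assumption. If hazards are proportional across two groups, their
# log(-log S(t)) curves differ by a roughly constant vertical shift.
import numpy as np

rng = np.random.default_rng(12)

def km(time, event):
    """Kaplan-Meier survival estimate at the observed times."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    at_risk = len(time) - np.arange(len(time))
    surv = np.cumprod(1 - event / at_risk)
    return time, surv

t0 = rng.exponential(1.0, 300)            # group 0: constant hazard
t1 = rng.weibull(0.6, 300) * 1.5          # group 1: non-proportional hazard
c = rng.exponential(3.0, 300)             # independent censoring

def observe(t):
    return np.minimum(t, c), (t <= c).astype(float)

t_eval = np.array([0.2, 0.5, 1.0, 2.0])
for g, t in enumerate([t0, t1]):
    obs, ev = observe(t)
    times, surv = km(obs, ev)
    idx = np.searchsorted(times, t_eval, side="right") - 1
    s = np.where(idx >= 0, surv[np.maximum(idx, 0)], 1.0)
    print(f"group {g}: log(-log S) at t = {t_eval}:",
          np.round(np.log(-np.log(s)), 2))
```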
Seminar page

Margaret Short, University of Minnesota

Covariate-adjusted spatio-temporal cumulative distribution functions with application to air pollutant data

We provide a fully hierarchical approach to the modeling of spatial cumulative distribution functions (SCDFs), using a Bayesian framework implemented via Markov chain Monte Carlo (MCMC) methods. The approach generalizes the SCDF to accommodate block-level variables, possibly utilizing a spatial change of support model within an MCMC algorithm. We then extend our approach to allow covariate weighting of the SCDF estimate. We further generalize the framework to the bivariate random process setting, which allows simultaneous modeling of both the responses and the weights. Once again MCMC methods (combined with a convenient Kronecker structure) enable straightforward estimates of weighted, bivariate, and conditional SCDFs. A temporal component is added to our model, again implemented with a Kronecker product covariance structure, corresponding to separable correlations. We illustrate our methods with two air pollution data sets, one concerning ozone exposure and race in Atlanta, GA, and the other recording both NO and NO_2 ambient levels at 67 monitoring sites in central and southern California.
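A hedged sketch of the basic object only: unweighted and covariate-weighted empirical SCDFs computed from site-level values. The hierarchical Bayes model, MCMC, and change of support are not reproduced, and the data and weights are hypothetical.

```python
# Hedged sketch: empirical spatial CDFs (unweighted and covariate-weighted)
# computed from site-level values on simulated data.
import numpy as np

rng = np.random.default_rng(13)
n_sites = 67
ozone = rng.gamma(shape=4, scale=15, size=n_sites)       # simulated site-level exposures
pop_weight = rng.uniform(0.2, 1.0, n_sites)              # e.g. population share near each site
pop_weight /= pop_weight.sum()

def scdf(values, z, weights=None):
    """SCDF at level z: (weighted) proportion of the region with value <= z."""
    if weights is None:
        weights = np.full(len(values), 1.0 / len(values))
    return np.sum(weights * (values <= z))

for z in [40, 60, 80]:
    print(f"z = {z}: unweighted SCDF = {scdf(ozone, z):.2f}, "
          f"weighted SCDF = {scdf(ozone, z, pop_weight):.2f}")
```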
Seminar page

Jim Zidek, University of British Columbia

Uncertainty

"Uncertainty", like its complementary cousin, "information", is a much used but not very well defined concept despite its intrinsic role in statistics. (Indeed, that latter is often described as the "science of uncertainty".) In this talk, I will explore some of the meanings (from the manuscript written with Constance van Eeden) that are ascribed to that term and readily discover that seemingly natural questions can have answers that are either elusive or counter-intuitive. For example, surprisingly (in answer to one of those questions), the level of uncertainty (according to one defintion) can actually increase rather than decrease as the amount of information increases. For other definitions we have not been able to give general answers to that question. I will also address the issue of combining information to reduce uncertainty. Specifically, I will survey some recent work including that with Malay Ghosh and Constance van Eeden using the weighted likelihood in conjunction with samples from populations different from, but similar to that under study. That resemblence can lead to very effective trade-offs of bias for precision when it derives from structural relations among the various population parameters, for example, when the difference in the population means may be bounded by a fixed constant.
Seminar page

William Cleveland, Statistics Research, Bell Labs

Statistical Multiplexing: Math and Stat Take Over the Internet

When two hosts communicate over the Internet --- for example, when a Web page is downloaded from a server to a PC --- the two hosts set up a connection and a file is broken up into packets that are transmitted over a path made up of routers connected by transmission links. An Internet link typically carries the packets of many active connections. The packets of the different connections are intermingled on the link; for example, if there are three active connections, the arrival order of 10 consecutive packets by connection number might be 1, 1, 2, 3, 1, 1, 3, 3, 2, and 3. This intermingling is referred to as ``statistical multiplexing'' in the Internet engineering literature, and as ``superposition'' in the literature of point processes. True, network devices put the packets on Internet links and do the multiplexing, but then the mathematical and statistical laws of stochastic processes take over. Extensive empirical and theoretical studies of detailed packet data, inter-arrival and size time series, reverse the commonly held belief that Internet traffic is everywhere long-range dependent, or bursty. The magnitude of the statistical multiplexing has a dramatic effect on the statistical properties of the time series; as the magnitude increases, the burstiness becomes less and less significant. The magnitude needs to become part of the fundamental conceptual framework that guides the study of Internet traffic. These results have critical implications for Internet engineering.
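A hedged sketch of the superposition effect: multiplexing increasing numbers of bursty streams makes the aggregate inter-arrival times less variable and closer to a Poisson stream. The lognormal-renewal traffic model is an illustrative assumption, not the talk's packet data.

```python
# Hedged sketch: superpose increasing numbers of bursty packet streams and
# watch the aggregate inter-arrival times become less variable (closer to a
# Poisson stream, whose coefficient of variation is 1).
import numpy as np

rng = np.random.default_rng(15)

def bursty_stream(n_events, sigma=1.5):
    """Arrival times of one bursty source: lognormal inter-arrival times (CV ~ 2.9)."""
    gaps = rng.lognormal(mean=0.0, sigma=sigma, size=n_events)
    return np.cumsum(gaps / gaps.mean())          # normalize to unit mean gap

for n_streams in [1, 4, 16, 64]:
    arrivals = np.sort(np.concatenate(
        [bursty_stream(2000) for _ in range(n_streams)]))
    gaps = np.diff(arrivals)
    cv = gaps.std() / gaps.mean()                  # = 1 for a Poisson stream
    print(f"{n_streams:3d} multiplexed streams: inter-arrival CV = {cv:.2f}")
```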
Seminar page