Sam Wu, University of Florida

DNA Microarrays and Statistical Challenges

Microarray technology is a breakthrough in genetic science, capable of profiling gene expression patterns at a genome-wide scale in a single experiment. An individual microarray experiment can yield information on the amount of RNA transcribed in each of tens of thousands of genes, and a single study can involve anything from one to hundreds of such experiments. This talk gives a brief introduction to microarrays and describes some of the statistical problems that arise in the area. Most of our examples will be related to the use of microarrays in cancer research and investigation of the yeast cell cycle.

Jim Booth, University of Florida

Sorting Periodically-Expressed Genes Using Microarray Data

The publicly available datasets from yeast are excellent starting points for evaluating and developing statistical tools for handling information from DNA microarray experiments. In this paper we compare two statistical methods, Fourier analysis and singular value decomposition (SVD), for analyzing gene expression ratios in yeast cells as they progress through the cell cycle. We find that analysis of array data using Fourier analysis and using SVD can be carried out simply, without the extensive data manipulations described in previous papers. We propose a circular correlation method for comparing different procedures for ordering genes around a unit circle (i.e., genes that are expressed in a cyclical fashion). The results of applying this method reflect the close relationship between the Fourier and SVD methods. In addition we develop a stochastic search algorithm that can be trained on a set of genes known to be cell cycle-regulated, and then applied as a systematic and data-driven method to classify genes into specific stages of the cell cycle.
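The Fourier ordering idea can be sketched on synthetic data: the phase angle of each gene's Fourier component at the cell-cycle frequency places it on the unit circle. This is only an illustration of the principle, not the authors' code; all data and names below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log-expression ratios: 50 genes sampled at 12 equally spaced
# time points spanning two full cell cycles, each gene peaking at a
# different (known) phase, plus noise.
n_genes, n_times = 50, 12
t = np.arange(n_times)
true_phase = rng.uniform(0, 2 * np.pi, n_genes)
data = (np.cos(4 * np.pi * t / n_times - true_phase[:, None])
        + 0.1 * rng.standard_normal((n_genes, n_times)))

# Fourier component at the cell-cycle frequency (2 cycles over 12 points).
freq_index = 2
coef = np.fft.fft(data, axis=1)[:, freq_index]

# The phase angle of this component orders the genes around the unit circle.
est_phase = (-np.angle(coef)) % (2 * np.pi)
order = np.argsort(est_phase)
```

With this much signal the estimated phases track the true peak times closely, so sorting by `est_phase` recovers the cell-cycle ordering.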

Cong Han, University of Minnesota

Optimal Design for Nonlinear Regression Models

Under antiviral treatment, HIV dynamics can be represented by nonlinear models. Optimal design for such models is concerned with a choice of sampling times so as to achieve high efficiency in parameter estimation. When data are analyzed separately for each patient, fixed-effects nonlinear regression models are appropriate. Analytic results will be presented regarding locally D- and c-optimal designs for a particular model, where the c-optimal design is for estimating a decay rate. When data from different subjects are analyzed simultaneously, a nonlinear mixed-effects model is often used. After a case study in analysis of HIV dynamic data is presented, optimal design issues of such models will also be discussed.
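For a fixed-effects nonlinear model, a locally D-optimal design maximizes the determinant of the Fisher information built from the model's sensitivities at nominal parameter values. A minimal sketch for a one-exponential decay model (illustrative parameter values, not the model from the talk):

```python
import numpy as np
from itertools import combinations

# Locally D-optimal 2-point design for the decay model f(t) = a*exp(-lam*t),
# evaluated at nominal (assumed) parameter values.
a, lam = 100.0, 0.5

def info_det(times):
    """det of the Fisher information (up to sigma^2) for the design `times`."""
    t = np.asarray(times, dtype=float)
    # Sensitivities of f with respect to (a, lam).
    F = np.column_stack([np.exp(-lam * t), -a * t * np.exp(-lam * t)])
    return np.linalg.det(F.T @ F)

# Exhaustive search over pairs of candidate sampling times on [0, 10].
grid = np.linspace(0.0, 10.0, 201)
best = max(combinations(grid, 2), key=info_det)
```

For this model the search recovers the known analytic answer: sample at time 0 and at 1/lam (here 2), which is where the decay-rate sensitivity is most informative.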

Tanya Logvinenko, Stanford University

Sequential Monte Carlo and Dirichlet Mixtures for Extracting Protein Alignment Models

Construction of a multiple sequence alignment can give biologists good structural and functional insight into the protein sequences of a family. Multiple alignments of proteins are used to find the functional patterns characterising protein families, detect homology between new sequences and existing families, predict the secondary and tertiary structure of proteins, and understand the biological roles of a target protein. Multiple sequence alignment (MSA) can be viewed as computation of a posterior mean $E(\Theta \mid S^{(1)}, \ldots, S^{(n)})$, where $\Theta$ is a position-specific profile matrix and $S^{(1)}, \ldots, S^{(n)}$ are the sequences to be aligned. $P(S^{(i)} \mid \Theta)$ can be described by a hidden Markov model, with the ``hidden'' state being an unknown alignment path $A^{(i)}$ that generates $S^{(i)}$ from the (unknown) $\Theta$. I will describe a novel query-centric Bayesian algorithm which applies the sequential Monte Carlo (SMC) framework to align the sequences and create the position-specific profile matrix $\Theta$, regarding the unknown profile and the paths aligning each of the sequences to the profile as missing data and imputing them sequentially. After information from all sequences is incorporated into the final profile, the sequences are realigned to the profile, and a Gibbs sampler is introduced to improve the multiple alignment. The use of SMC and the Gibbs sampler allows the procedure to avoid getting stuck in a local mode. As a base algorithm for aligning new sequences to the profile we will use an extension of hidden Markov model based pairwise sequence alignment, which will be briefly described. To capture the information contained in a column of aligned amino acids our method uses a Dirichlet mixture prior. I will show examples of our MSA method's performance and compare it with other methods.
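The Dirichlet mixture step can be sketched in isolation: given the residue counts in one alignment column, each mixture component is reweighted by its marginal likelihood for those counts, and the posterior mean residue probabilities are a weighted average of the components' posterior means. The sketch uses a made-up two-component mixture over a toy 4-letter alphabet; a real application uses the 20 amino acids and a published mixture.

```python
import numpy as np
from scipy.special import gammaln

# Toy, illustrative Dirichlet mixture prior over a 4-letter alphabet.
q = np.array([0.6, 0.4])                  # mixture weights
alpha = np.array([[5.0, 1.0, 1.0, 1.0],   # component favouring letter 0
                  [1.0, 1.0, 1.0, 5.0]])  # component favouring letter 3

counts = np.array([7.0, 1.0, 0.0, 0.0])   # observed counts in one column

def log_marginal(n, a):
    """log P(counts | Dirichlet(a)), up to the multinomial coefficient
    (which is the same for every component and so cancels below)."""
    return (gammaln(a.sum()) - gammaln(n.sum() + a.sum())
            + np.sum(gammaln(n + a) - gammaln(a)))

# Posterior weight of each mixture component given the counts.
logw = np.log(q) + np.array([log_marginal(counts, a) for a in alpha])
w = np.exp(logw - logw.max())
w /= w.sum()

# Posterior mean probability of each letter in this column.
p = w @ ((counts + alpha) / (counts.sum() + alpha.sum(axis=1))[:, None])
```

Because the counts concentrate on letter 0, the first component dominates the posterior weights and `p` leans strongly toward letter 0.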

Alexandre Carvalho, Northwestern University

Mixtures-of-Experts of Generalized Linear Time Series

We consider a novel class of non-linear models based on mixtures of local generalized linear time series. In our construction, at any given time, we have a certain number of generalized linear models (GLM), denoted experts, where the vector of covariates may include functions of lags of the dependent variable. Additionally, we have a latent variable, whose distribution depends on the same covariates as the experts, that determines which GLM is observed. This structure is considerably flexible, as was shown by Jiang and Tanner in a series of papers for mixtures of GLM with independent observations. For parameter estimation, we show that maximum likelihood (ML) provides consistent and asymptotically normal estimators under certain regularity conditions. We perform some Monte Carlo simulations to study the properties of the ML estimators for finite samples. Finally, we apply the proposed models to study some real examples of time series in Marketing and Finance.
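As a toy illustration of this structure (with made-up parameter values, not the authors' notation), the observation density is a gate-weighted mixture of GLM densities, where both the gate and the experts depend on a lagged value of the series:

```python
import numpy as np
from scipy.stats import poisson

# Two Poisson experts with a logistic gate; the covariate vector is
# (1, y_lag).  All parameter values below are illustrative.
beta = np.array([[0.2, 0.10],    # expert 1: log-mean = 0.2 + 0.10*y_lag
                 [1.5, -0.05]])  # expert 2: log-mean = 1.5 - 0.05*y_lag
gamma = np.array([0.3, -0.2])    # gate: logit of P(expert 1)

def loglik(y, y_lag):
    """Log-likelihood of one count y given the lagged value y_lag."""
    x = np.array([1.0, y_lag])
    g1 = 1.0 / (1.0 + np.exp(-(gamma @ x)))   # gate probability, expert 1
    gates = np.array([g1, 1.0 - g1])
    mus = np.exp(beta @ x)                    # Poisson means of the experts
    # Mixture likelihood: gate-weighted sum of the expert densities.
    return np.log(np.sum(gates * poisson.pmf(y, mus)))
```

ML estimation for the full series would sum such terms over time and maximize over `beta` and `gamma`.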

Sean O'Brien, University of North Carolina

Flexible Versus Parsimonious Regression Models: Asymptotic Relative Efficiency of Chi-Square Tests of Association with Unequal Degrees of Freedom

In many research applications, interest centers on assessing evidence of an association between a quantitative explanatory variable and a discrete or continuous response variable in the presence of covariates. A common strategy for assessing statistical significance is to specify a parametric model for the unknown regression function and then to test whether the term or terms pertaining to the variable of interest can be deleted. For maximizing statistical power, a parsimonious and accurate representation of the effect of the explanatory variable of interest is desirable. If a choice is to be made between parsimony and accuracy, how should one proceed? To help address this question, an explicit closed-form expression is derived for the asymptotic relative efficiency of two chi-square test statistics with different degrees of freedom corresponding to two different strategies for modeling the explanatory variable of interest. This asymptotic result permits numeric evaluation and can be used to develop guidelines for selecting a model in light of uncertainty about the correct functional form. The numeric results shed light on the question of whether or not adding terms to a regression model to reduce the extent of model misspecification will increase or decrease statistical efficiency.
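The trade-off can be made concrete with asymptotic power: each test statistic is approximately noncentral chi-square, and a 1-df test that captures most of the effect can beat a 3-df test that captures all of it. The numbers below are illustrative, not taken from the paper.

```python
from scipy.stats import chi2, ncx2

# Asymptotic power of a level-0.05 chi-square test with `df` degrees of
# freedom and noncentrality `ncp`.
def power(df, ncp, alpha=0.05):
    crit = chi2.ppf(1 - alpha, df)
    return ncx2.sf(crit, df, ncp)

# Parsimonious 1-df model capturing 80% of the noncentrality versus a
# flexible, correctly specified 3-df model capturing all of it.
p1 = power(df=1, ncp=8.0)
p3 = power(df=3, ncp=10.0)
```

Here `p1` exceeds `p3`: the extra degrees of freedom cost more power than the residual misspecification does.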

Zhengjun Zhang, University of North Carolina

Multivariate Extremes, Max-Stable Process Estimation and Dynamic Financial Modeling

Studies have shown that time series data from finance, insurance, the environment and other fields are fat-tailed and clustered when extremal events occur. In an effort to characterize such extremal processes, max-stable and min-stable processes have been proposed since the 1980s, and some of their probabilistic properties have been obtained. However, applications are very limited due to the lack of efficient statistical estimation methods. Recently, the author has established some probabilistic properties of these processes and proposed a series of procedures to estimate the underlying max-stable processes, i.e., multivariate maxima of moving maxima processes. In this talk, I will present some basic properties and estimation procedures for multivariate extremal processes, and illustrate how to model financial data as moving maxima processes. Examples will be illustrated with GE, Citibank and Pfizer stock data.
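The univariate special case of this construction is easy to simulate: a moving maxima process takes, at each time, the largest of a few weighted recent shocks from an i.i.d. unit Fréchet sequence, which produces exactly the clustering of extremes described above. A minimal sketch (weights made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Moving maxima process Y_i = max_k a_k * Z_{i-k}, with Z i.i.d. unit
# Frechet and weights summing to 1, so Y is again unit Frechet marginally.
a = np.array([0.5, 0.3, 0.2])
n = 10_000

# Unit Frechet draws: if U ~ Uniform(0,1) then 1/(-log U) is unit Frechet.
z = 1.0 / -np.log(rng.uniform(size=n + len(a) - 1))

y = np.array([np.max(a * z[i:i + len(a)][::-1]) for i in range(n)])
```

A large shock `z` propagates into several consecutive values of `y`, so exceedances of a high threshold arrive in clusters rather than singly.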

Sujuan Gao, Indiana University School of Medicine

Analysis of Longitudinal Binary Data with Non-Ignorable Missing Values

Many longitudinal dementia studies on elderly individuals suffer from a significant amount of missing data, most of which are due to death of the study participants. It is generally believed that these missing data due to death are non-ignorable for likelihood-based inference. Inference based on data from surviving individuals only may lead to biased results. I will present three approaches for dealing with such missing data in dementia studies. The first adopts the selection model framework, where the probability of missingness is explicitly modeled to depend on the missing outcomes. The second models both the probability of disease and the probability of missingness using shared random effect parameters. Lastly, we set up an illness-death stochastic model to simultaneously estimate disease incidence and mortality rates. Data from a longitudinal dementia study will be used to illustrate all three approaches.

Dylan Small, Stanford University

Overdetermined Estimating Equations with Applications to Panel Data

Panel data has important advantages over purely cross-sectional or time-series data in studying many economic problems, because it contains information about both the intertemporal dynamics and the individuality of the entities being investigated. A commonly used class of models for panel studies identifies the parameters of interest through an overdetermined system of estimating equations. Two important problems that arise in such models are the following: (1) It may not be clear {\it{a priori}} whether certain estimating equations are valid. (2) Some of the estimating equations may only ``weakly'' identify the parameters of interest, providing little information about these parameters and making inference based on conventional asymptotic theory misleading. A procedure based on empirical likelihood for choosing among possible estimators and selecting variables in this setting is developed. The advantages of the procedure over other approaches in the econometric literature are demonstrated through theoretical analysis and simulation studies. Related results on empirical likelihood, the generalized method of moment and generalized estimating equations are also presented.

Brian Smith, University of Iowa

A Bayesian Framework for Analyzing Exposure Data from the Iowa Radon Lung Cancer Study

A Bayesian approach is developed for modeling the association between lung cancer risk and residential radon exposure measured with error. Markov chain Monte Carlo (MCMC) methods are used to fit the model to data from the Iowa Radon Lung Cancer Study. Fast and efficient C++ libraries were written to facilitate the use of MCMC methods with the large Iowa dataset. The proposed methodology introduces a measurement model for the radon process and uses the modeled radon concentrations as the exposure variable in the risk model. Reducing the measurement error in the exposure variable can lead to improved estimates of the relationship between lung cancer risk and radon exposure. The disease and radon processes are modeled jointly so that both the measured radon concentrations and the risk information contribute to the estimated true radon exposures. This joint modeling approach has the potential to improve both the estimated true exposures and the measured association between exposure and disease risk.

Yoonkyung Lee, University of Wisconsin

Multicategory Support Vector Machines

The Support Vector Machine (SVM) has become a popular choice of classification tool in practice. Despite its theoretical properties and its empirical success in solving binary problems, generalization of the SVM to more than two classes has not been obvious. Oftentimes multicategory problems have been treated as a series of binary problems in the SVM paradigm. However, solutions to a series of binary problems may not be optimal for the original multicategory problem. We propose multicategory SVMs, which extend the binary SVM to the multicategory case, and encompass the binary SVM as a special case. The proposed method deals with the equal misclassification cost and the unequal cost case in a unified way. It is shown that the multicategory SVM implements the optimal classification rule for appropriately chosen tuning parameters as the sample size gets large. The effectiveness of the method is demonstrated through simulation studies and real applications to cancer classification problems using gene expression data.
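The key ingredient of such an extension is a multicategory hinge loss. In the style of Lee, Lin and Wahba's formulation (sketched from memory, so treat the details as an assumption), class c is coded as 1 in coordinate c and -1/(k-1) elsewhere, the k decision functions sum to zero, and wrong-class coordinates exceeding -1/(k-1) are penalized:

```python
import numpy as np

def msvm_loss(f, c):
    """Multicategory hinge loss for decision values f (length k, summing
    to ~0) when the true class is c."""
    k = len(f)
    slack = f + 1.0 / (k - 1)
    slack[c] = 0.0   # the true-class coordinate is not penalized
    return np.sum(np.maximum(slack, 0.0))
```

For k = 3 the ideal response for class 0 is (1, -1/2, -1/2), which incurs zero loss, while putting the largest value on a wrong coordinate is penalized; with k = 2 the loss reduces to the usual binary hinge loss, matching the claim that the binary SVM is a special case.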

Zhengyuan Zhu, University of Chicago

Design and Inference for Gaussian Random Fields

Gaussian random fields (GRFs) can be used to model many physical processes in space. In this talk we present two kinds of results for GRFs: spatial sampling design and covariance parameter estimation. We study spatial sampling design for prediction of stationary isotropic GRFs with estimated parameters of the covariance function. The key issue is how to incorporate the parameter uncertainty into the design criteria. Several possible design criteria are discussed. An annealing algorithm is used to search for optimal designs of small sample size and a two-step algorithm is proposed for moderately large sample sizes. Simulation results are presented for the Mat\'ern class of covariance functions. The inference issue we consider is the asymptotic properties of estimates of parameters of fractional Brownian motion. We give the fixed-domain asymptotic distributions of both least squares and maximum likelihood estimates, which are different from the more standard increasing-domain asymptotic results. We discuss why these results should still apply when the process is not fractional Brownian motion but instead a GRF with covariance function in the Mat\'ern class.
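The Mat\'ern class referred to above is straightforward to evaluate. A minimal sketch in one common parameterization (variance `sigma2`, range `phi`, smoothness `nu`; several equivalent parameterizations are in use):

```python
import numpy as np
from scipy.special import gamma, kv

def matern(h, sigma2=1.0, phi=1.0, nu=0.5):
    """Matern covariance at distances h (kv is the modified Bessel
    function of the second kind)."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    out = np.full(h.shape, sigma2)        # covariance at distance 0
    pos = h > 0
    u = np.sqrt(2 * nu) * h[pos] / phi
    out[pos] = sigma2 * (2 ** (1 - nu) / gamma(nu)) * u ** nu * kv(nu, u)
    return out
```

For nu = 1/2 this reduces to the exponential covariance sigma2 * exp(-sqrt(2*nu)*h/phi), a standard sanity check on the parameterization.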

Yun-Xin Fu, University of Texas at Houston

Some Statistical and Computational Challenges of Population Genetics After The Human Genome Project

Genomic study has entered a new era with the completion of the first draft of the human genome. The human genome harbors millions of single nucleotide polymorphic sites (SNPs), whose identification and characterization is an area of intense research and competition. SNPs are and will be used in many biomedical and population studies, and it is fundamentally important to understand their statistical properties. Coalescent theory, a branch of theoretical population genetics, offers excellent tools for studying SNPs. I will review basic components of the theory and recent advances that are relevant for studying SNPs. I will discuss the problem of ascertainment bias, which arises from the use of a small sample to detect SNPs, and present a couple of examples of the application of coalescent theory. Because many analyses of genomic data require considerable computational resources, I will also discuss a solution for building a powerful yet inexpensive computing farm based on Java technology.
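A basic component of the theory is easy to simulate: while k ancestral lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2 (time in units of 2N generations), and the expected time to the most recent common ancestor of n lineages is 2(1 - 1/n). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

def sim_tmrca(n):
    """Simulate the time to the most recent common ancestor of n lineages."""
    t = 0.0
    for k in range(n, 1, -1):
        # While k lineages remain, coalescence rate is k*(k-1)/2.
        t += rng.exponential(scale=2.0 / (k * (k - 1)))
    return t

# Check E[TMRCA] = 2*(1 - 1/n) by simulation for n = 10.
times = np.array([sim_tmrca(10) for _ in range(20_000)])
```

Simulations like this, restricted to genealogies on which a polymorphism was detectable in the small discovery sample, are also the natural way to study ascertainment bias numerically.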

Florentina Bunea, Florida State University

Penalty Choices and Consistent Covariate Selection in Semiparametric Models

We suggest a model selection approach for estimation in semiparametric regression models and investigate the compatibility of the following optimality aspects: consistent covariate selection of the parametric component, asymptotic normality of the {\it selected} estimator of the parametric part and adaptive estimation of the nonparametric component. We show that these goals cannot be attained simultaneously by a direct extension of standard parametric or nonparametric model selection methods. We introduce a new type of penalization, tailored to semiparametric models and present one and two stage estimation procedures, discussing their respective merits. Examples of models to which this methodology applies include: partially linear regression, generalized semilinear regression models and semiparametric hazard function regression models. We illustrate our method and present a simulation study for the partially linear regression model.

Ron Randles, University of Florida

What do we MEAN by the MEDIAN?

This talk is a survey of multivariate extensions of the univariate median. Particular emphasis is placed on multivariate generalizations that have the property of affine-equivariance, the benefits of which are explained. A new multivariate median is discussed which has some desirable properties, the most important of which is that it is easy to compute in common dimensions.
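One classical multivariate extension (not the new median proposed in the talk) is the spatial, or L1, median: the point minimizing the sum of Euclidean distances to the data. It is orthogonally but not fully affine-equivariant, which illustrates why affine equivariance is a non-trivial property. A sketch via the standard Weiszfeld iteration:

```python
import numpy as np

def spatial_median(X, iters=200):
    """Weiszfeld iteration for the spatial (L1) median of rows of X."""
    m = X.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(X - m, axis=1)
        d = np.where(d < 1e-12, 1e-12, d)   # guard against zero distances
        w = 1.0 / d
        # Each step is a distance-weighted average of the observations.
        m = (w[:, None] * X).sum(axis=0) / w.sum()
    return m
```

For symmetric data the spatial median sits at the center of symmetry; under a non-orthogonal linear transformation of the data, however, it does not simply transform along, which affine-equivariant medians do.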

Mike Daniels, Iowa State University

Modelling Dependence in Longitudinal Data

Modelling the dependence structure in longitudinal data is often not given as much attention as modelling the mean structure. However, carefully modelling the dependence can be important in a variety of situations. Clearly, if the goal is prediction, such modelling is very important. In addition, when there is missing data that is thought to be missing at random (MAR) or non-ignorable, incorrectly modelling the dependence can result in biased estimates of fixed effects. Bias can also be a problem when the random effects covariance matrix is mis-modelled in generalized linear mixed models. In general, correctly modelling the dependence will increase the efficiency of estimators for fixed effects. We will propose several approaches to model dependence. In particular, we will discuss parameterizations of, and prior distributions for, the covariance matrix in the context of shrinking toward a structure and in developing parametric models with covariates (heterogeneous covariance structures). The results of simulations to assess these approaches will be reported and computationally efficient approaches to compute estimators and fit models will be proposed. Some of these approaches will be applicable for modelling both the 'marginal' covariance matrix (applicable for continuous normal data) and the random effects covariance matrix (applicable for continuous and discrete data). Data collected from a series of depression trials will be used to illustrate these models.
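One widely used unconstrained parameterization of a longitudinal covariance matrix (a natural candidate for the kind of covariate modelling described above, though not necessarily the talk's proposal) is the modified Cholesky decomposition T Sigma T' = D, whose entries are generalized autoregressive coefficients and innovation variances:

```python
import numpy as np

def modified_cholesky(sigma):
    """Return unit lower-triangular T and innovation variances D with
    T @ sigma @ T.T = diag(D)."""
    p = sigma.shape[0]
    T = np.eye(p)
    D = np.zeros(p)
    D[0] = sigma[0, 0]
    for j in range(1, p):
        # Regress measurement j on its predecessors.
        phi = np.linalg.solve(sigma[:j, :j], sigma[:j, j])
        T[j, :j] = -phi
        D[j] = sigma[j, j] - sigma[:j, j] @ phi
    return T, D
```

The autoregressive coefficients and log innovation variances are unconstrained, so they can be shrunk toward a structure or regressed on covariates without violating positive definiteness.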

Xuelin Huang, University of Michigan

Analysis of Correlated Survival Data With Dependent Censoring

In biomedical studies, patient survival time may be censored by many events. Some censoring events, such as a patient's voluntary withdrawal or change of treatment, may be related to the potential failure. This is called dependent censoring in survival analysis. However, because of the identifiability problem, these events are often assumed to be independent of failure, even though this false assumption can lead to biased inference. It is also common for the subjects in a study to be naturally clustered: for example, they may be patients from multiple medical centers, family members, or litter mates. Subjects in the same cluster share some common factors, so their survival outcomes are likely to be correlated rather than independent of each other. In these cases, we have correlated survival data. In my talk, I will consider these two problems (dependent censoring and correlated data) simultaneously. I developed a test to check whether dependent censoring is present, and a model to analyze correlated survival data with dependent censoring. The EM algorithm is used to fit the model. In the E-steps, integrals without a closed form are evaluated by Markov chain Monte Carlo; the Metropolis-Hastings algorithm is used to construct a Markov chain with the desired stationary distribution. Simulation studies and an analysis of a real data set of kidney disease patients are provided.
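The Metropolis-Hastings step can be sketched on a toy target (a standard normal stands in for the actual E-step posterior, which is specific to the model in the talk): proposals are random-walk moves, accepted with probability min(1, pi(prop)/pi(x)).

```python
import numpy as np

rng = np.random.default_rng(4)

def mh_chain(n, step=1.0):
    """Random-walk Metropolis-Hastings chain targeting a standard normal."""
    x = 0.0
    out = np.empty(n)
    for i in range(n):
        prop = x + step * rng.standard_normal()
        # log acceptance ratio for the N(0,1) target: (x^2 - prop^2)/2.
        if np.log(rng.uniform()) < 0.5 * (x * x - prop * prop):
            x = prop
        out[i] = x
    return out

draws = mh_chain(50_000)
```

Averages over such draws then replace the intractable E-step integrals, at the cost of Monte Carlo error that shrinks as the chain grows.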

Warren Gilchrist, Sheffield Hallam University (U.K.)

A New/Old Approach to Statistical Modelling

In the 1880s and 1890s Francis Galton and Karl Pearson argued about the foundations of Statistics and the best way forward. Pearson won. As often, later history was re-written from the view of the victors. This talk explores a possible alternative history, had Galton won. The focus is particularly on statistical modelling.

Roger Berger, North Carolina State University

Stepwise Intersection-Union Tests with Applications in Biostatistics

The stepwise intersection-union test method allows multiple hypotheses to be tested with some hypotheses possibly rejected and others not rejected. This method of multiple testing is particularly simple because each hypothesis is tested with an alpha level test, but the familywise error rate of the method is also alpha. In this talk, the stepwise intersection-union test method will be explained, and its use will be illustrated in several biostatistical applications. These applications include testing primary, secondary, and tertiary outcomes; determining a minimum effective dose; and estimating the onset and duration of a treatment effect.
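As a rough illustration of the mechanics (a fixed-sequence special case, not Berger's exact formulation): hypotheses are ordered a priori, each is tested at level alpha, and testing stops at the first non-rejection, which is what keeps the familywise error rate at alpha.

```python
def stepwise_test(p_values, alpha=0.05):
    """p_values: list of (name, p) pairs in priority order.
    Returns the names of the rejected hypotheses."""
    rejected = []
    for name, p in p_values:
        if p <= alpha:
            rejected.append(name)
        else:
            break   # stop: later hypotheses are not tested
    return rejected

result = stepwise_test([("primary", 0.01), ("secondary", 0.03),
                        ("tertiary", 0.20)])
```

Here the primary and secondary outcomes are rejected at level 0.05 but the tertiary is not, and no multiplicity adjustment of the individual alpha levels is needed.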

Walter Piegorsch, University of South Carolina

Quantifying Environmental Risk via Low-Dose Extrapolation

Statistical models and methods are discussed for quantifying the risk associated with exposure to environmental hazards. Attention is directed at problems in dose-response/concentration-response modeling when low-dose risk estimation is a primary goal. Possible low-dose measures include observed-effect levels, effective doses/effective concentrations, added risk/extra risk measures, and benchmark doses.
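A benchmark dose computation can be sketched for a logistic dose-response model (parameter values are illustrative): the benchmark dose is the dose at which extra risk, (P(d) - P(0)) / (1 - P(0)), reaches a benchmark level such as 0.10.

```python
import numpy as np
from scipy.optimize import brentq

# Illustrative logistic dose-response parameters.
b0, b1 = -3.0, 0.5

def p(d):
    """Probability of an adverse response at dose d."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * d)))

def extra_risk(d):
    """Risk at dose d beyond background, rescaled by 1 - background."""
    return (p(d) - p(0.0)) / (1.0 - p(0.0))

# Benchmark dose: solve extra_risk(d) = 0.10 on a bracketing interval.
bmd = brentq(lambda d: extra_risk(d) - 0.10, 0.0, 20.0)
```

A lower confidence limit on `bmd`, rather than the point estimate itself, is what is typically carried forward in risk assessment.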

Linda Young, University of Nebraska

