Jeff Gill,   Dept. of Government,   Harvard University.

Dirichlet Process Priors for Bayesian Models with an Application to Bureaucratic Politics.

In this paper, we advocate greater attention to the use of prior specifications in the statistical analysis of social science data, recommend a specific nonparametric variant based on a mixture of Dirichlet processes that has not yet been applied in the social sciences, and develop a new Bayesian stochastic simulation procedure to handle the resulting estimation challenges. We apply the Dirichlet process prior to a Bayesian hierarchical model for ordered choices to address a difficult question in bureaucratic politics. These Dirichlet mixtures represent a new paradigm for semi-informed prior information that reflects both information from observations and researcher intuition, where neither dominates. (This is joint work with G. Casella.)
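
For orientation, here is a generic Dirichlet process mixture prior in standard notation; this is a sketch of the general construction, not necessarily the exact specification used in the talk.

```latex
% Generic Dirichlet process mixture prior (illustrative; not the talk's exact specification).
% G is a random mixing distribution centered at a base measure G_0 with concentration \alpha.
\begin{align*}
  G &\sim \mathrm{DP}(\alpha, G_0), \\
  \theta_i \mid G &\overset{\text{iid}}{\sim} G, \\
  y_i \mid \theta_i &\sim f(y \mid \theta_i), \qquad i = 1, \dots, n.
\end{align*}
```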

Youngjo Lee,   Dept. of Statistics,   Seoul National University.

Beyond Generalized Linear Models.

We introduce joint generalized linear models, hierarchical generalized linear models, and double hierarchical generalized linear models, obtained by adding new features to the original generalized linear models. We discuss how these model features can be used in the analysis of data.
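
A brief sketch of the hierarchical structure involved, in generic notation assumed for illustration (the talk's notation and distributional choices may differ):

```latex
% Sketch of a hierarchical GLM: a conditional GLM given random effects v, plus a model for v.
\begin{align*}
  g\bigl(E[y \mid v]\bigr) &= X\beta + Zv,   % conditional GLM given random effects v
  \qquad v \sim p(v \mid \lambda).           % higher-level (random-effect) model
\end{align*}
% Joint GLMs additionally model the dispersion with its own linked GLM; double hierarchical GLMs
% allow random effects in the dispersion models as well.
```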

Chunpeng Fan,   Dept. of Statistics,   University of Wisconsin.

Optimal Inferences for Proportional Hazards Model with Parametric Covariate Transformations.

The traditional Cox proportional hazards model assumes a log-linear relationship between covariates and the underlying hazard function. In real data, however, this linear relationship may not hold. One example is data from a cancer clinical trial, the National Surgical Adjuvant Breast and Bowel Project (NSABP) study B-20, in which the relationship between the log hazard and the progesterone receptor level is nonlinear when analyzing disease-free survival. We propose a generalized Cox model that allows parametric covariate transformations to recover the linearity. Although the proposed generalization may seem rather simple, the inferential issues are quite challenging because of the loss of identifiability under the null hypothesis of no effect of the transformed covariates. Optimal tests against certain alternatives are derived for this null hypothesis. Rigorous inference for the parameters and the unspecified baseline hazard function is established when regularity conditions hold and the transformed covariates have non-zero effects. The estimates and tests perform well in simulation studies with realistic sample sizes, and the proposed tests are generally more powerful than the usual partial likelihood ratio test with fixed transformations or the sup partial likelihood ratio test. Protocol B-20 data are used to illustrate the model-building procedure and the improved fit of the proposed model compared to the traditional Cox model.
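
A minimal sketch of the kind of generalization described, with notation assumed here rather than taken from the paper:

```latex
% The standard Cox model
\[
  \lambda(t \mid Z) = \lambda_0(t)\, \exp(\beta^{\top} Z)
\]
% is extended by a parametric transformation g(\cdot;\gamma) of (some of) the covariates,
\[
  \lambda(t \mid Z) = \lambda_0(t)\, \exp\bigl\{\beta^{\top} g(Z; \gamma)\bigr\},
\]
% so that when \beta = 0 the transformation parameter \gamma drops out of the likelihood,
% which is the loss of identifiability under the null referred to in the abstract.
```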

Jiguo Cao,   Dept. of Epidemiology and Public Health,   Yale University.

The Parameter Cascade Method and its Applications in Estimating Air Pollution Models, Fitting HIV Dynamical Models, and Constructing Gene Regulation Network.

Many statistical models involve three levels of parameters: (i) nuisance parameters, which are required to construct the model but are not of direct interest; (ii) structural parameters, which are of primary concern; and (iii) complexity parameters, which control the effective degrees of freedom of the model. Current methods for estimating these models are computationally intensive and impractical for non-expert users. In this talk, we introduce a new method, the parameter cascade method, which estimates parameters in three nested levels of optimization, defining the nuisance parameters as regularized functions of the structural parameters, and the structural parameters in turn as functions of the complexity parameters. This approach has several unique aspects. First, the computation is fast and stable, with the gradients and Hessian matrices worked out analytically. Second, unconditional variance estimates are obtained, which include the uncertainty arising from the other parameter estimates. Finally, each level allows a different optimization criterion; otherwise, biases in the parameter estimates may be induced.
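
A schematic of the three nested levels of optimization, with generic criteria assumed for illustration (c = nuisance, theta = structural, lambda = complexity parameters):

```latex
\begin{align*}
  \hat{c}(\theta, \lambda) &= \arg\min_{c}\; J(c \mid \theta, \lambda), \\
  \hat{\theta}(\lambda)    &= \arg\min_{\theta}\; H\bigl(\theta, \hat{c}(\theta, \lambda)\bigr), \\
  \hat{\lambda}            &= \arg\min_{\lambda}\; F\bigl(\hat{\theta}(\lambda),
                                 \hat{c}(\hat{\theta}(\lambda), \lambda)\bigr),
\end{align*}
% with a (possibly different) criterion J, H, F at each level, and analytic gradients and Hessians
% obtained by applying the chain rule through the implicitly defined functions \hat{c} and \hat{\theta}.
```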

The parameter cascade method will be illustrated by estimating generalized semiparametric additive models for air pollution data, fitting HIV dynamical models (ordinary differential equations) to clinical trials, and constructing gene regulation networks from time course microarray data.

Xiaomin Lu,   Dept. of Statistics,   North Carolina State University.

Improving the Efficiency of the Logrank Test Using Auxiliary Covariates.

The logrank test is widely used in clinical trials for comparing the survival distributions of two treatments with censored survival data. Under the assumption of proportional hazards, it is optimal for testing the null hypothesis H0 : β = 0, where β denotes the logarithm of the hazard ratio. In practice, additional auxiliary covariates are collected along with the survival times and treatment assignment. If these covariates are correlated with the survival times, making use of their information can increase the efficiency of the logrank test. In this paper, we apply semiparametric theory to characterize a class of regular and asymptotically linear estimators for β when auxiliary covariates are incorporated into the model, and we derive estimators that are more efficient. The Wald tests induced by these estimators are shown to be more powerful than the logrank test. Simulation studies and a real data set from ACTG 175 are used to illustrate the gains in efficiency.
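
A hedged sketch of the covariate-augmentation idea for a randomized trial, in generic notation that is assumed here and not taken from the paper:

```latex
% The logrank score statistic for H_0: \beta = 0 is augmented by a term with mean zero under
% randomization,
\[
  T_{\mathrm{aug}} = T_{\mathrm{logrank}} - \sum_{i=1}^{n} (A_i - \pi)\, h(W_i),
\]
% where A_i is the treatment indicator, \pi the randomization probability, W_i the auxiliary
% covariates, and h(\cdot) is chosen (estimated) to minimize the variance of T_{\mathrm{aug}}.
% Because the added term has mean zero, validity is preserved while variance is reduced.
```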

Per Mykland,   Dept. of Statistics,   University of Chicago.

Financial Data and the Hidden Semimartingale Model.

The availability of high frequency data for financial instruments has opened the possibility of accurately determining volatility in small time periods, such as one day. Recent work on such estimation indicates that it is necessary to analyze the data with a hidden semimartingale model, typically by the addition of measurement error. We review the emerging theory on this subject, including two- and multiscale sampling. We also consider broader error schemes, through Markov kernels and such phenomena as rounding due to discreteness of prices. Finally, we discuss the possibility of adapting likelihood theory to inference problems of this type.
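
The basic hidden semimartingale setup with additive measurement error, written in commonly used notation for context:

```latex
\[
  Y_{t_i} = X_{t_i} + \varepsilon_i, \qquad i = 0, 1, \dots, n,
\]
% where the latent efficient (log) price X is a semimartingale, e.g. dX_t = \mu_t\,dt + \sigma_t\,dW_t,
% and the target of inference is the integrated volatility \int_0^T \sigma_t^2\, dt. Two-scale and
% multiscale estimators combine realized variances computed on sparse and dense sampling grids so
% that the contribution of the noise \varepsilon cancels.
```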

Arthur Berg,   Dept. of Mathematics,   University of California (San Diego).

Higher-Order Accurate Nonparametric Function Estimation.

Higher-order accurate nonparametric estimation via infinite-order kernels will be discussed with special attention given to polyspectra estimation in Time Series and hazard function estimation in Survival Analysis. A novel bandwidth selection algorithm and an interesting connection between polyspectra and group representations will also be addressed.
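
As background, one standard construction of an infinite-order ("flat-top") kernel; the talk's exact choice may differ. The kernel is defined through its Fourier transform:

```latex
\[
  \hat{\kappa}(s) =
    \begin{cases}
      1, & |s| \le c, \\
      \text{smooth decay to } 0, & |s| > c,
    \end{cases}
\]
% so \hat{\kappa} is exactly flat in a neighbourhood of the origin. For sufficiently smooth targets,
% the bias of the resulting estimator then shrinks faster than any polynomial power of the bandwidth,
% which is the sense in which such estimators are higher-order accurate.
```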

Wenxuan Zhong,   Dept. of Statistics,   Harvard University.

Variable Selection Using Single Index Models for Motif Discovery.

Information for regulating a gene's transcription is contained in the conserved patterns (motifs) on the upstream/downstream DNA sequence (promoter region) close to the target gene. By combining the information contained in both gene expression measurements and genes' promoter sequences, we propose a novel procedure for identifying functionally active motifs under certain stimuli. A nonlinear regression model, the single index model, is used to associate a gene's promoter sequence information with its mRNA expression measurements. Single index models postulate that the response variable y depends on a unique linear combination of the predictors X through an unknown link function f: y = f(Xβ, ε), where β is an index vector and ε represents measurement error. In this talk, we will describe computationally efficient variable selection procedures and criteria that we developed for the single index model within a profile likelihood framework. We will also demonstrate the advantages of these methods both theoretically and empirically. Compared with existing methods, our proposed procedures can greatly improve variable selection sensitivity and specificity.
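
A hedged sketch of profile estimation in a single index model; the criteria and penalty used in the talk differ in detail, and the notation here is assumed for illustration:

```latex
% For a fixed index \beta, the link f is estimated by smoothing y on the scalar X\beta, and \beta
% is then chosen by a penalized profiled criterion that performs the variable selection:
\[
  \hat{\beta} = \arg\max_{\beta}\;
    \Bigl\{ \ell\bigl(\hat{f}_{\beta}, \beta\bigr) - \sum_{j=1}^{p} p_{\lambda}\bigl(|\beta_j|\bigr) \Bigr\},
  \qquad \hat{f}_{\beta} = \text{nonparametric smooth of } y \text{ on } X\beta,
\]
% where p_{\lambda} is a sparsity-inducing penalty.
```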

Christian Robert,   Centre de Recherche en Mathématiques de la Décision,   Université de Paris - Dauphine.

Bayesian k-nearest neighbour classification.

The k-nearest neighbour procedure uses a training dataset {(y1,x1), . . . , (yn,xn)} to make predictions on new unlabelled data, where yi ∈ {C1, . . . , CG} denotes the class label of the i-th point and xi denotes a vector of p predictor variables. The prediction for a new point (yn+1,xn+1) is reported as the most common class found amongst the k nearest neighbours of xn+1 in the set {x1, . . . , xn}. The neighbours of a point are defined via a distance metric ρ(xn+1,xi), which is commonly taken to be the Euclidean norm. The k-nearest-neighbour algorithm is a nonparametric procedure. Traditionally, the value of k is chosen by minimizing the cross-validated misclassification rate. Here we propose some alternative approaches using probability models and Bayesian perspectives. A novel perfect sampling algorithm for resolving difficulties linked with MRF normalizing constants is introduced along the way. (This is joint work with G. Celeux, J.-M. Marin and D.M. Titterington.)
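
A minimal sketch of the classical procedure that the Bayesian approach revisits: k-nearest-neighbour classification with k chosen by cross-validated misclassification rate. All names here are illustrative, and this is not the authors' code.

```python
import numpy as np

def knn_predict(x_new, X, y, k):
    """Predict the label of x_new as the most common class among its k nearest neighbours."""
    dists = np.linalg.norm(X - x_new, axis=1)      # Euclidean metric rho(x_new, x_i)
    neighbours = y[np.argsort(dists)[:k]]          # labels of the k closest training points
    labels, counts = np.unique(neighbours, return_counts=True)
    return labels[np.argmax(counts)]               # majority vote

def cv_choose_k(X, y, candidate_ks):
    """Choose k by leave-one-out cross-validated misclassification rate (X, y as numpy arrays)."""
    n = len(y)
    error_rates = []
    for k in candidate_ks:
        miss = sum(knn_predict(X[i], np.delete(X, i, axis=0), np.delete(y, i), k) != y[i]
                   for i in range(n))
        error_rates.append(miss / n)
    return candidate_ks[int(np.argmin(error_rates))]
```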

Subharup Guha,   Dept. of Biostatistics,   Harvard University.

Gauss-Seidel Estimation of Generalized Linear Mixed Models with Application to Poisson Modeling of Spatially Varying Disease Rates.

Generalized linear mixed models (GLMMs) provide an elegant framework for the analysis of correlated data. Because the likelihood is not available in closed form, GLMMs are often fit by computational procedures such as penalized quasi-likelihood (PQL). Special cases of these models are generalized linear models (GLMs), which are often fit using algorithms such as iteratively weighted least squares (IWLS). High computational costs and memory constraints often make it difficult to apply these iterative procedures to data sets with a very large number of cases.

We propose a computationally efficient strategy based on the Gauss-Seidel algorithm that iteratively fits sub-models of the GLMM to subsetted versions of the data. Additional gains in efficiency are achieved for Poisson models, commonly used in disease mapping problems, because of their special collapsibility property which allows data reduction through summaries. The strategy is applied to investigate the relationship between ischemic heart disease, socioeconomic status and age/gender category in New South Wales, Australia, based on outcome data consisting of approximately 33 million records.
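
A schematic of a Gauss-Seidel (blockwise) fitting cycle, in generic notation assumed for illustration rather than taken from the paper:

```latex
% Partition the parameters as \theta = (\theta_1, \dots, \theta_K) and, at sweep t, update
\[
  \theta_k^{(t)} = \arg\max_{\theta_k}\;
    \ell\bigl(\theta_1^{(t)}, \dots, \theta_{k-1}^{(t)}, \theta_k,
              \theta_{k+1}^{(t-1)}, \dots, \theta_K^{(t-1)}\bigr),
  \qquad k = 1, \dots, K,
\]
% where each update fits only the sub-model (and touches only the data subset or summary)
% relevant to \theta_k, so no single step requires holding the full data set in memory.
```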

Shaoli Wang,   Dept. of Epidemiology and Public Health,   Yale University.

Directional Regression for Dimension Reduction.

Dimensionality is a major concern in many modern statistical problems. In regression analysis, dimension reduction means reducing the dimension of the predictors without loss of information about the regression. Dimension reduction proves particularly useful during the model development and criticism phases, as it usually does not require any pre-specified parametric model for the regression. We propose a Directional Regression (DR) method for dimension reduction. This novel method naturally synthesizes dimension reduction methods based on the first two conditional moments, such as Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE), and in doing so combines the advantages of these methods. Under mild conditions, it provides an exhaustive estimate of the Central Dimension Reduction Subspace (CDRS). We also derive the asymptotic distribution of the Directional Regression estimator, and thereby establish a sequential test procedure to determine the dimension of the CDRS. Directional Regression is compared with existing methods via simulation. An application to a handwritten digit recognition problem is also presented.
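
For context, the target of sufficient dimension reduction in standard notation:

```latex
% The central dimension reduction subspace is the smallest subspace spanned by the columns of a
% matrix B such that
\[
  Y \perp\!\!\!\perp X \mid B^{\top} X,
\]
% i.e. the response depends on the predictors only through the lower-dimensional projection
% B^{\top} X. SIR and SAVE estimate this space using first and second conditional moments,
% respectively; directional regression combines both sources of information.
```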

George Casella,   Dept. of Statistics,   University of Florida.

Objective Bayes Variable Selection: Some Methods and Some Theory.

A fully automatic Bayesian procedure for variable selection in the normal regression model has been developed, which uses the posterior probabilities of the models to drive a stochastic search. The posterior probabilities are computed using intrinsic priors, which are default priors in that they are derived from the model structure and are free of tuning parameters. The stochastic search is based on a Metropolis-Hastings algorithm with a stationary distribution proportional to the model posterior probabilities. The performance of the search procedure is illustrated on both simulated and real examples, where it is seen to perform admirably. However, until recently such data-based evaluations were the only performance evaluations available.
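
A generic Metropolis-Hastings stochastic search over variable-inclusion indicators, sketched here to illustrate the kind of algorithm described; the model posterior probability is supplied by a user function (the intrinsic-prior computation itself is not reproduced), and all names and the single-flip proposal are illustrative assumptions.

```python
import numpy as np

def mh_model_search(log_post_prob, p, n_iter=10000, seed=None):
    """Metropolis-Hastings random walk over inclusion indicators gamma in {0,1}^p whose
    stationary distribution is proportional to exp(log_post_prob(gamma))."""
    rng = np.random.default_rng(seed)
    gamma = rng.integers(0, 2, size=p)             # random starting model
    current = log_post_prob(gamma)
    visits = {}                                    # visit counts approximate model probabilities
    for _ in range(n_iter):
        proposal = gamma.copy()
        j = rng.integers(p)
        proposal[j] ^= 1                           # flip one inclusion indicator (symmetric proposal)
        candidate = log_post_prob(proposal)
        if np.log(rng.random()) < candidate - current:
            gamma, current = proposal, candidate   # accept the proposed model
        key = tuple(int(g) for g in gamma)
        visits[key] = visits.get(key, 0) + 1
    return visits
```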

It has long been known that for the comparison of pairwise nested models, a decision based on the Bayes factor produces a consistent model selector (in the frequentist sense), and we are now able to extend this result and show that for a wide class of prior distributions, including intrinsic priors, the corresponding Bayesian procedure for variable selection in normal regression is consistent in the entire class of normal linear models. The asymptotics of the Bayes factors for intrinsic priors are equivalent to those of the Schwarz (BIC) criterion, and allow us to examine what limiting forms of proper prior distributions are suitable for testing problems. Intrinsic priors are limits of proper prior distributions, and a consequence of our results is that they avoid Lindley's paradox. The asymptotics further allow us to examine some selection properties of the intrinsic Bayes rules, where it is seen that simpler models are clearly preferred. (This is joint work with Javier Girón and Elías Moreno.)

Mike Daniels,   Dept. of Statistics,   University of Florida.

Joint Models for the Association of Longitudinal Binary and Continuous Processes with Application to a Smoking Cessation Trial.

Joint models for the association of a longitudinal binary and a longitudinal continuous process are proposed for situations where their association is of direct interest. The models are parameterized such that the dependence between the two processes is characterized by unconstrained regression coefficients. Bayesian variable selection techniques are used to parsimoniously model these coefficients. An MCMC sampling algorithm is developed for sampling from the posterior distribution, using data augmentation steps to handle missing data. Several technical issues are addressed to implement the MCMC algorithm efficiently. The models are motivated by, and are used for, the analysis of a smoking cessation clinical trial in which an important question of interest was the effect of the (exercise) treatment on the relationship between smoking cessation and weight gain. (This is joint work with Xuefeng Liu.)
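
One common way to set up such a joint model, sketched for orientation only; the parameterization used in the talk may differ, and the symbols here are assumptions:

```latex
% With S_t the binary outcome (e.g. cessation status) and W_t the continuous outcome (e.g. weight)
% at time t, factor the joint distribution as
\[
  f(S_t, W_t \mid X) = f(S_t \mid X)\; f(W_t \mid S_t, X),
\]
% where the regression coefficients on S_t in the conditional model for W_t are unconstrained,
% directly quantify the association between the two processes, and can themselves be subjected
% to variable selection.
```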

Zhao-Bang Zeng,   Dept. of Statistics,   North Carolina State University.

Gene Expression Quantitative Trait Loci Analysis.

Quantitative Trait Loci (QTL) analysis has transformed the study of the genetic basis of complex traits. With the availability of dense genome-wide molecular markers, it is now possible to map many biologically, medically, or agriculturally important QTL to genomic regions for further study. However, QTL studies face a major limitation: the genomic regions for many mapped QTL are too large for the identification of causal genes. Recently, gene expression microarray technology has been used in combination with molecular markers to aid gene identification and to help elucidate the genetic pathways underlying complex traits. In this talk, we will outline the statistical problems and discuss the challenges of gene expression QTL analysis. In particular, we will discuss our recent research results on multiple interval mapping for eQTL and the eQTL Viewer.

Bin Han,   Dept. of Statistics,   Pennsylvania State University.

A Bayesian Approach to False Discovery Rate for Large Scale Simultaneous Inference.

Microarray data and other applications have inspired many recent developments in the area of large-scale inference. For microarray data, the number of tests of differential gene expression ranges from 1,000 to 100,000. The traditional family-wise type I error rate (FWER) is overly stringent in this context because of the large scale of simultaneity. More recently, the false discovery rate (FDR) was defined as the expected proportion of type I errors among the rejections. Controlling the less stringent FDR criterion incurs less loss in detection capability than controlling the FWER and hence is preferable for large-scale multiple testing. From the Bayesian point of view, the posterior versions of the FDR and of the false nondiscovery rate (FNR) are easier to study. We study Bayesian decision rules to control the Bayes FDR and FNR. A hierarchical mixture model is developed to estimate the posterior probabilities of the hypotheses. The posterior distribution can also be used to estimate the false discovery percentage (FDP), defined as the integrand of the FDR. The model, in conjunction with Bayesian decision rules, displays satisfying performance in simulations and in the analysis of the Affymetrix HG-U133A spike-in data.
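
The quantities referred to above, written out in standard notation for reference:

```latex
% With R rejections, of which V are false,
\[
  \mathrm{FDP} = \frac{V}{\max(R, 1)}, \qquad \mathrm{FDR} = E\bigl[\mathrm{FDP}\bigr],
\]
% and the posterior (Bayes) version averages the posterior null probabilities over the rejected
% set \mathcal{R}:
\[
  \mathrm{FDR}_{\mathrm{Bayes}} = \frac{1}{|\mathcal{R}|} \sum_{i \in \mathcal{R}}
    P\bigl(H_{0i}\ \text{true} \mid \text{data}\bigr).
\]
```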

Qing Pan,   Dept. of Biostatistics,   University of Michigan.

Methods for Evaluating and Correcting Selection Bias Using Two-stage Weighted Proportional Hazards Models.

In non-randomized biomedical studies using the proportional hazards model, the observed data often constitute a biased (i.e., unrepresentative) sample of the underlying target population, resulting in biased regression coefficients. The bias can be corrected by weighting included subjects by the inverse of their respective selection probabilities, as proposed by Horvitz and Thompson (1952) and extended to the proportional hazards setting for use in surveys by Binder (1992) and Lin (2000). The weights can be treated as fixed in cases where they are known (e.g., chosen by the investigator) or based on voluminous data (e.g., a large-scale survey). However, in many practical applications, the weights are estimated and must be treated as such in order for the resulting inference to be accurate. We propose a two-stage weighted proportional hazards model in which, at the first stage, weights are estimated through a logistic regression model fitted to a representative sample from the target population. At the second stage, a weighted Cox model is fitted to the biased sample. We propose estimators for the regression parameter and the cumulative baseline hazard. Asymptotic properties of the parameter estimators are derived, accounting for the additional variance introduced by the randomness of the weights. The accuracy of the asymptotic approximations in finite samples is evaluated through simulation. The proposed estimation methods are applied to kidney transplant data to quantify the true risk of graft failure associated with expanded criteria donors (ECD). Although parameter estimation is consistent and potentially more efficient under the proposed weighting method, computation is considerably more intensive than for the unweighted model. We therefore propose methods for evaluating bias in the unweighted partial likelihood and Breslow-Aalen estimators. Asymptotic properties of the proposed test statistics are derived, and the finite-sample significance level and power are evaluated through simulation. The proposed methods are then applied to data from a national organ failure registry to evaluate the bias in a post-kidney-transplant survival model.
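
The inverse-probability-weighted (Horvitz-Thompson type) partial likelihood score, sketched in generic notation for orientation rather than copied from the paper:

```latex
% With estimated selection probabilities \hat{\pi}_i and weights w_i = 1/\hat{\pi}_i,
\[
  U_w(\beta) = \sum_{i=1}^{n} w_i\, \delta_i
    \left\{ X_i -
      \frac{\sum_{j} w_j\, Y_j(T_i)\, X_j\, e^{\beta^{\top} X_j}}
           {\sum_{j} w_j\, Y_j(T_i)\, e^{\beta^{\top} X_j}} \right\},
\]
% where \delta_i is the event indicator and Y_j(t) the at-risk indicator. In the two-stage approach,
% the variance of \hat{\beta} must also account for the randomness of \hat{\pi}_i from the
% first-stage logistic model.
```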

Malay Ghosh,   Dept. of Statistics,   University of Florida.

Stein Estimation and Prediction: A Synthesis.

Stein (1956, Proc. 3rd Berkeley Symposium), in his seminal paper, came up with the surprising discovery that the sample mean is an inadmissible estimator of the population mean in three or more dimensions under squared error loss. The past five decades have witnessed multiple extensions and variations of Stein's results. Extension of Stein's results to prediction problems is of more recent origin, beginning with Komaki (2001, Biometrika), George, Liang and Yu (2006, Annals of Statistics), and Ghosh, Mergel and Datta (2006). The present article shows how both the estimation and prediction problems go hand in hand under certain "intrinsic losses," which include both the Kullback-Leibler and Bhattacharyya-Hellinger divergence losses. The estimators dominating the sample mean under such losses are motivated from both the Bayesian and empirical Bayes points of view.
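
The canonical example behind the inadmissibility result, stated for reference:

```latex
% For X \sim N_p(\theta, I_p) with p \ge 3, the James--Stein estimator
\[
  \hat{\theta}_{JS} = \left(1 - \frac{p - 2}{\|X\|^{2}}\right) X
\]
% has uniformly smaller risk than X under squared error loss, so the usual estimator (the sample
% mean) is inadmissible in three or more dimensions.
```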

Linda Young,   Dept. of Statistics,   University of Florida.

Statistical Challenges in Assessing the Relationship Between Environmental Impacts and Health Outcomes.

Citizens are increasingly interested in understanding how the environment affects their health. To address these concerns, a nationwide Environmental Public Health Tracking program has been created. This program, like many other efforts to relate environmental and health outcomes, depends largely on the synthesis of existing data sets; little new data are being collected for this purpose. Generally, the environmental, health, and socio-demographic data needed in such studies have been collected for different geographic or spatial units. Further, the unit of interest may differ from the sampling units. Once a common spatial scale has been established for the analysis, the question of how best to model the relationship between environmental impacts and public health must be addressed. In this paper, these and other statistical challenges of relating environmental impacts to public health will be discussed. Efforts to model the relationship between myocardial infarction and air quality in Florida will illustrate the challenges and potential solutions.

Brian Caffo,   Dept. of Biostatistics,   The Johns Hopkins University.

Statistical Methods in Functional Medical Imaging.

In this talk we consider statistical methods in functional medical imaging. Functional imaging techniques, such as positron emission tomography, single photon emission computed tomography, and functional magnetic resonance imaging, allow for in vivo imaging as the human body functions. The unique nature of the high-volume data generated by these techniques gives rise to interesting statistical problems. In this talk we will discuss two statistical methodological developments in functional imaging. In the first, we consider an application in kinetic imaging of the colon to experimentally understand the mechanics and penetration of a microbicide lubricant. A novel application of fitting three-dimensional statistical curves via a modified principal curve algorithm is illustrated to solve the relevant problem. The algorithm is shown to be tested and validated on a battery of challenging two-dimensional shapes. The second example considers a functional magnetic resonance imaging study of pre-clinical patients at high risk for Alzheimer's disease. Here a novel Bayesian multilevel approach is used to assess paradigm-related connectivity within and between anatomically defined regions of interest. Both examples will be geared toward a general audience without specialized background knowledge in medical imaging statistics.
