Samiran Sinha, University of Florida

Semiparametric Bayesian Analysis of Matched Case-Control Studies with Missing Exposure

We consider Bayesian analysis of matched case-control problems when one of the covariates is partially missing. The standard approach of conditional logistic regression fails to work efficiently in the presence of a missing exposure variable. The present work develops a likelihood-based approach that models the distribution of the partially missing exposure. Within the likelihood context, the standard approach to this problem is to posit a fully parametric model among the controls for the partially missing covariate as a function of the covariates in the model and the variables making up the strata; sometimes the stratum effects are ignored at this stage. Our approach differs not only in that it is Bayesian, but far more importantly in the manner in which we treat the stratum effects. In the matched case-control study with no missing data, the strata are treated completely nonparametrically. Our approach is a Bayesian version of this common frequentist device: we assume a Dirichlet process prior with a normal base measure for the stratum effects and estimate all the parameters in a Bayesian framework. Since the posterior distributions of the parameters are not of standard form, we use Markov chain Monte Carlo (MCMC) techniques to estimate the parameters. Two matched case-control examples and a simulation study illustrate our methods and the computing scheme.

We extend our methods to situations in which the disease has multiple categories, the exposures are mutually associated, or the exposures are measured with error.
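For readers unfamiliar with the nonparametric prior involved, the following is a minimal sketch, not the authors' implementation, of how stratum effects can be drawn from a Dirichlet process prior with a normal base measure via a truncated stick-breaking construction; the concentration parameter, base-measure moments, and truncation level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_stratum_effects(n_strata, alpha=1.0, mu0=0.0, tau0=1.0, trunc=50):
    """Draw stratum effects from a truncated stick-breaking DP(alpha, N(mu0, tau0^2))."""
    v = rng.beta(1.0, alpha, size=trunc)                         # stick-breaking proportions
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))    # mixture weights
    atoms = rng.normal(mu0, tau0, size=trunc)                    # atoms from the normal base measure
    labels = rng.choice(trunc, size=n_strata, p=w / w.sum())
    return atoms[labels]                                         # ties mean strata share effects

effects = dp_stratum_effects(n_strata=100)
print("distinct stratum effects:", np.unique(effects).size)     # typically far fewer than 100
```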


Bernhard Klingenberg, University of Florida

Regression Models for Discrete-Valued Time Dependent Data

Independent random effects in generalized linear models induce an exchangeable correlation structure, but long sequences of counts or binomial observations typically show correlations decaying with increasing lag. This talk introduces models with autocorrelated random effects for a more appropriate, parameter-driven analysis of discrete-valued time series data. We present a Monte Carlo EM algorithm with Gibbs sampling to jointly obtain maximum likelihood estimates of regression parameters and variance components. Marginal mean, variance and correlation properties of the conditionally specified models are derived for Poisson, negative binomial and binary/binomial random components. Estimation of the joint probability of two or more events is possible and is used in predicting future responses. Also, all methods are flexible enough to allow for multiple gaps or missing observations in the observed time series. The approach is illustrated with a time series of 168 monthly counts of polio infections and a long binary time series.
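As a rough illustration of the parameter-driven idea, the sketch below (arbitrary parameter values, not the models fitted in the talk) simulates Poisson counts whose log-means contain a stationary AR(1) random effect and checks empirically that the induced autocorrelation of the counts decays with increasing lag.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_poisson_ar1(T=5000, beta0=1.0, rho=0.8, sigma=0.5):
    """Counts y_t | eps_t ~ Poisson(exp(beta0 + eps_t)), with eps_t a stationary AR(1)."""
    eps = np.empty(T)
    eps[0] = rng.normal(0.0, sigma)
    innov_sd = sigma * np.sqrt(1.0 - rho**2)
    for t in range(1, T):
        eps[t] = rho * eps[t - 1] + rng.normal(0.0, innov_sd)
    return rng.poisson(np.exp(beta0 + eps))

def acf(y, lag):
    """Sample autocorrelation at a given lag."""
    y = y - y.mean()
    return np.dot(y[:-lag], y[lag:]) / np.dot(y, y)

y = simulate_poisson_ar1()
print([round(acf(y, k), 3) for k in (1, 2, 5, 10)])   # correlations shrink toward 0 with lag
```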


David Hitchcock, University of Florida

Smoothing Functional Data for Cluster Analysis

Cluster analysis is an important exploratory tool for analyzing many types of data. In particular, we explore the problem of clustering functional data, which arise as curves, characteristically observed as part of a continuous process. We examine the effect of smoothing such data on dissimilarity estimation and cluster analysis. We prove that a shrinkage method of smoothing results in a better estimator of the dissimilarities among a set of noisy curves. Strong empirical evidence is given that smoothing functional data before clustering results in a more accurate grouping than clustering the observed data without smoothing. An example involving yeast gene expression data illustrates the technique. (This is Ph.D. dissertation research directed by James Booth and George Casella.)
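A toy sketch of the general strategy, not the dissertation's shrinkage estimator: noisy curves generated from two templates are smoothed by discarding high-frequency Fourier coefficients before k-means clustering, and the recovered grouping is compared with clustering of the raw curves (all settings below are made up for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 100)
templates = [np.sin(2 * np.pi * t), np.sin(2 * np.pi * t + 0.6)]
labels = rng.integers(0, 2, size=60)
curves = np.array([templates[g] + rng.normal(0, 0.8, t.size) for g in labels])

def smooth(curve, keep=6):
    """Crude smoother: zero out Fourier coefficients above frequency `keep`."""
    coef = np.fft.rfft(curve)
    coef[keep:] = 0.0
    return np.fft.irfft(coef, n=curve.size)

smoothed = np.array([smooth(c) for c in curves])
for name, data in [("raw curves", curves), ("smoothed curves", smoothed)]:
    est = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    print(name, "adjusted Rand index:", round(adjusted_rand_score(labels, est), 2))
```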


Karabi Sinha, University of Florida

Measurement Error Models in Small Area Estimation

This work looks at the problem of estimation in the small area setup where the covariates are measured with error. In other words, it considers the role of measurement error models in small area estimation.

We consider simultaneous estimation of finite population means for several strata based on two different model structures and assumptions. In each setting, a model-based approach is taken in which the covariates in the superpopulation model are subject to measurement error. In the first setup, empirical Bayes (EB) estimators of the strata means are developed and an asymptotic expression for the mean squared error of the vector of EB estimators is derived. In the second setup, we develop both EB and hierarchical Bayes (HB) estimators of the strata means. In both cases, the findings are supported by appropriate data analyses and further validated by simulation studies.


Sam Wu, University of Florida

t vs. T² in Simultaneous Inference

There is probably no study that does not require multiple hypothesis testing. It is well known that in a general multi-parameter setting there may not exist a unique best test. More importantly, unlike the univariate case, the power of different test procedures can vary remarkably in multivariate analysis. In this talk we discuss three methods of combining dependent univariate tests for the multivariate location problem. A Monte Carlo study indicates that in many cases the powers of the combination methods are much better than those of Hotelling-type tests. A relationship is established between Fisher's method of combining tests and a new class of tests that have best average power for multivariate linear hypotheses. Furthermore, two step-up simultaneous tests regarding the number of true null hypotheses are proposed. It is shown that each procedure controls the familywise error rate in the strong sense. Applications in microarray analysis, QTL detection, and composite analysis of clinical trials are also considered.
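As a schematic illustration rather than the talk's actual procedures, the sketch below combines dependent one-sample t-tests with Fisher's statistic, -2 * sum(log p_i), and calibrates it by sign-flipping, which preserves the dependence among coordinates under a symmetric null; the data-generating settings are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def fisher_stat(data):
    """Fisher's combination of per-coordinate one-sample t-tests of mean zero."""
    pvals = stats.ttest_1samp(data, 0.0).pvalue
    return -2.0 * np.sum(np.log(pvals))

# correlated 5-variate observations with a small common shift
n, k = 30, 5
cov = 0.5 * np.ones((k, k)) + 0.5 * np.eye(k)
x = rng.multivariate_normal(0.4 * np.ones(k), cov, size=n)

obs = fisher_stat(x)
flips = np.array([fisher_stat(x * rng.choice([-1, 1], size=(n, 1)))
                  for _ in range(2000)])            # sign-flip null keeps the dependence
print("combined p-value:", np.mean(flips >= obs))
```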


Jim Hobert, University of Florida

A Mixture Representation of the Stationary Distribution

When a Markov chain satisfies a minorization condition, its stationary distribution can be represented as an infinite mixture. The distributions in the mixture are associated with the hitting times on an accessible atom introduced via the splitting construction of Athreya and Ney (1978). This mixture representation is closely related to perfect sampling and has applications in Markov chain Monte Carlo.
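For intuition, here is a simplified special case (assuming the minorization holds on the whole state space rather than only on an atom): the mixture weights become geometric and the component distributions are iterates of the residual kernel.

```latex
% Whole-space minorization and the resulting mixture representation
P(x,\cdot) \;\ge\; \epsilon\,\nu(\cdot) \quad \text{for all } x,
\qquad
P(x,A) \;=\; \epsilon\,\nu(A) + (1-\epsilon)\,R(x,A),
\qquad
\pi \;=\; \epsilon \sum_{n=0}^{\infty} (1-\epsilon)^{n}\,\nu R^{n},
```

where R is the residual kernel; the geometric weights correspond to the distribution of the hitting time of the atom introduced by the splitting construction.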

(This is joint work with Christian Robert, Universite Paris Dauphine.)


Marianna Pensky, University of Central Florida

Wavelet Kernel Penalized Estimation for Non-equispaced Design Regression

The talk considers regression problems with univariate design points. The design points are irregular, and no assumptions on their distribution are imposed. The regression function is recovered by a wavelet-based reproducing kernel Hilbert space (RKHS) technique with a penalty equal to the sum of blockwise RKHS norms. To simplify the numerical optimization, the problem is replaced by an equivalent quadratic minimization problem with an additional penalty term. The computational algorithm is described in detail and is implemented on both simulated and real data. Comparison with existing methods shows that the proposed technique does not oversmooth the function and is superior in terms of mean squared error. It is also demonstrated that, under additional assumptions on the design points, the method achieves asymptotic optimality over a wide range of Besov spaces.
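To make the blockwise-penalty idea concrete, here is a small sketch, not the paper's estimator: a Haar wavelet basis is evaluated directly at irregular design points, and the coefficients are obtained by penalized least squares with one level-wise quadratic (ridge-type) penalty, a deliberate simplification of the sum-of-blockwise-norms penalty that keeps the solution in closed form; the basis, penalty weights, and test function are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def haar_design(x, levels=4):
    """Columns: constant plus Haar wavelets psi_{j,k} evaluated at irregular x in [0, 1)."""
    cols, blocks = [np.ones_like(x)], [0]
    for j in range(levels):
        for k in range(2 ** j):
            lo, mid, hi = k / 2 ** j, (k + 0.5) / 2 ** j, (k + 1) / 2 ** j
            psi = 2 ** (j / 2) * (((x >= lo) & (x < mid)).astype(float)
                                  - ((x >= mid) & (x < hi)).astype(float))
            cols.append(psi)
            blocks.append(j + 1)                    # block label = resolution level
    return np.column_stack(cols), np.array(blocks)

# irregular design, noisy test signal
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(6 * x) + 0.3 * rng.normal(size=x.size)

Phi, blocks = haar_design(x)
lam = 0.5 * 2.0 ** blocks                           # heavier penalty on finer levels
coef = np.linalg.solve(Phi.T @ Phi + np.diag(lam), Phi.T @ y)
fit = Phi @ coef
print("residual SD:", round(np.std(y - fit), 3))
```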

This is joint work with U. Amato (CNR, Naples, Italy) and A. Antoniadis (Laboratoire IMAG-LMC, University Joseph Fourier, France).


Xueli Liu, University of Florida

Functional Convex Averaging for Time-Warped Random Curves and Its Application to Clustering Temporal Gene Expression Data

Data often arise as a sample of curves in science and engineering. When the dynamics of development, growth or response over time are at issue, subjects or experimental units may experience events at a different temporal pace. For functional data where trajectories may be individually time-transformed, it is usually inadequate to use commonly employed sample statistics such as the cross-sectional mean. One may then consider subjecting each observed curve to a time transformation in an attempt to reverse the warping of the time scale, prior to further statistical analysis. Dynamic time warping, alignment, curve registration and landmark-based methods have been put forward with the goal of finding adequate empirical time transformations.

Previous analyses of warping have typically not been based on a model where individual observed curves are viewed as realizations of a stochastic process. We propose a functional convex synchronization model, under the premise that each observed curve is the realization of a stochastic process. Monotonicity constraints on time evolution provide the motivation for a functional convex calculus with the goal of obtaining sample statistics such as a functional mean. Observed random functions in warped time space are represented by a bivariate random function in synchronized time space, consisting of a stochastic monotone time transformation function and an unrestricted random amplitude function. This leads to the definition of a functional convex average or "longitudinal average", which is in contrast to the conventional "cross-sectional" average. We derive a functional limit theorem and asymptotic confidence intervals for functional convex means. The results are illustrated with a novel time warping transformation. The methods are applied to simulated data and the Berkeley growth data. This nonparametric time-synchronized algorithm is also combined with an iterative mean updating technique to find an overall representation that corresponds to a mode of a sample of gene expression profiles, viewed as a random sample in function space.

This talk is based on joint work with Dr. Hans-Georg Müller.


Xiao-Li Meng, Harvard University

Inference with Monte Carlo Data: A Paradox of "Knowing Too Much"? (Or "How to cure our schizophrenia?")

In the past half century or so, physicists, computer scientists, statisticians and many others have made tremendous advances in designing efficient Monte Carlo algorithms. In contrast, methods for Monte Carlo estimation, namely, using the simulated data generated by these algorithms to estimate quantities of interest, are almost exclusively based on the most primitive estimation technique, that is, taking a sample average or simple variations of it (e.g., importance sampling). So what happened to all these wonderful estimation methods, such as maximum likelihood and Bayesian methods? Given that these methods are so powerful for analyzing real data, where the underlying true model is at best partially known, why are they not used for analyzing simulated data, where the underlying model is completely known (at least in principle)?

Based on a recent paper by Kong, McCullagh, Meng, Nicolae, and Tan (Journal of the Royal Statistical Society, 2003, 585-618), this talk demonstrates that a satisfactory answer to such questions not only satisfies our philosophical curiosity, but, more importantly, can lead to Monte Carlo estimators with efficiencies that we are generally unaware of. In particular, we give a practical example where the new Monte Carlo estimator converges at the super-fast 1/n rate instead of the usual 1/sqrt(n) rate, where n is the size of the simulated data.


Robert Tempelman, Michigan State University

A General Approach to Mixed Effects Modeling of Residual Variances in Generalized Linear Mixed Models

We propose a mixed effects approach to structured heteroskedastic error modeling for generalized linear mixed models, in which linked functions of subject-specific means and residual variances are each specified as separate linear combinations of fixed and random effects. We focus on the linear mixed model (LMM) analysis of Gaussian data and the cumulative probit mixed model (CPMM) analysis of ordinal data. All analyses were based on Markov chain Monte Carlo methods, with each model built on a Bayesian hierarchical construction. The deviance information criterion (DIC) was demonstrated to be useful in correctly choosing between homoskedastic and heteroskedastic error GLMMs for both traits when data were generated according to a mixed model specification for both location parameters and residual variances. Heteroskedastic error LMM and CPMM were fitted, respectively, to birthweight (BW) and calving ease (CE) data on calves from Italian Piemontese first-parity dams. For both traits, residual variances were modeled as functions of fixed calf sex and random herd effects. The posterior mean residual variance for male calves was over 40% greater than that for female calves for both traits. Also, the posterior means of the coefficient of variation of herd-specific variance ratios were estimated to be 0.60±0.09 for BW and 0.74±0.14 for CE. For both traits, the heteroskedastic error LMM and CPMM were chosen over their homoskedastic error counterparts based on DIC values. The benefits of heavy-tailed (e.g., Student t) specifications for structured heteroskedastic error models are also briefly illustrated.
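A toy simulation of the kind of data structure described, with made-up parameter values rather than the Piemontese data: a Gaussian response with a herd random effect in the mean and a log-linear model for the residual variance that includes a fixed sex effect and a herd-specific variance effect.

```python
import numpy as np

rng = np.random.default_rng(5)

n_herds, calves_per_herd = 40, 25
herd = np.repeat(np.arange(n_herds), calves_per_herd)
sex = rng.integers(0, 2, size=herd.size)             # 1 = male, 0 = female

beta0, beta_sex = 40.0, 2.0                           # mean model (illustrative birthweight, kg)
u = rng.normal(0.0, 3.0, n_herds)                     # herd effects on the mean

delta0, delta_sex = np.log(9.0), np.log(1.4)          # log residual-variance model (male variance 40% larger)
v = rng.normal(0.0, 0.3, n_herds)                     # herd effects on the log variance

log_var = delta0 + delta_sex * sex + v[herd]
y = beta0 + beta_sex * sex + u[herd] + rng.normal(0.0, np.exp(0.5 * log_var))

print("empirical residual variance, males vs females:",
      round(np.var(y[sex == 1] - (beta0 + beta_sex + u[herd[sex == 1]])), 2),
      round(np.var(y[sex == 0] - (beta0 + u[herd[sex == 0]])), 2))
```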


Sujit Ghosh, North Carolina State University

Semiparametric Bayesian Inference Based on AFT Models

An accelerated failure time (AFT) semiparametric regression model for censored data is proposed as an alternative to the widely used proportional hazards survival model. The proposed regression model for censored data turns out to be flexible and practically meaningful. Its features include a physical interpretation of the regression coefficients through the mean response time instead of the hazard function. It is shown that the regression model, obtained as a mixture of parametric families, has a proportional mean structure. The statistical inference is based on a nonparametric Bayesian approach that uses a Dirichlet process prior for the mixing distribution. Consistency of the posterior distribution of the regression parameters in the Euclidean metric is established under certain conditions. Finite-sample parameter estimates, along with associated measures of uncertainty, can be computed by an MCMC method. Simulation studies are presented to provide empirical validation of the new method. Some real data examples show the easy applicability of the proposed method.

(joint work with Subhasis Ghosal, NCSU)


Rongling Wu, University of Florida

Functional Mapping: Towards High-Dimensional Biology

Many complex traits inherently undergo marked developmental changes during ontogeny. Traditional mapping approaches that analyze phenotypic data measured at a single time point are too simple to take this high-dimensional biological feature into account. We have developed a general framework, called functional mapping, which establishes the foundation for mapping quantitative trait loci (QTL) that underlie variation in complex traits with dynamic features. Functional mapping provides a useful quantitative and testable framework for assessing the interplay among gene action, development, sex, genetic background and environment.


George Casella, University of Florida

Objective Bayesian Analysis of Contingency Tables

The statistical analysis of contingency tables is typically carried out with a hypothesis test. In the Bayesian paradigm, default priors for hypothesis tests are typically improper and cannot be used. Although proper default priors are available for testing in contingency tables, we show that for testing independence they can be greatly improved upon by so-called intrinsic priors.

We also argue that, because there is no realistic situation corresponding to conditioning on both margins of a contingency table, the proper analysis of an a × b contingency table should condition only on the table total or on one of the margins. The posterior probabilities from the intrinsic priors provide reasonable answers in these cases. Examples using simulated and real data are given.


Jonathan Shuster, University of Florida

Second Guessing Clinical Trial Designs

NB: This is work done jointly with Dr. Myron Chang

Suppose you, as a medical journal reviewer, read a report of a randomized clinical trial that fails to mention any sequential monitoring plan. Can you ask the "what if" question without obtaining the actual data? Surprisingly, in many instances, the answer is yes. If the trial information accrues as approximate Brownian motion, then the joint predictive distribution of any collection of effect size estimates at times before the final analysis depends only upon the effect size estimate at the final analysis. Hence, you can superimpose a hypothetical group sequential design upon the non-sequential design. As a side benefit of this research, reference designs, with optimal or near-optimal expected sample sizes relative to single-stage designs, are presented so that as a reviewer you would simply consult a table to assess the predictive probabilities for each stopping time and the conditional mean sample size. In addition, you can superimpose your own group sequential design upon an actual group sequential design to second-guess what you think was a poor choice of stopping boundaries. In this case, you need the Z-scores (or single-degree-of-freedom chi-squares) at each interim look. This new capability should alter the mindset of those designing clinical trials. Since journal editors and the public can subject the trial to close scrutiny after the fact, trial designers will be more motivated to make their designs as efficient as possible. We shall present two actual examples that were heavily criticized for staying open too long. In one, it is probable that participants were not properly protected, and a report of a critically important public health benefit was withheld unreasonably from the public. In the other, the criticism seems to have been unfounded.
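The key fact can be sketched as follows (a simplified illustration, not the authors' reference tables): if the B-value process B(t) = sqrt(t) * Z(t) behaves as Brownian motion with drift, then conditionally on the final statistic the interim B-values form a Brownian bridge, so their joint distribution is free of the unknown effect size. The snippet simulates interim Z-scores given only a reported final Z and estimates how often a hypothetical O'Brien-Fleming-type boundary would have stopped the trial early; the look times, final Z, and boundary are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

def interim_paths(z_final, looks, n_sim=20000):
    """Simulate B(t) at interim looks given B(1) = z_final (Brownian bridge conditioning)."""
    looks = np.asarray(looks, dtype=float)
    paths = np.empty((n_sim, looks.size))
    for i in range(n_sim):
        b_prev, t_prev = 0.0, 0.0
        for j, t in enumerate(looks):
            mean = b_prev + (t - t_prev) / (1.0 - t_prev) * (z_final - b_prev)
            var = (t - t_prev) * (1.0 - t) / (1.0 - t_prev)
            b_prev = rng.normal(mean, np.sqrt(var))
            t_prev = t
            paths[i, j] = b_prev
    return paths

looks = [0.25, 0.50, 0.75]                  # hypothetical information fractions
z_final = 2.3                               # final Z-score reported in the paper (made up)
b = interim_paths(z_final, looks)
z = b / np.sqrt(looks)                      # interim Z-scores
bound = 1.96 / np.sqrt(looks)               # hypothetical O'Brien-Fleming-type efficacy boundary
print("P(early stop under the superimposed design):", np.mean((z >= bound).any(axis=1)))
```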


Susan Ellenberg, Food and Drug Administration

Statistics and Bioethics: A Necessary Integration

A fundamental ethical principle underlying medical research is that the research be designed and conducted in a scientifically valid way. The Declaration of Helsinki, an international statement of ethical principles for medical research, includes as one of its articles, "Medical research involving human subjects must conform to generally accepted scientific principles, be based on a thorough knowledge of the scientific literature, other relevant sources of information, and on adequate laboratory and, where appropriate, animal experimentation." In addition to the statistician's role in ensuring the validity of research design and conduct (and thereby its ethical acceptability), statisticians are well positioned to identify aspects of study designs that raise specific ethical concerns, and to develop approaches that avoid such concerns. In this presentation, I will review a number of ethical issues that have arisen regarding the design and conduct of clinical trials, and discuss the role of statisticians in addressing these issues. Particular issues for discussion include interim data monitoring and early decision-making, use of placebo controls and design of active control trials, randomized consent designs, adaptive allocation to treatment arm, and trials in special populations.


Susan Ellenberg, Food and Drug Administration

Public Perceptions of Public Health: Challenges and Dilemmas

Public health policy-making is often complicated, involving the weighing of risks and benefits that may not be as precisely defined as one might like, economic issues, and, when the issue at hand is highly visible, anticipated public perceptions. Public perceptions have become increasingly important in an era of rapid and easy access to information (and misinformation) through the World Wide Web, and can create substantial challenges when the relevant issues are scientifically complex, the data are less than optimally definitive, and/or when the stakes are high. For example, an individual with a serious disease may learn about a promising new treatment that is early in clinical development, and may naturally wish to gain access to that treatment, even though little data are as yet available on its clinical effectiveness and its risks. On the other hand, an individual who receives a treatment that has been widely studied and has been deemed "safe and effective" by regulatory authorities, but who then suffers a serious adverse effect of the treatment that had not been recognized earlier as a potential rare risk, may believe that the treatment was made available too early, and that sufficient study should have been required to have identified this adverse effect before the treatment was approved for use. The growth in public advocacy groups is another factor in public health policy and communication. Advocacy groups can be highly effective partners in setting policies and in communicating their rationale; but they can also create obstacles if what appears to public health officials to be appropriate policy is viewed negatively by one or more groups. The complexity of many public health issues can make it difficult to communicate effectively to the public regarding the rationale for policies taken. A number of examples of these challenges and dilemmas will be presented.


Glen Meeden, University of Minnesota

A Synthesis of Objective Bayesian and Design-Based Methods for Finite Population Sampling

In the frequentist, or design-based, approach to finite population sampling, prior information is incorporated in the sampling design. In the Bayesian approach, prior information is incorporated through a prior distribution. Since the posterior distribution does not depend on the design, it has been difficult to reconcile the two approaches theoretically. The Polya posterior is an objective Bayesian approach to finite population sampling that is appropriate when little or no prior information is available. We will discuss generalizations of the Polya posterior that allow one to objectively incorporate into a Bayesian model some types of information that are usually encapsulated in the design. Inferences found through simulation from the resulting posteriors will typically have good frequentist properties and yield sensible estimates of variance.
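A minimal sketch of the basic Polya posterior, without the generalizations discussed in the talk: given the sampled values, the unsampled units are filled in by Polya-urn draws from the observed values, and repeating the completion yields a posterior for any population quantity (population and sample sizes below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(7)

def polya_completion(sample, N):
    """Complete a population of size N from the sample via a Polya urn."""
    urn = list(sample)
    for _ in range(N - len(sample)):
        urn.append(urn[rng.integers(len(urn))])     # draw a value, return it with a copy
    return np.array(urn)

N = 1000
sample = rng.normal(50.0, 10.0, size=100)           # observed part of the population
means = [polya_completion(sample, N).mean() for _ in range(2000)]
print("point estimate:", round(np.mean(means), 2),
      " 95% interval:", np.round(np.percentile(means, [2.5, 97.5]), 2))
```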


Richard Lynch, Six Sigma Academy

Statistics and Six Sigma: An Explanation of Six Sigma for Statisticians

Six Sigma is a widely popular process improvement methodology. In this seminar, an introduction to Six Sigma will be given. The professions of Six Sigma Blackbelt and Master Blackbelt will be presented as career options for statisticians with graduate degrees. The statistical methods embedded in the Six Sigma DMAIC roadmap will be highlighted. In particular, the Gauge R&R study will be examined, and opportunities for research in the realm of Gauge R&R will be noted. The impact of measurement error on product dispositioning will be demonstrated with simulations.
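In the spirit of the simulations mentioned above, here is a sketch with made-up process and gauge parameters showing how measurement error inflates misclassification when parts are dispositioned against specification limits.

```python
import numpy as np

rng = np.random.default_rng(8)

n = 200_000
lsl, usl = 9.0, 11.0                       # specification limits (illustrative)
true = rng.normal(10.0, 0.4, n)            # true part dimension (process variation)
gauge_sd = 0.15                            # measurement (gauge) standard deviation
measured = true + rng.normal(0.0, gauge_sd, n)

in_spec_true = (true >= lsl) & (true <= usl)
in_spec_meas = (measured >= lsl) & (measured <= usl)

false_reject = np.mean(in_spec_true & ~in_spec_meas)   # good parts scrapped or reworked
false_accept = np.mean(~in_spec_true & in_spec_meas)   # bad parts shipped
print(f"false reject rate: {false_reject:.4f}, false accept rate: {false_accept:.4f}")
```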


Joseph Lang, University of Iowa

Profile Confidence Intervals for Contingency Table Parameters

A general method for computing profile confidence intervals for contingency table parameters is described. The method, which is based on the theory of multinomial-Poisson homogeneous models, lends itself to a general computational algorithm and is applicable for a broad class of parameters. The literature suggests that profile score and profile likelihood confidence intervals generally have better coverage properties than their Wald counterparts. Profile intervals have been used on a case-by-case basis for several different contingency table parameters, e.g., the odds ratio, relative risk, and risk difference. These examples in the literature use a common computational approach that has two main limitations: it is case-specific and applicable only to a restricted class of parameters. The method proposed in this presentation avoids these limitations. Examples of profile confidence intervals for a variant of the gamma measure of association, a mean, and a dispersion measure illustrate the method.
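To convey the flavor of a profile interval in the simplest possible case, here is a generic sketch, not the multinomial-Poisson homogeneous-model machinery of the talk: a profile likelihood confidence interval for the log odds ratio of a 2 × 2 table, obtained by maximizing out the nuisance parameter and inverting the likelihood-ratio statistic; the counts are made up.

```python
import numpy as np
from scipy.optimize import minimize_scalar, brentq
from scipy.stats import chi2

# two independent binomials: x successes out of n in each group (made-up counts)
x1, n1, x2, n2 = 30, 100, 18, 120

def loglik(psi, eta):
    """psi = log odds ratio, eta = logit of the group-2 probability."""
    a, b = psi + eta, eta
    return (x1 * a - n1 * np.log1p(np.exp(a))
            + x2 * b - n2 * np.log1p(np.exp(b)))

def profile(psi):
    """Profile log-likelihood: maximize over the nuisance parameter eta."""
    res = minimize_scalar(lambda e: -loglik(psi, e), bounds=(-10, 10), method="bounded")
    return -res.fun

psi_hat = np.log((x1 / (n1 - x1)) / (x2 / (n2 - x2)))     # MLE of the log odds ratio
crit = chi2.ppf(0.95, df=1)
g = lambda psi: 2.0 * (profile(psi_hat) - profile(psi)) - crit

lower = brentq(g, psi_hat - 5.0, psi_hat)                 # roots of the LR statistic = crit
upper = brentq(g, psi_hat, psi_hat + 5.0)
print("95% profile CI for the log odds ratio:", round(lower, 3), round(upper, 3))
```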