Weekly Seminars

Seminars are held on Thursdays from 4:00 to 5:00 p.m. in Griffin-Floyd 100 (unless otherwise noted).

Refreshments are available before the seminars, from 3:30 to 4:00 p.m., in Griffin-Floyd Hall 230.

Fall 2007

Date     Speaker                   Title (click for abstract)

Sep 13   George Casella (UF)       Testing for Clusters, with Applications in Maize Genetics
Sep 20   (no seminar; Linux tutorial at this time)
Sep 27   Mark Yang (UF)            An optimal DNA pooling strategy for progressive fine genetic mapping
Oct 4    Andrew Lawson (USC)       Latent Structure in Space-time Disease Maps: links to environmental hazard events
Oct 11   Bruce Turnbull (Cornell)  Adaptive and Non-Adaptive Group Sequential Trial Designs
Oct 18   Jing Cheng (UF)           Estimation and Inference for the Causal Effect of Receiving Treatment on a Multinomial Outcome
Oct 26   Jinfeng Zhang (FSU)       Novel Monte Carlo Methods for Protein Structure Modeling (UF/FSU seminar)
Nov 1    Vivekananda Roy (UF)      Convergence rates and asymptotic standard errors for MCMC algorithms for Bayesian probit regression
Nov 8    Jie Yang (UF)             Nonparametric Functional Mapping of Quantitative Trait Loci
Nov 13   Lawrence Brown (UPenn)    In-Season Prediction of Batting Averages: A Field-test of Simple Empirical Bayes and Bayes Methodologies (Challis Lecture; general audience)
Nov 14   Lawrence Brown (UPenn)    Nonparametric Density Estimation via the Root-Unroot Transform, with an Adaptive Wavelet Block Thresholding Implementation (Challis Lecture)
Nov 29   Gauri Datta (UGA)         Model selection for count data: Poisson or ZIP?


Testing for Clusters, with Applications in Maize Genetics

George Casella (UF)

The existence of clusters in maize data suggests genetic segregation and could lead to the discovery of new strains.  If clustering is suspected, the sample is then subjected to a more detailed, time-consuming, and expensive analysis.  Thus, there is a need for a good screening test for the existence of clusters.

We model this as a hypothesis test, and attempt to compute a Bayes factor to test the null hypothesis of no clusters.  This leads us down a path of combinatorics involving Stirling numbers, and the derivation of an algorithm to generate random partitions in a constrained cluster space.  Ultimately, we can calculate the Bayes factor using an importance sampler implemented with a hybrid Metropolis-Hastings algorithm.
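For readers unfamiliar with the combinatorics mentioned above: Stirling numbers of the second kind count the ways to partition n items into k non-empty clusters, which is the kind of quantity that arises when summing over a constrained cluster space. A minimal sketch of the standard recurrence (an illustration, not code from the talk):

```python
def stirling2(n, k):
    """Stirling number of the second kind S(n, k): the number of ways
    to partition n labeled items into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # Recurrence: the i-th item either joins one of the j existing clusters
    # or starts a new one: S(i, j) = j*S(i-1, j) + S(i-1, j-1).
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]
```

For example, S(4, 2) = 7: four plants split into two genetically distinct clusters in seven distinguishable ways.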

An optimal DNA pooling strategy for progressive fine genetic mapping

Mark Yang (UF)

A basic strategy to optimize DNA pooling for expense reduction while maintaining a desirable resolution in genetic fine mapping is developed. Although any genotyping en masse will lead to loss of estimation accuracy, the loss may be compensated for by a small increase in sample size, suggesting that we can reduce the overall experimental expense by balancing the costs of genotyping and sample collection. Within a certain range that depends on the parameter scenario, a pooling scheme can reach nearly the same precision as individual genotyping by increasing the number of individuals by only a few percent, while reducing the genotyping effort to only a small fraction of what is needed without pooling. Further, the genotyping burden can be reduced through multi-stage pooling. Numerical results are provided as practical guidelines. Finally, we use a set of genetic data on mapping the rice xl(t) gene to demonstrate the feasibility and efficiency of the DNA pooling strategy. Taken together, the results demonstrate that this DNA pooling strategy can greatly reduce the genotyping burden and the overall cost in fine mapping experiments.

Latent Structure in Space-time Disease Maps: links to environmental hazard events

Andrew Lawson (USC)

In the assessment of the linkage between environmental risk gradients and health outcomes, there is often a need to consider the possibility that risk is multi-faceted. Many models for disease risk in the spatial or spatio-temporal domain regard the map as being defined by global parameters with single underlying components. However, it is commonly true that unknown risk structures (multiple risk gradients, or population sub-groups) could interweave to result in a single realisation of disease. In this talk I will first discuss the context of spatio-temporal modeling of disease. I will then describe two Bayesian approaches to the analysis of latent structure in space-time disease maps: 1) time-dependent latent mixture components with spatial weights; 2) covariate spline interaction models. A comparison will be made with a standard Bayesian ST model. Issues of identifiability and appropriateness of DIC will be discussed.

Adaptive and Non-Adaptive Group Sequential Trial Designs

Bruce Turnbull (Cornell)

Methods have been proposed to re-design a clinical trial at an interim stage in order to increase power.  This may be in response to external factors which indicate power should be sought at a smaller effect size, or it could be a reaction to data observed in the study itself.  In order to preserve the type I error rate, methods for unplanned design change have to be defined in terms of non-sufficient statistics, and this calls into question their efficiency and the credibility of conclusions reached.  We evaluate schemes for adaptive re-design, assessing the possible benefits of pre-planned adaptive designs by numerical computation of optimal tests; these optimal adaptive designs are concrete examples of the optimal sequentially planned sequential tests proposed by Schmitz (1993).  We conclude that the flexibility of unplanned adaptive designs comes at a price, and we recommend that the appropriate power for a study be determined as thoroughly as possible at the outset.  Then, standard error spending tests, possibly with unevenly spaced analyses, provide efficient designs. However, it is still possible to fall back on flexible methods for re-design should study objectives change unexpectedly once the trial is under way. (This is joint work with Chris Jennison.)

Estimation and Inference for the Causal Effect of Receiving Treatment on a Multinomial Outcome

Jing Cheng (UF)

Noncompliance is a common problem in randomized clinical trials.  When there is noncompliance, there is often interest in estimating the causal effect of actually receiving the treatment compared to receiving the control. There is a large literature on methods of analysis for clinical trials with noncompliance which have continuous or binary outcomes. Clinical trials with multinomial outcomes are common in practice but not much attention has been paid to analyzing these trials with noncompliance. This talk considers the analysis of these trials. We first define the causal effect in these trials and estimate it with the likelihood method, and then conduct inference based on the likelihood ratio test. The likelihood ratio statistic follows a chi-squared distribution asymptotically when the true values of parameters are in the interior of the parameter space under the null. However, when the true values of parameters are on the boundary of the parameter space, the likelihood ratio statistic does not have an asymptotic chi-squared distribution. Therefore, we propose a bootstrap/double bootstrap version of a likelihood ratio test for the causal effect in these trials. The methods are illustrated by an analysis of data from a randomized trial of an encouragement intervention to improve adherence to prescribed depression treatments among depressed elderly patients in primary care practices.
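The bootstrap calibration mentioned at the end follows the generic parametric-bootstrap recipe: fit the null model, simulate datasets from it, and compare the observed likelihood ratio statistic against the simulated ones. A hedged sketch of that recipe (the function names and interface are ours, not the speaker's):

```python
import numpy as np

def bootstrap_lrt_pvalue(lr_observed, simulate_null, lr_statistic,
                         n_boot=999, seed=0):
    """Parametric bootstrap p-value for a likelihood ratio test.
    Useful when true parameter values may lie on the boundary of the
    parameter space, so the usual chi-squared reference distribution
    for the LR statistic does not apply.

    simulate_null(rng) -> one dataset drawn under the fitted null model
    lr_statistic(data) -> the LR statistic recomputed on that dataset
    """
    rng = np.random.default_rng(seed)
    lr_boot = np.array([lr_statistic(simulate_null(rng))
                        for _ in range(n_boot)])
    # Add 1 to numerator and denominator so the p-value is never zero.
    return (1 + np.sum(lr_boot >= lr_observed)) / (n_boot + 1)
```

The double bootstrap mentioned in the abstract nests a second bootstrap level inside each replicate to calibrate this p-value further; that refinement is omitted here.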

Novel Monte Carlo Methods for Protein Structure Modeling

Jinfeng Zhang (FSU)

Proteins are the machinery of life and are involved in almost all biological processes. A protein functions through a well-defined three-dimensional structure. The next step after genome sequencing projects is to understand the sequence and structure relationship of proteins or genes encoded in the genomes. Such understanding will fundamentally advance biology and public health. The protein structure modeling problem has been formulated as first defining a free energy function for protein structures, which follow a Boltzmann distribution, and then designing a sampling method so that protein structures can be sampled from the distribution. Both defining the energy function and designing sampling methods are very challenging. We developed new Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) methods for studying protein structures. The MCMC method was tested on a simplified protein folding model, the HP model. Finding minimum energies for HP sequences is still an open problem with a 20-year history. The new method significantly outperformed all previously reported methods. SMC was applied to estimate the entropy and free energy of ensembles of protein structures, one of the most challenging problems in computational chemistry and biophysics. I will present some interesting findings discovered through structures sampled using the method.

Convergence rates and asymptotic standard errors for MCMC algorithms for Bayesian probit regression

Vivekananda Roy (UF)

Consider a probit regression problem in which Y_1, ..., Y_n are independent Bernoulli random variables such that Pr(Y_i = 1) = Φ(x_i^T β), where x_i is a p-dimensional vector of known covariates associated with Y_i, β is a p-dimensional vector of unknown regression coefficients, and Φ(·) denotes the standard normal distribution function. We study Markov chain Monte Carlo algorithms for exploring the intractable posterior density that results when the probit regression likelihood is combined with a flat prior on β. We prove that Albert and Chib's (1993) data augmentation algorithm and Liu and Wu's (1999) PX-DA algorithm both converge at a geometric rate, which ensures the existence of central limit theorems (CLTs) for ergodic averages under a second moment condition. While these two algorithms are essentially equivalent in terms of computational complexity, from Hobert and Marchev (2007) it follows that the PX-DA algorithm is theoretically more efficient in the sense that the asymptotic variance in the CLT under the PX-DA algorithm is no larger than that under Albert and Chib's algorithm. We calculate the standard error of the ergodic averages using regenerative simulation. As an illustration, we apply our results to van Dyk and Meng's (2001) lupus data. This particular example demonstrates that huge gains in efficiency are possible by using the PX-DA algorithm instead of Albert and Chib's algorithm.
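For readers unfamiliar with the first of these algorithms: Albert and Chib's data augmentation sampler alternates between drawing latent truncated-normal variables and drawing β from its normal full conditional. A minimal sketch under the flat prior (variable names and setup are ours, for illustration only):

```python
import numpy as np
from scipy.stats import truncnorm

def albert_chib_sampler(X, y, n_iter=1000, seed=0):
    """Data augmentation sampler for Bayesian probit regression with a
    flat prior on beta (Albert & Chib, 1993). X is n x p, y is 0/1."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # Latent z_i ~ N(x_i'beta, 1) truncated to (0, inf) if y_i = 1
        # and to (-inf, 0] if y_i = 0 (bounds standardized for truncnorm).
        mean = X @ beta
        lower = np.where(y == 1, -mean, -np.inf)
        upper = np.where(y == 1, np.inf, -mean)
        z = mean + truncnorm.rvs(lower, upper, size=n, random_state=rng)
        # beta | z ~ N((X'X)^{-1} X'z, (X'X)^{-1}) under the flat prior.
        beta = XtX_inv @ (X.T @ z) + chol @ rng.standard_normal(p)
        draws[t] = beta
    return draws
```

The PX-DA variant adds a cheap extra scale-expansion step between these two draws, which is what yields the smaller asymptotic variance discussed in the abstract.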

Nonparametric Functional Mapping of Quantitative Trait Loci

Jie Yang (UF)

Functional mapping is a powerful tool for mapping quantitative trait loci (QTLs) that control dynamic traits. This method incorporates mathematical aspects of biological processes into the mixture model-based likelihood setting for QTL mapping, thus increasing the power of QTL detection and the precision of parameter estimation. However, in many situations there is no obvious functional form and, in such cases, this strategy will not be optimal. Here we propose to use nonparametric function estimation, typically implemented with B-splines, to estimate the underlying functional form of phenotypic trajectories, and then construct a nonparametric test to find evidence of existing quantitative trait loci. Using the representation of a nonparametric regression as a mixed model, the final test statistic is a likelihood ratio test. We consider two types of genetic maps: dense maps and general maps, and the power of nonparametric functional mapping is investigated through simulation studies and demonstrated by examples.

In-Season Prediction of Batting Averages: A Field-test of Simple Empirical Bayes and Bayes Methodologies

Lawrence Brown (UPenn)

Batting average is one of the principal performance measures for an individual baseball player. It has a simple numerical structure: the number of successful attempts, “Hits”, as a proportion of the total number of qualifying attempts, “At-Bats”. This situation, with Hits as a number of successes within a qualifying number of attempts, makes it natural to statistically model each player’s batting average as a binomial variable outcome, with a given value of AB_i and a true (but unknown) value of p_i that represents the player’s latent ability. This is a common data structure in many statistical applications, and so the methodological study here has implications for a wide range of applications.

We will look at batting records for every Major League player over the course of a single season (2005). The primary focus is on using only the batting record from an earlier part of the season (e.g., the first 3 months) in order to predict the batter’s latent ability, pi, and consequently to predict their batting-average performance for the remainder of the season. Since we are using a season that has already concluded, we can validate our predictive performance by comparing the predicted values to the actual values for the remainder of the season.

The methodological purpose of this study is to gain experience with a variety of predictive methods applicable to a much wider range of situations. Several of the methods to be investigated derive from empirical Bayes and hierarchical Bayes interpretations. Although the general ideas behind these techniques have been understood for many decades*, some of these methods have only been refined relatively recently in a manner that promises to more accurately fit data such as that at hand.

One feature of all of the statistical methodologies here is the preliminary use of a particular form of variance stabilizing transformation in order to transform the binomial data problem into a somewhat more familiar structure involving (approximately) Normal random variables with known variances. This transformation technique is also useful in validating the binomial model assumption that is the conceptual basis for all our analyses.
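A standard variance-stabilizing choice for binomial data of this kind is the arcsine-root transform; a small sketch (the exact offsets used in the talk may differ — the 1/4 and 1/2 refinements below are one common mean-matched version, and the snippet is ours):

```python
import math

def arcsine_vst(hits, at_bats):
    """Arcsine-root variance-stabilizing transform for binomial counts.
    With H ~ Binomial(N, p), X = arcsin(sqrt((H + 1/4)/(N + 1/2))) is
    approximately Normal(arcsin(sqrt(p)), 1/(4N)) for moderate N."""
    return math.asin(math.sqrt((hits + 0.25) / (at_bats + 0.5)))
```

Because the variance 1/(4N) is known after transformation, empirical Bayes shrinkage (e.g. toward a league-wide mean) can be applied on the transformed scale, with predictions mapped back via p = sin(X)^2.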

* A particularly relevant background reference is Efron, B. and Morris, C. (1977), “Stein’s paradox in statistics,” Scientific American 236, 119-127, and the earlier, more technical version (1975), “Data analysis using Stein’s estimator and its generalizations,” Jour. Amer. Stat. Assoc. 70, 311-319.

Nonparametric Density Estimation via the Root-Unroot Transform, with an Adaptive Wavelet Block Thresholding Implementation

Lawrence Brown (UPenn)

Nonparametric density estimation has traditionally been treated separately from nonparametric regression. Here, we propose an approach that first transforms a density estimation problem into a nonparametric regression problem. The algorithm for this involves suitably binning the observations and then transforming the binned data counts via a carefully chosen square-root transformation. Then any suitable nonparametric regression procedure can be used.

Here, a wavelet block-thresholding rule is used for the transformed regression problem, and this produces an estimated nonparametric regression function. Finally, an adjusted un-root transform is applied to yield the final nonparametric density estimator.

The procedure is easy to implement. It enjoys a high degree of asymptotic adaptivity and is shown in numerical examples to perform well for standard density estimation settings. As time permits, we will also discuss a corresponding procedure to produce confidence bands to accompany the nonparametric regression and density estimators.
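The binning and root steps above can be illustrated concretely (the wavelet regression itself is omitted; the sqrt(count + 1/4) form is one common mean-matched choice, and this snippet is only our sketch of the idea):

```python
import numpy as np

def root_transform_bins(data, n_bins=64, range_=None):
    """Bin observations and apply the root transform Y_j = sqrt(N_j + 1/4).
    The transformed bin counts are approximately Normal with a known,
    constant variance, which turns density estimation into a standard
    fixed-variance nonparametric regression problem."""
    counts, edges = np.histogram(data, bins=n_bins, range=range_)
    y = np.sqrt(counts + 0.25)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, y
```

After fitting any nonparametric regression m(t) to (centers, y), the density estimate is recovered by unrooting: f(t) is proportional to m(t)**2, normalized to integrate to one.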

Model selection for count data: Poisson or ZIP?

Gauri Datta (UGA)

Count data are often encountered in many studies, most frequently in disease modeling. The Poisson distribution, which is usually adopted to model such data, sometimes does not work well in the presence of many zeros. To account for excessive zeros in count data, a zero-inflated Poisson (ZIP) distribution is suggested in the literature. A ZIP distribution is a mixture of a standard Poisson distribution and a degenerate distribution at zero, with a mixing probability p.

The ZIP distribution has been used both for independent and identically distributed (i.i.d.) observations and in the non-i.i.d. case where suitable auxiliary variables are available to model the mean. In the latter case, which is referred to as a ZIP regression model, each count is assumed to have a different distribution depending on some explanatory variable(s), and suitable generalized linear models are fitted to the Poisson parameter and/or to the mixing probability. Although there are a number of frequentist papers discussing statistical inference for such models, Bayesian contributions to this problem are limited. In this talk, we propose two Bayesian solutions to this problem. In our first solution, treating it as a model selection problem, we rewrite the ZIP model as a mixture of a zero-truncated Poisson distribution and a degenerate distribution at zero. We justify an objective prior for the new parameters. Using this prior and the standard Jeffreys' prior for the Poisson mean, we obtain the Bayes factor for the ZIP model versus the standard Poisson model. In our second approach, in the i.i.d. setup we embed the ZIP model into a larger class of models by suitably extending the parameter space. Our Bayesian test depends on the posterior probability of the hypothesis of zero inflation. Some applications of both solutions and a suitable extension to the regression case will be discussed.
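For concreteness, the ZIP mixture described above has a simple pmf; a small sketch (ours, not the speaker's) comparing the zero probability under a plain Poisson and a ZIP model:

```python
import math

def zip_pmf(k, lam, p):
    """Zero-inflated Poisson pmf: a point mass at zero with probability p,
    mixed with a Poisson(lam) with probability 1 - p."""
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    if k == 0:
        return p + (1 - p) * poisson
    return (1 - p) * poisson
```

With lam = 2 and p = 0.3, the ZIP zero probability is 0.3 + 0.7*exp(-2), roughly 0.39, versus exp(-2), roughly 0.14, for a plain Poisson with the same rate: exactly the excess-zero behavior the model selection question targets.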

Past Seminars

Spring 2007 Fall 2006 Spring 2006 Fall 2005
Spring 2005 Fall 2004 Spring 2004 Fall 2003
Spring 2003 Fall 2002 Spring 2002 Fall 2001
Spring 2001 Fall 2000 Spring 2000 Fall 1999