UF Statistics Seminar Schedule

Seminars are held from 4:00 p.m. - 5:00 p.m. in Griffin-Floyd Hall 100 unless otherwise noted.

Refreshments are available before the seminars from 3:30 p.m. - 4:00 p.m. in Griffin-Floyd Hall 103.

Fall 2017

Date Speaker

Title

Sep 14 Miles Lopes (University of California, Davis)

Bootstrap Methods for High-Dimensional and Large-Scale Data

Sep 21 Gen Li (Columbia University)

A general framework for the association analysis of heterogeneous data

Oct 3 Hira Koul (Michigan State University)

Goodness-of-fit Testing of Error Distribution in Linear Measurement Error Models

Oct 12 Galin Jones (University of Minnesota)

Bayesian Penalized Regression (and a little MCMC)

Oct 26 Mariana Pensky (University of Central Florida)

Classification with many classes: challenges and pluses

Nov 2 Jason Roy (University of Pennsylvania)

Outcome Identification in Electronic Health Records using Predictions from an Enriched Dirichlet Process Mixture

Nov 9 Hani Doss (University of Florida)

An MCMC Approach to Empirical Bayes Inference and Bayesian Sensitivity Analysis via Empirical Processes

Nov 16 Lifeng Lin (Florida State University)

Assessing publication bias in meta-analysis

Nov 30 Rebecca Steorts (Duke University)

Entity Resolution with Societal Impacts in Statistical Machine Learning


Bootstrap Methods for High-Dimensional and Large-Scale Data

Miles Lopes (University of California, Davis)

Bootstrap methods are among the most broadly applicable tools for statistical inference and uncertainty quantification. Although these methods have an extensive literature, much remains to be understood about their applicability in modern settings, where observations are high-dimensional, or where the quantity of data outstrips computational resources. In this talk, I will present a couple of new bootstrap methods that are tailored to these settings. First, I will discuss the topic of "spectral statistics" arising from high-dimensional sample covariance matrices, and describe a method for approximating the laws of such statistics. Second, in the context of large-scale data, I will discuss a more unconventional application of the bootstrap -- dealing with the tradeoff between accuracy and computational cost for ensemble classifiers. More specifically, I will explain how the bootstrap can be used to decide when an ensemble of classifiers trained by bagging or random forests is sufficiently large. This will include joint work with Alexander Aue and Andrew Blandino.

A general framework for the association analysis of heterogeneous data

Gen Li (Columbia University)

Multivariate association analysis is of primary interest in many applications. Despite the prevalence of high-dimensional and non-Gaussian data (such as count-valued or binary), most existing methods only apply to low-dimensional datasets with continuous measurements. We develop a new framework for the association analysis of two sets of high-dimensional and heterogeneous (continuous/binary/count) data. We model heterogeneous random variables using exponential family distributions, and exploit a structured decomposition of the underlying natural parameter matrices to identify shared and individual patterns for two datasets. We also introduce a new measure of the strength of association, and a permutation-based procedure to test its significance. An alternating iteratively reweighted least squares algorithm is devised for model fitting, and several variants are developed to expedite computation and achieve variable selection. The application to the Computer Audition Lab 500-song (CAL500) music annotation study sheds light on the relationship between acoustic features and semantic annotations, and provides an effective means for automatic annotation and music retrieval.

Goodness-of-fit Testing of Error Distribution in Linear Measurement Error Models

Hira Koul (Michigan State University)

In this talk we shall discuss a class of goodness-of-fit tests for the error density function in linear measurement error regression models, using deconvolution kernel density estimators of the regression model error density. The test statistic is an analog of the Bickel and Rosenblatt type test statistic. The asymptotic null distribution of the proposed test statistic is derived for both the ordinary smooth and super smooth cases. The consistency against a fixed alternative and the asymptotic power of the proposed tests against a class of local nonparametric alternatives are also obtained for both cases. A finite sample simulation study shows some superiority of the proposed test over the few other existing tests. Joint work with Weixing Song and Xiaoyu Zhu.

Bayesian Penalized Regression (and a little MCMC)

Galin Jones (University of Minnesota)

I will consider ordinary least squares, lasso, bridge, and ridge regression methods under a unified framework. The particular method is determined by the form of the penalty term, which is typically chosen by cross validation. The goal is to introduce a fully Bayesian approach which allows selection of the penalty through posterior inference if desired and discuss how to use a type of model averaging approach to eliminate the nuisance penalty parameters. Sufficient conditions for the posterior to concentrate near the true regression coefficients as the dimension grows with sample size will be discussed.

The resulting posterior is analytically intractable and requires a component-wise Markov chain Monte Carlo algorithm. The MCMC estimation problem is highly multivariate, an issue which has been largely ignored in the MCMC literature. A new relative-volume simulation termination rule will be introduced and connected to a new concept of effective sample size. This allows termination of the simulation in a principled manner.

Numerical results show that the proposed model and MCMC method tend to select the optimal penalty and perform well in both variable selection and prediction. Examples will be provided.

Classification with many classes: challenges and pluses

Mariana Pensky (University of Central Florida)

We consider high-dimensional multi-class classification of normal vectors, where, unlike in standard settings, the number of classes may also be large. We derive (non-asymptotic) conditions on the effects of significant features, and lower and upper bounds on the distances between classes, required for successful feature selection and classification with a given accuracy. Furthermore, we study an asymptotic setup where the number of classes grows with the dimension of the feature space and the sample sizes. To the best of our knowledge, our paper is the first to study this important model. In particular, we present an interesting and, at first glance, somewhat counter-intuitive phenomenon: the precision of classification can improve as the number of classes grows.

Outcome Identification in Electronic Health Records using Predictions from an Enriched Dirichlet Process Mixture

Jason Roy (University of Pennsylvania)

We propose a novel semiparametric model for the joint distribution of a continuous longitudinal outcome and the baseline covariates using an enriched Dirichlet process (EDP) prior. This joint model decomposes into a linear mixed model for the outcome given the covariates and marginals for the covariates. The nonparametric EDP prior is placed on the regression and spline coefficients, the error variance, and the parameters governing the predictor space. We predict the outcome at unobserved time points for subjects with data at other time points as well as for completely new subjects with covariates only. We find improved prediction over mixed models with Dirichlet process (DP) priors when there are a large number of covariates. Our method is demonstrated with electronic health records consisting of initiators of second generation antipsychotic medications, which are known to increase the risk of diabetes. We use our model to predict laboratory values indicative of diabetes for each individual and assess incidence of suspected diabetes from the predicted dataset. Our model also serves as a functional clustering algorithm in which subjects are clustered into groups with similar longitudinal trajectories of the outcome over time.

An MCMC Approach to Empirical Bayes Inference and Bayesian Sensitivity Analysis via Empirical Processes

Hani Doss (University of Florida)

We consider situations in Bayesian analysis where the prior pi_h on the parameter theta is indexed by a continuous hyperparameter h, and we deal with two related problems. The first problem is as follows. Let m(h) be the marginal likelihood of the data (this is the likelihood of the data with theta integrated out with respect to the prior pi_h). The problem is to construct a confidence interval (or region if h is multidimensional) for argmax_h m(h), the value of h for which the marginal likelihood of the data is largest. This value of h is, by definition, the empirical Bayes choice of h. If for each h, \hat{m}(h) is an estimate of m(h)---typically obtained via MCMC---then we may estimate argmax_h m(h) via argmax_h \hat{m}(h). The second problem is as follows. Suppose we fix a function g of theta. Let I(h) be the posterior expectation of g(theta) when the hyperparameter of the prior is h. This is also typically estimated via MCMC. The problem is to construct confidence bands for I(h) that are valid simultaneously for all h. The first problem is in some sense a model selection problem, and the second is a form of Bayesian sensitivity analysis. The two problems are actually closely related in that to solve either of them we need uniformity (in h) of the convergence of the estimates. We show how tools from various parts of probability theory can be used to deal with these two problems. The methodology we develop applies very generally. As an illustration, we consider Latent Dirichlet Allocation, which is heavily used in topic modelling. The two-dimensional hyperparameter that governs this model has a large impact on inference. We show how our methodology can be used to select it. We also show how to form globally-valid "two-dimensional confidence bands" for certain posterior probabilities of interest as h varies continuously, and give an illustration on a corpus consisting of a set of articles from Wikipedia. Joint work with Yeonhee Park (MD Anderson Cancer Center). 

Assessing publication bias in meta-analysis

Lifeng Lin (Florida State University)

Publication bias is a serious problem in systematic reviews and meta-analyses, which can jeopardize the validity and generalization of conclusions. Current approaches to dealing with publication bias fall into two classes: selection models and funnel-plot-based methods. Although various statistical methods are available to assess publication bias, their performance has not been systematically evaluated in comprehensive real applications. We compare several commonly used publication bias tests and evaluate their agreement based on a large collection of meta-analyses from the Cochrane Library. The agreement among most tests is found to be low or moderate. In addition, while testing for publication bias is popular, measures for quantifying publication bias are seldom studied in the literature. Such measures can be used as a characteristic of a meta-analysis, and they permit comparisons of publication biases between different meta-analyses. Egger’s regression intercept is a candidate measure, but it lacks an intuitive interpretation. We introduce a new measure, the skewness of the standardized deviates, to quantify publication bias. This measure describes the asymmetry of the collected studies’ distribution. Also, a new test for publication bias is derived based on the skewness. Large sample properties of the new measure are studied, and its performance is illustrated using simulations and case studies.

Entity Resolution with Societal Impacts in Statistical Machine Learning

Rebecca Steorts (Duke University)

Very often information about social entities is scattered across multiple databases. Combining that information into one database can result in enormous benefits for analysis, resulting in richer and more reliable conclusions. In practical applications, however, analysts cannot simply link records across databases based on unique identifiers, such as social security numbers, either because they are not a part of some databases or are not available due to privacy concerns. Analysts need to use methods from statistical and computational science known as entity resolution (record linkage or de-duplication) to proceed with analysis. Entity resolution is not only a crucial task for social science and industrial applications, but is a challenging statistical and computational problem itself. In this talk, we describe the past and present challenges with entity resolution, with applications to the Syrian conflict as well as official statistics and the food and music industry. This large collaboration touches on research that is crucial to problems with societal impacts that are at the forefront of both national and international news.