UF Statistics Seminar Schedule
Seminars are held from 4:00 p.m. to 5:00 p.m. in Griffin-Floyd 100 unless otherwise noted.
Refreshments are available before the seminars from 3:30 p.m. to 4:00 p.m. in Griffin-Floyd Hall 103.
Fall 2017
Date  Speaker
Title

Sep 14  Miles Lopes (University of California, Davis)
Bootstrap Methods for High-Dimensional and Large-Scale Data
Sep 21  Gen Li (Columbia University)
A general framework for the association analysis of heterogeneous data
Oct 3  Hira Koul (Michigan State University)
Goodness-of-fit Testing of Error Distribution in Linear Measurement Error Models
Oct 12  Galin Jones (University of Minnesota)
Bayesian Penalized Regression (and a little MCMC)
Oct 26  Mariana Pensky (University of Central Florida)
Classification with many classes: challenges and pluses
Nov 2  Jason Roy (University of Pennsylvania)
Outcome Identification in Electronic Health Records using Predictions from an Enriched Dirichlet Process Mixture
Nov 9  Hani Doss (University of Florida)
An MCMC Approach to Empirical Bayes Inference and Bayesian Sensitivity Analysis via Empirical Processes
Nov 16  Lifeng Lin (Florida State University)
Assessing publication bias in meta-analysis
Nov 30  Rebecca Steorts (Duke University)
Entity Resolution with Societal Impacts in Statistical Machine Learning
Abstracts
Bootstrap Methods for High-Dimensional and Large-Scale Data
Miles Lopes (University of California, Davis)
Bootstrap
methods are among the most broadly applicable tools for statistical
inference and uncertainty quantification. Although these methods have
an extensive literature, much remains to be understood about their
applicability in modern settings, where observations are
high-dimensional, or where the quantity of data outstrips computational
resources. In this talk, I will present a couple of new bootstrap
methods that are tailored to these settings. First, I will discuss the
topic of "spectral statistics" arising from high-dimensional sample
covariance matrices, and describe a method for approximating the laws
of such statistics. Second, in the context of large-scale data, I will
discuss a more unconventional application of the bootstrap: dealing
with the tradeoff between accuracy and computational cost for ensemble
classifiers. More specifically, I will explain how the bootstrap can be
used to decide when an ensemble of classifiers trained by bagging or
random forests is sufficiently large. This will include joint work with
Alexander Aue and Andrew Blandino.

A general framework for the association analysis of heterogeneous data 
Gen Li (Columbia University)
Multivariate association analysis is of primary interest in many applications.
Despite the prevalence of high-dimensional and non-Gaussian data (such
as count-valued or binary), most existing methods only apply to
low-dimensional datasets with continuous measurements. We develop a new
framework for the association analysis of two sets of high-dimensional
and heterogeneous (continuous/binary/count) data. We model
heterogeneous random variables using exponential family distributions,
and exploit a structured decomposition of the underlying natural
parameter matrices to identify shared and individual patterns for two
datasets. We also introduce a new measure of the strength of
association, and a permutation-based procedure to test its
significance. An alternating iteratively reweighted least squares
algorithm is devised for model fitting, and several variants are
developed to expedite computation and achieve variable selection. The
application to the Computer Audition Lab 500-song (CAL500) music
annotation study sheds light on the relationship between acoustic
features and semantic annotations, and provides an effective means for
automatic annotation and music retrieval.

Goodness-of-fit Testing of Error Distribution in Linear Measurement Error Models
Hira Koul (Michigan State University)
In this talk we shall discuss a class of goodness-of-fit tests for the error density function in linear measurement error regression models, using deconvolution kernel density estimators of the regression model error density. The test statistic is an analog of the Bickel and Rosenblatt type test statistic. The asymptotic null distribution of the proposed test statistic is derived for both the ordinary smooth and super smooth cases. The consistency against a fixed alternative and the asymptotic power of the proposed tests against a class of local nonparametric alternatives are also obtained for both cases. A finite sample simulation study shows some superiority of the proposed test compared to the very few other existing tests. Joint work with Weixing Song and Xiaoyu Zhu.

Bayesian Penalized Regression (and a little MCMC) 
Galin Jones (University of Minnesota)
I will consider ordinary least squares, lasso, bridge, and ridge regression methods under a unified framework. The particular method is determined by the form of the penalty term, which is typically chosen by cross validation. The goal is to introduce a fully Bayesian approach which allows selection of the penalty through posterior inference if desired and to discuss how to use a type of model averaging approach to eliminate the nuisance penalty parameters. Sufficient conditions for the posterior to concentrate near the true regression coefficients as the dimension grows with sample size will be discussed. The resulting posterior is analytically intractable and requires a componentwise Markov chain Monte Carlo algorithm. The MCMC estimation problem is highly multivariate, an issue which has been largely ignored in the MCMC literature. A new relative-volume simulation termination rule will be introduced and connected to a new concept of effective sample size. This allows termination of the simulation in a principled manner. Numerical results show that the proposed model and MCMC
method tend to select the optimal penalty and perform well in both
variable selection and prediction. Examples will be provided.
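
As a point of reference, one standard way to write the four estimators named above under a single objective is the bridge-type penalized least squares criterion sketched below. This is a textbook formulation and an assumption here: the abstract does not state the exact parametrization used in the talk, and the symbols \lambda and \alpha are introduced only for illustration.

```latex
\hat{\beta}(\lambda, \alpha)
  = \arg\min_{\beta \in \mathbb{R}^p}
    \; \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^{\alpha},
  \qquad \lambda \ge 0,\ \alpha > 0 .
% \lambda = 0        : ordinary least squares
% \alpha = 1         : lasso
% \alpha = 2         : ridge
% general \alpha > 0 : bridge
```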

Classification with many classes: challenges and pluses 
Mariana Pensky (University of Central Florida)
We consider high-dimensional multi-class classification of normal vectors, where, unlike standard assumptions, the number of classes may also be large. We derive the (non-asymptotic) conditions on effects of significant features, and the lower and upper bounds for distances between classes required for successful feature selection and classification with a given accuracy. Furthermore, we study an asymptotic setup where the number of classes is growing with the dimension of feature space and sample sizes. To the best of our knowledge, our paper is the first to study this important model. In particular, we present an interesting and, at first glance, somewhat counterintuitive phenomenon that the precision of classification can improve as the number of classes grows.

Outcome Identification in Electronic Health Records using Predictions from an Enriched Dirichlet Process Mixture 
Jason Roy (University of Pennsylvania)
We propose a novel semiparametric model for the joint distribution of a continuous longitudinal outcome and the baseline covariates using an enriched Dirichlet process (EDP) prior. This joint model decomposes into a linear mixed model for the outcome given the covariates and marginals for the covariates. The nonparametric EDP prior is placed on the regression and spline coefficients, the error variance, and the parameters governing the predictor space. We predict the outcome at unobserved time points for subjects with data at other time points as well as for completely new subjects with covariates only. We find improved prediction over mixed models with Dirichlet process (DP) priors when there are a large number of covariates. Our method is demonstrated with electronic health records consisting of initiators of second-generation antipsychotic medications, which are known to increase the risk of diabetes. We use our model to predict laboratory values indicative of diabetes for each individual and assess incidence of suspected diabetes from the predicted dataset. Our model also serves as a functional clustering algorithm in which subjects are clustered into groups with similar longitudinal trajectories of the outcome over time.

An MCMC Approach to Empirical Bayes Inference and Bayesian Sensitivity Analysis via Empirical Processes 
Hani Doss (University of Florida)
We consider situations in Bayesian analysis where the prior pi_h on the parameter theta is indexed by a continuous hyperparameter h, and we deal with two related problems. The first problem is as follows. Let m(h) be the marginal likelihood of the data (this is the likelihood of the data with theta integrated out with respect to the prior pi_h). The problem is to construct a confidence interval (or region if h is multidimensional) for argmax_h m(h), the value of h for which the marginal likelihood of the data is largest. This value of h is, by definition, the empirical Bayes choice of h. If for each h, \hat{m}(h) is an estimate of m(h) (typically obtained via MCMC), then we may estimate argmax_h m(h) via argmax_h \hat{m}(h). The second problem is as follows. Suppose we fix a function g of theta. Let I(h) be the posterior expectation of g(theta) when the hyperparameter of the prior is h. This is also typically estimated via MCMC. The problem is to construct confidence bands for I(h) that are valid simultaneously for all h. The first problem is in some sense a model selection problem, and the second is a form of Bayesian sensitivity analysis. The two problems are actually closely related in that to solve either of them we need uniformity (in h) of the convergence of the estimates. We show how tools from various parts of probability theory can be used to deal with these two problems. The methodology we develop applies very generally. As an illustration, we consider Latent Dirichlet Allocation, which is heavily used in topic modelling. The two-dimensional hyperparameter that governs this model has a large impact on inference. We show how our methodology can be used to select it. We also show how to form globally valid "two-dimensional confidence bands" for certain posterior probabilities of interest as h varies continuously, and give an illustration on a corpus consisting of a set of articles from Wikipedia.
Joint work with Yeonhee Park (MD Anderson Cancer Center).

Assessing publication bias in meta-analysis
Lifeng Lin (Florida State University)
Publication bias is a serious problem in systematic reviews and meta-analyses, which can jeopardize the validity and generalization of conclusions. Current approaches to dealing with publication bias can be divided into two classes: selection models and funnel-plot-based methods. Although various statistical methods are available to assess publication bias, their performance has not been systematically evaluated by comprehensive real applications. We compare several commonly used publication bias tests and evaluate their agreements based on a large collection of meta-analyses from the Cochrane Library. The agreements among most tests are found to be low or moderate. In addition, although testing for publication bias is popular, measures for quantifying publication bias are seldom studied in the literature. Such measures can be used as a characteristic of a meta-analysis, and they permit comparisons of publication biases between different meta-analyses. Egger's regression intercept is a candidate measure, but it lacks an intuitive interpretation. We introduce a new measure, the skewness of the standardized deviates, to quantify publication bias. This measure describes the asymmetry of the collected studies' distribution. Also, a new test for publication bias is derived based on the skewness. Large sample properties of the new measure are studied, and its performance is illustrated using simulations and case studies.
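
To make the idea of a skewness-based measure concrete, here is a minimal Python sketch. It is a simplification under stated assumptions and not necessarily the speaker's estimator: each study's deviate is taken to be its effect minus a fixed-effect (inverse-variance weighted) pooled mean, divided by its standard error, and skewness is the plain moment ratio m3 / m2^1.5. The function names and the toy inputs are hypothetical.

```python
def pooled_mean(effects, std_errors):
    """Fixed-effect pooled estimate: inverse-variance weighted mean (assumed choice)."""
    weights = [1.0 / (s * s) for s in std_errors]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def skewness_of_standardized_deviates(effects, std_errors):
    """Moment-based sample skewness of d_i = (y_i - pooled mean) / s_i.

    Values near 0 suggest a roughly symmetric set of studies; values far
    from 0 suggest asymmetry. Requires at least two distinct deviates.
    """
    mu = pooled_mean(effects, std_errors)
    deviates = [(y - mu) / s for y, s in zip(effects, std_errors)]
    n = len(deviates)
    mean_d = sum(deviates) / n
    m2 = sum((d - mean_d) ** 2 for d in deviates) / n  # second central moment
    m3 = sum((d - mean_d) ** 3 for d in deviates) / n  # third central moment
    return m3 / m2 ** 1.5


# Toy inputs (made up for illustration, not real meta-analysis data).
symmetric = skewness_of_standardized_deviates([-1.0, 0.0, 1.0], [1.0, 1.0, 1.0])
right_skewed = skewness_of_standardized_deviates([0.0, 0.0, 0.0, 3.0], [1.0, 1.0, 1.0, 1.0])
```

In this sketch a symmetric set of effects gives a skewness of zero, while a single large positive outlier gives a positive value; how such values translate into a formal test is the subject of the talk and is not reproduced here.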

Entity Resolution with Societal Impacts in Statistical Machine Learning 
Rebecca Steorts (Duke University)
Very often information about social entities is scattered across multiple databases. Combining that information into one database can result in enormous benefits for analysis, resulting in richer and more reliable conclusions. In practical applications, however, analysts cannot simply link records across databases based on unique identifiers, such as social security numbers, either because they are not a part of some databases or are not available due to privacy concerns. Analysts need to use methods from statistical and computational science known as entity resolution (record linkage or de-duplication) to proceed with analysis. Entity resolution is not only a crucial task for social science and industrial applications, but also a challenging statistical and computational problem in itself. In this talk, we describe the past and present challenges with entity resolution, with applications to the Syrian conflict as well as official statistics and the food and music industries. This large collaboration touches on research that is crucial to problems with societal impacts that are at the forefront of both national and international news.