Jim Hobert, University of Florida

Perfect Sampling: Basic Ideas and a Recent Result

A perfect sampler is an algorithm that allows one to use a Markov chain with stationary density $\pi$ to make exact (or perfect) draws from $\pi$. A simple, three-state Markov chain is used to explain the perfect sampling algorithm called coupling from the past (CFTP) (Propp & Wilson 1996). Extending CFTP to Markov chains with uncountable state spaces has proved difficult. One success story is Murdoch & Green's (1998) multigamma coupler, which is based on the fact that a minorization condition can be used to represent the Markov transition density as a two-component mixture. The multigamma coupler is illustrated using a Markov chain from Diaconis & Freedman (1999). Our main result is a representation of $\pi$ as an infinite mixture that is based on a minorization condition. When the minorization condition is of a certain type, it is possible to make exact draws from this mixture and hence from $\pi$. The resulting algorithm turns out to be equivalent to the multigamma coupler. (This is joint work with Christian Robert, Université Paris-Dauphine.)
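
The following is a minimal sketch of CFTP for a small finite-state chain, in the spirit of the three-state example mentioned above. The transition matrix P and the inverse-CDF update rule are hypothetical illustrations, not the chain or coupling used in the talk.

```python
import numpy as np

# Hypothetical three-state transition matrix (placeholder, not the talk's chain).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

def update(state, u):
    """Deterministic update: move from `state` using the shared uniform u."""
    return int(np.searchsorted(np.cumsum(P[state]), u))

def cftp(rng):
    """Start a chain in every state at time -T with shared randomness, doubling T
    until all chains coalesce; the common value at time 0 is an exact draw from
    the stationary distribution."""
    T = 1
    us = []                                  # us[k] drives the step from time -(k+1) to -k
    while True:
        while len(us) < T:
            us.append(rng.uniform())
        states = list(range(P.shape[0]))     # one chain in every state at time -T
        for t in range(T - 1, -1, -1):       # reuse the SAME randomness at each time
            states = [update(s, us[t]) for s in states]
        if len(set(states)) == 1:
            return states[0]
        T *= 2

rng = np.random.default_rng(0)
draws = [cftp(rng) for _ in range(5000)]
print(np.bincount(draws) / len(draws))       # close to the stationary distribution of P
```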

Ron Marks, University of Florida

Web-Based Clinical Research

There are a number of well-known issues in the conduct of medical research that lead to inefficiencies and loss of accuracy in clinical research data. This talk will summarize some of the current problems in clinical trials research and proposed solutions offered through web-based technology. A UF-developed web-based clinical research system will be described. The UF system is being used to conduct the world's largest and longest-running web-based clinical trial to date. INVEST is a 22,599-subject Phase IV hypertension clinical trial conducted at 870 primary care medical sites in nine countries by the Divisions of Cardiology and Biostatistics. This talk will illustrate how clinical research will be transformed by the process changes made possible by web-based technology.

Alex Trindade, University of Florida

Assessing the Performance of Burg Algorithms in Fitting Multivariate Subset Autoregressions

We present three new algorithms that extend Burg's original method for the recursive fitting of univariate autoregressions on a full set of lags to multivariate modeling on a subset of lags. The algorithms differ only in the manner in which the reflection coefficients are computed, one such adjustment leading to the well-known Yule-Walker method. Using simulated data, we show that two of these algorithms tend to be superior performers for their respective fitted models, averaging higher likelihoods with smaller variability across a large number of realizations. To better evaluate this difference in performance, we compare saddlepoint approximations to the distributions of the Yule-Walker and Burg estimators in a simple univariate setting. In this context, each estimator can be written as a ratio of quadratic forms in normal random variables. The smaller bias and variance seen in the distribution of the Burg estimator, particularly at low sample sizes and when the autoregressive coefficient is close to $\pm 1$, agree with its tendency to give higher likelihoods. We speculate that a possible reason for its enhanced performance might be connected to the discovery that the Burg estimator of the white noise variance of the process coincides, in many cases, with that obtained by maximizing the likelihood.
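
As a minimal sketch of the simple univariate setting mentioned above, the code below contrasts the Yule-Walker and Burg estimators of the coefficient of an AR(1) model near the unit root; the function names, sample size, and simulation settings are illustrative assumptions, not the multivariate subset algorithms or the saddlepoint comparison from the talk.

```python
import numpy as np

def simulate_ar1(phi, n, rng, sigma=1.0):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(scale=sigma)
    return x

def yule_walker_ar1(x):
    x = x - x.mean()
    return np.sum(x[1:] * x[:-1]) / np.sum(x ** 2)    # lag-1 sample autocorrelation

def burg_ar1(x):
    x = x - x.mean()
    f, b = x[1:], x[:-1]                              # forward and backward residuals
    return 2.0 * np.sum(f * b) / np.sum(f ** 2 + b ** 2)

rng = np.random.default_rng(1)
phi, n = 0.95, 30                                     # near-unit-root, small sample
yw = [yule_walker_ar1(simulate_ar1(phi, n, rng)) for _ in range(2000)]
bg = [burg_ar1(simulate_ar1(phi, n, rng)) for _ in range(2000)]
print("Yule-Walker: mean %.3f, sd %.3f" % (np.mean(yw), np.std(yw)))
print("Burg:        mean %.3f, sd %.3f" % (np.mean(bg), np.std(bg)))
```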

Alan Agresti, University of Florida

Dealing with Discreteness: Making `Exact' Confidence Intervals for Proportions, Differences of Proportions, and Odds Ratios More Exact

`Exact' methods for categorical data are exact in terms of using probability distributions that do not depend on unknown parameters. However, they are conservative inferentially, having actual error probabilities for tests and confidence intervals that are bounded above by the nominal level. We examine the conservatism for interval estimation and suggest ways of reducing it. We illustrate for several parameters of interest with contingency tables, including the binomial parameter, the difference between two binomial parameters, the odds ratio and relative risk in a $2\times 2$ table, and the common odds ratio for several such tables. Less conservative behavior results from devices such as (1) inverting tests using statistics that are "less discrete," (2) inverting a single two-sided test rather than two separate one-sided tests of at least half the nominal level each, (3) using unconditional rather than conditional methods (where appropriate) and (4) inverting tests using alternative P-values. We also summarize simple ways of adjusting standard large-sample methods to improve dramatically their small-sample performance.
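
For the binomial parameter, the sketch below contrasts the standard Clopper-Pearson `exact' interval (inverting two one-sided tests at level alpha/2 each) with a mid-P interval, one example of the "alternative P-value" devices mentioned above; the data values are hypothetical and this is not the full set of adjustments from the talk.

```python
import numpy as np
from scipy.stats import beta, binom

def clopper_pearson(y, n, alpha=0.05):
    # Exact interval from the beta representation of binomial tail probabilities.
    lo = beta.ppf(alpha / 2, y, n - y + 1) if y > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, y + 1, n - y) if y < n else 1.0
    return lo, hi

def midp_interval(y, n, alpha=0.05):
    grid = np.linspace(1e-6, 1 - 1e-6, 10 ** 5)
    # mid-P one-sided P-values: count only half the probability of the observed value.
    upper_p = binom.sf(y, n, grid) + 0.5 * binom.pmf(y, n, grid)    # P(Y > y) + .5 P(Y = y)
    lower_p = binom.cdf(y - 1, n, grid) + 0.5 * binom.pmf(y, n, grid)
    keep = (upper_p > alpha / 2) & (lower_p > alpha / 2)            # invert both mid-P tests
    return grid[keep].min(), grid[keep].max()

y, n = 5, 25
print("Clopper-Pearson:", clopper_pearson(y, n))
print("mid-P          :", midp_interval(y, n))
```

The mid-P interval is typically shorter, illustrating how a less conservative P-value reduces the conservatism of the exact interval.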

Andre Khuri, University of Florida

Comparison of Designs for Generalized Linear Models Using Quantile Dispersion Graphs

Designs for generalized linear models depend on the unknown parameters of the fitted model. The use of any design optimality criterion would therefore require some prior knowledge of the parameters. In this talk, a graphical technique is proposed for comparing and evaluating designs for a logistic regression model using the so-called quantile dispersion graphs of the scaled mean squared error of prediction. These plots depict the dependence of a given design on the model's parameters. They also provide a comprehensive assessment of the overall prediction capability of the design within the region of interest. Some examples will be presented to illustrate the proposed methodology.
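
The sketch below is a simplified illustration of the underlying idea: because a logistic-regression criterion depends on the unknown parameters, a prediction-variance function is evaluated over the region of interest for several guessed parameter values and summarized by its quantiles. The design, region, and parameter grid are hypothetical, and the scaled variance of the estimated linear predictor stands in for the scaled mean squared error of prediction used in the talk.

```python
import numpy as np

design = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])        # candidate one-factor design
region = np.linspace(-1, 1, 201)                      # region of interest R
quant_levels = np.linspace(0.05, 0.95, 19)

def scaled_pred_var(design, region, beta):
    F = np.column_stack([np.ones_like(design), design])            # model matrix (1, x)
    p = 1.0 / (1.0 + np.exp(-(F @ beta)))
    W = np.diag(p * (1 - p))                                       # GLM weights
    info_inv = np.linalg.inv(F.T @ W @ F)
    f = np.column_stack([np.ones_like(region), region])
    return len(design) * np.einsum('ij,jk,ik->i', f, info_inv, f)  # n * f(x)' I^{-1} f(x)

for beta in [np.array([0.0, 1.0]), np.array([0.0, 3.0]), np.array([1.0, 3.0])]:
    q = np.quantile(scaled_pred_var(design, region, beta), quant_levels)
    print(beta, "min/median/max quantile:", q[0].round(2), q[9].round(2), q[-1].round(2))
```

Plotting the minimum and maximum of these quantiles over a grid of parameter guesses gives the dispersion graphs that reveal how sensitive the design is to the unknown parameters.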

Andrea Foulkes, Harvard School of Public Health

Characterizing Classes of Antiretroviral Therapies by Genotype

The research presented in this talk is intended to establish a framework for understanding the complex relationships between HIV-1 genotypic markers of resistance to antiretroviral drugs and clinical measures of disease progression. Antiretroviral therapies have demonstrated a powerful ability to lower the level of HIV-1 in plasma and delay the onset of clinical disease and death. Unfortunately, resistance to these therapies is often rapidly acquired, reducing or eliminating their usefulness. Making decisions about the next best treatment for a patient will inevitably depend on the specific genotypic and phenotypic characteristics of the infecting viral population. A new classification scheme based on the probabilities of how new patients will respond to therapy given the available data is proposed as a method for distinguishing among groups of viral sequences. This approach draws from existing cluster analysis, discriminant analysis and recursive partitioning techniques and requires a model relating genotypic characteristics to phenotypic response. A dataset of 2746 sequences and the corresponding Indinavir and Nelfinavir $IC_{50}$s are described and used for illustrative purposes.

Patches Johnson, University of Florida

Nonlinear Path Models with Continuous or Dichotomous Variables

Path models are useful for describing inter-relationships among causally ordered random variables. Because the sequence of variables is assumed to be causally ordered, each variable can have a direct effect on any subsequent variable in the chain, an indirect effect through its influence on intermediate variables within the causal chain, or both. Classical approaches to analyzing path models allow only linear equations with continuous variables. In this work, the traditional methodology for studying and analyzing linear path models with continuous variables is extended. Methodology to analyze path models with nonlinear relationships is developed, as well as methodology for models containing dichotomous variables. By extending classical methodology, we develop a ``Calculus of Effects'' which is applicable to a broader class of models than the traditional ``Calculus of Coefficients''. An application to path models in the field of maternal and child health is included.
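
The sketch below illustrates the classical linear case that the talk generalizes: in a path model X -> M -> Y with a direct X -> Y path, the indirect effect is the product of the coefficients along the chain and the total effect is direct plus indirect. The coefficients and simulated data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
a, b, c = 0.8, 0.5, 0.3                       # X->M, M->Y, and direct X->Y coefficients
X = rng.normal(size=n)
M = a * X + rng.normal(size=n)
Y = c * X + b * M + rng.normal(size=n)

# Fit the two structural equations by least squares.
a_hat = np.linalg.lstsq(np.column_stack([np.ones(n), X]), M, rcond=None)[0][1]
c_hat, b_hat = np.linalg.lstsq(np.column_stack([np.ones(n), X, M]), Y, rcond=None)[0][1:]

direct = c_hat
indirect = a_hat * b_hat                      # effect transmitted through M
print("direct %.3f, indirect %.3f, total %.3f (true total %.3f)"
      % (direct, indirect, direct + indirect, c + a * b))
```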

Morgan Wang, University of Central Florida

Data Mining Techniques for Mortality at Advanced Age

This paper addresses issues and techniques for studying mortality at advanced ages using data mining, a new technology on the horizon with great actuarial potential. Data mining is an information discovery process that includes data acquisition, data integration, data exploration, model building, and model validation. Expert opinion and information discovery techniques are integrated to guide each step of this process. Seven factors were considered in this study, and their influence on the advanced-age mortality distribution was identified with exploratory data analysis and a decision tree algorithm. Models for their effects on advanced-age mortality were built with logistic regression. These models are then used to project the advanced-age mortality distribution.
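
The sketch below shows the two modeling steps described above, a decision tree to screen factors influencing a binary mortality outcome followed by a logistic regression model, on synthetic data with hypothetical factors (age, sex, smoker); it does not reproduce the study's data, factors, or results.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 10000
age = rng.uniform(80, 105, n)
sex = rng.integers(0, 2, n)
smoker = rng.integers(0, 2, n)
logit = -12 + 0.12 * age + 0.3 * smoker - 0.2 * sex
death = rng.uniform(size=n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([age, sex, smoker])

# Step 1: exploratory screening with a shallow decision tree.
tree = DecisionTreeClassifier(max_depth=3).fit(X, death)
print("tree feature importances (age, sex, smoker):", tree.feature_importances_.round(3))

# Step 2: logistic regression on the screened factors.
logreg = LogisticRegression(max_iter=1000).fit(X, death)
print("logistic coefficients (age, sex, smoker):", logreg.coef_.round(3))
```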

Rongling Wu, University of Florida

Functional Mapping of Quantitative Trait Loci Affecting Growth Trajectories

Growth trajectories, morphological shapes, and norms of reaction are regarded as infinite-dimensional characters in which the phenotype of an individual is described by a function, rather than by a finite set of measurements. We present an innovative statistical strategy for mapping quantitative trait loci (QTL) underlying infinite-dimensional characters. This strategy, termed functional mapping, integrates mathematical relationships of different traits or variables within the statistical mapping framework. The logistic mapping presented in this talk can be viewed as an example of functional mapping. Logistic mapping is based on the universal biological law that growth for each and every living organism follows a logistic or S-shaped curve with time. A maximum likelihood approach based on a logistic-mixture model, implemented with the EM algorithm, is developed to provide the estimates of QTL positions, QTL effects and other model parameters responsible for growth trajectories. Although logistic mapping is statistically simple, it displays tremendous potential to increase the power of QTL detection, the precision of parameter estimation and the resolution of QTL localization due to the pleiotropic effect of a QTL on growth and/or residual correlations of growth at different ages. More importantly, logistic mapping allows for the testing of numerous biologically important hypotheses concerning the genetic basis of quantitative variation, thus gaining an insight into the critical role of development in shaping plant and animal evolution and domestication. The power of logistic mapping is demonstrated by an example of a forest tree, in which a number of QTL affecting stem growth processes are detected. The advantages of functional mapping are discussed.
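
The sketch below illustrates the logistic-mixture structure underlying this approach: each QTL genotype j has its own logistic growth curve g_j(t) = a_j / (1 + b_j exp(-r_j t)), and an individual's trajectory is modelled as one component of a normal mixture around these curves. The parameter values, mixing proportions, and residual covariance are hypothetical placeholders, not estimates from the talk.

```python
import numpy as np
from scipy.stats import multivariate_normal

ages = np.arange(1, 9)                                  # measurement ages

def logistic_curve(t, a, b, r):
    return a / (1.0 + b * np.exp(-r * t))

def mixture_loglik(Y, curve_params, weights, sigma2=1.0):
    """Log-likelihood of trajectories Y (n x T) under a genotype mixture."""
    cov = sigma2 * np.eye(len(ages))                    # simple residual covariance
    dens = np.zeros((Y.shape[0], len(curve_params)))
    for j, (a, b, r) in enumerate(curve_params):
        mean = logistic_curve(ages, a, b, r)
        dens[:, j] = multivariate_normal.pdf(Y, mean=mean, cov=cov)
    return np.sum(np.log(dens @ np.asarray(weights)))

# Two hypothetical QTL genotypes with different growth curves.
params = [(30.0, 20.0, 0.8), (22.0, 20.0, 0.8)]
rng = np.random.default_rng(4)
Y = np.vstack([logistic_curve(ages, *params[rng.integers(0, 2)]) +
               rng.normal(scale=1.0, size=len(ages)) for _ in range(50)])
print("mixture log-likelihood:", mixture_loglik(Y, params, weights=[0.5, 0.5]))
```

In practice this likelihood is maximized with the EM algorithm at each putative QTL position, with the mixing proportions determined by the marker-conditional genotype probabilities.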

Randy Carter, University of Florida

A Latent Class Model Analysis of Case Ascertainment

The Florida Legislature mandated and funded development of the Florida Birth Defects Registry (FBDR) in 1997. A consortium of state universities planned and built the registry in 1998-99, under contract with the State Department of Health, Bureau of Environmental Epidemiology. The purpose of the registry was to facilitate detection, investigation, and prevention of birth defects in Florida. The registry was based on retrospective surveillance of four statewide databases: Birth Vital Statistics (BVS), the birth hospitalization discharge database of the Agency for Health Care Administration (AHCA), the Children's Medical Services (CMS) Early Intervention Program (EIP) data system, and the CMS Regionalized Perinatal Intensive Care Centers' (RPICC) data system. Each of these source datasets was searched for diagnostic and/or procedure codes that identified children with birth defects. Cases ascertained in this way were accumulated to form the FBDR. Two goals of the consortium were to investigate the accuracy of case ascertainment by each source, and overall, and to estimate the prevalence of birth defects in Florida. Unfortunately, each ascertainment was subject to error and the true status of each child was unknown, thereby complicating the estimation of prevalence and of the accuracy parameters: sensitivity, specificity, positive predictive value, and negative predictive value. In the absence of a perfect indicator of birth defects, latent class model analysis was used to estimate these parameters based on inter-agreement among the imperfect indicators obtained from each source database. The results showed that only the AHCA dataset had high sensitivity and specificity, 0.82 and 0.96 respectively. BVS, EIP, and RPICC had high specificities, 0.99, 0.99, and 0.99, but low sensitivities, 0.30, 0.15, and 0.16 respectively. Overall, the FBDR surveillance system had an estimated 91% sensitivity and 96% specificity for ascertaining cases correctly. These estimates were validated in a separate small-scale study in which the true status of each child was known. The estimated prevalence of birth defects was 2%. Based on the results of the validation study, this appeared to be an underestimate of true prevalence. This talk will focus on the statistical issues faced in this and related situations where several imperfect diagnostic tests are used in the absence of a true indicator of disease status. Keywords are: Diagnostic tests, errors in variables, latent class model analysis, conditional independence, estimated generalized non-linear least squares estimation, Bayes' rule.
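
The sketch below shows a two-class latent class model for several imperfect binary case indicators (as from the four source databases), fitted by EM under conditional independence. The data are simulated with hypothetical sensitivities and specificities; this is only an illustration of the estimation idea, not the registry analysis itself.

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 20000, 4
prev_true = 0.02
sens_true = np.array([0.80, 0.30, 0.15, 0.16])
spec_true = np.array([0.96, 0.99, 0.99, 0.99])
D = rng.uniform(size=n) < prev_true                          # latent true case status
Y = np.where(D[:, None], rng.uniform(size=(n, K)) < sens_true,
             rng.uniform(size=(n, K)) < 1 - spec_true).astype(float)

prev, sens, spec = 0.1, np.full(K, 0.7), np.full(K, 0.9)     # starting values
for _ in range(500):                                         # EM iterations
    # E-step: posterior probability each child is a true case.
    p1 = prev * np.prod(sens ** Y * (1 - sens) ** (1 - Y), axis=1)
    p0 = (1 - prev) * np.prod((1 - spec) ** Y * spec ** (1 - Y), axis=1)
    w = p1 / (p1 + p0)
    # M-step: update prevalence, sensitivities, specificities.
    prev = w.mean()
    sens = (w[:, None] * Y).sum(0) / w.sum()
    spec = ((1 - w)[:, None] * (1 - Y)).sum(0) / (1 - w).sum()

print("estimated prevalence: %.3f" % prev)
print("sensitivities:", sens.round(2), "specificities:", spec.round(2))
```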

Alex Kugashev, CyberGnostics, Inc.

CyberStats, an Online Introductory Statistics Course

Alex is the developer of CyberStats and will use it to demonstrate Web-based pedagogy in statistics and course management. CyberStats contains over 500 active simulations and calculations and hundreds of immediate-feedback practice items. NSF-supported CyberStats 2.0 reflects extensive and successful classroom use and is equally applicable to on-campus and to distance learning courses.

Elias Moreno, University of Granada (Spain)

Intrinsic Priors in Problems with a Change-Point

The Bayesian formulation of the changepoint problem involves priors for discrete and continuous parameters. When the prior information is vague, a default Bayesian analysis might be useful, but it presents some difficulties that can be resolved with the use of intrinsic priors. In this paper, a default Bayesian model selection approach is applied to the problem of making inferences about the point in a sequence of random variables at which the underlying distribution changes. Inferences are based on the posterior probabilities of the possible changepoints. However, these probabilities depend on Bayes factors, and improper default priors for the parameters leave the Bayes factors defined only up to a multiplicative constant. To overcome this difficulty, intrinsic priors arising from the conventional priors are considered. With intrinsic priors, the posterior distribution of the changepoint and the size of the change can be computed. The results are applied to some common sampling distributions, and illustrations on some much-studied datasets are given.
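
The sketch below shows the general framework for a shift in a normal mean with known variance: posterior changepoint probabilities are proportional to products of marginal likelihoods for the two segments. For simplicity it uses arbitrary proper conjugate N(0, tau^2) priors and a uniform prior on the changepoint; the point of the talk is precisely to replace such arbitrary proper priors with intrinsic priors derived from the conventional improper ones, which is not reproduced here.

```python
import numpy as np

def log_marginal(y, tau2=10.0):
    """log m(y) for y_i ~ N(mu, 1) with mu ~ N(0, tau2)."""
    m, s, ss = len(y), y.sum(), np.sum(y ** 2)
    return (-0.5 * m * np.log(2 * np.pi) - 0.5 * np.log(1 + m * tau2)
            - 0.5 * (ss - tau2 * s ** 2 / (1 + m * tau2)))

def changepoint_posterior(x):
    n = len(x)
    logm = np.array([log_marginal(x[:k]) + log_marginal(x[k:]) for k in range(1, n)])
    w = np.exp(logm - logm.max())                 # uniform prior over k = 1, ..., n-1
    return w / w.sum()

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(0, 1, 30), rng.normal(1.5, 1, 30)])   # change after obs 30
post = changepoint_posterior(x)
print("posterior mode at k =", np.argmax(post) + 1, "with probability %.2f" % post.max())
```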

Tatsuya Kubokawa, University of Tokyo

Empirical and Generalized Bayes Ridge Regression Estimators with Minimaxity and Stability

I will talk about the classical problem of estimating the regression parameters in a multiple linear regression model when multicollinearity is present. The least squares estimator (LSE) is unstable, and one candidate stabilized procedure is the ridge regression estimator with parameter k, which, however, is not minimax. The choice of k is also arbitrary, so k may be estimated from the data. However, it is known that such adaptive ridge regression estimators do not satisfy the conditions for minimaxity under the squared loss in multicollinearity cases (Casella, 1980). In this talk, I will employ a weighted squared loss suggested by Strawderman (1978) instead of the usual squared loss, and derive conditions for adaptive ridge regression estimators to be better than the LSE, namely, minimax. In particular, the empirical Bayes estimator, which estimates the parameter k by the root of the marginal likelihood equation, is shown to satisfy minimaxity and stability in multicollinearity cases and to have very good risk performance even under the usual squared loss. The usefulness of the empirical Bayes estimator will also be illustrated through an example. As another stable candidate, I will present the generalized Bayes estimator with respect to a natural prior and give conditions for its minimaxity. Hence admissible, minimax, and stabilized estimators can be provided.
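
The sketch below illustrates the adaptive ridge idea: the ridge estimator beta(k) = (X'X + kI)^{-1} X'y stabilizes the LSE under multicollinearity, with k estimated from the data. For illustration k is set by the classical Hoerl-Kennard-Baldwin formula k = p s^2 / ||beta_LS||^2; the talk's empirical Bayes estimator instead takes k as the root of a marginal likelihood equation, which is not reproduced here, and the design and coefficients below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 4
Z = rng.normal(size=(n, 1))
X = Z + 0.05 * rng.normal(size=(n, p))                # nearly collinear columns
beta = np.array([1.0, 0.5, -0.5, 0.25])
y = X @ beta + rng.normal(size=n)

XtX, Xty = X.T @ X, X.T @ y
beta_ls = np.linalg.solve(XtX, Xty)                   # unstable least squares estimate
s2 = np.sum((y - X @ beta_ls) ** 2) / (n - p)
k_hat = p * s2 / np.sum(beta_ls ** 2)                 # data-driven ridge parameter
beta_ridge = np.linalg.solve(XtX + k_hat * np.eye(p), Xty)

print("estimated k: %.3f" % k_hat)
print("LSE:  ", beta_ls.round(2))
print("ridge:", beta_ridge.round(2))
```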

Haoyi Chen, University of Florida

A Piecewise Gompertz Model for Solving the Cure Rate Problem

In cancer research, only some of the patients are disease-free after certain treatments. It is of interest to compare treatment efficacy in terms of long-term survival rates. One commonly used approach to analyzing this type of data is to compare the Kaplan-Meier estimates of the cure rate. Another approach is to apply the mixture model proposed by Farewell. However, the Kaplan-Meier estimates are unstable toward the end point, while the mixture model is computationally too complex. To overcome these difficulties, we propose a test based on a piecewise Gompertz model to compare drug efficacy in terms of the cure rate. The proposed test also accommodates the situation where patients display different hazard patterns during different treatment stages. In this work, we establish the strict concavity of the log-likelihood function and the existence, consistency, and asymptotic normality of the maximum likelihood estimates of the parameters. In addition, our Monte Carlo simulation study shows that the proposed test is more computationally feasible and more powerful than the test based on Farewell's mixture model. An example is given to show the utility of the proposed test, and a goodness-of-fit test for the proposed model is also discussed.
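
The sketch below shows why a Gompertz model accommodates a cure rate: with hazard h(t) = theta exp(gamma t) and gamma < 0, the survival function S(t) = exp(-(theta/gamma)(exp(gamma t) - 1)) levels off at the cure fraction exp(theta/gamma). The parameter values and the toy censored data are hypothetical, and the log-likelihood is for a single Gompertz piece; the talk's model joins several such pieces over treatment stages, which is not reproduced here.

```python
import numpy as np

def gompertz_surv(t, theta, gamma):
    return np.exp(-(theta / gamma) * (np.exp(gamma * t) - 1.0))

def gompertz_loglik(theta, gamma, times, events):
    """Right-censored log-likelihood: events = 1 for observed failures."""
    log_hazard = np.log(theta) + gamma * times
    log_surv = -(theta / gamma) * (np.exp(gamma * times) - 1.0)
    return np.sum(events * log_hazard + log_surv)

theta, gamma = 0.3, -0.5
print("cure fraction: %.3f" % np.exp(theta / gamma))
print("S(1), S(5), S(20):", [round(gompertz_surv(t, theta, gamma), 3) for t in (1, 5, 20)])

# Toy censored data (hypothetical) and its log-likelihood under these parameters.
times = np.array([0.5, 1.2, 3.0, 8.0, 10.0])
events = np.array([1, 1, 1, 0, 0])             # last two are censored (possibly cured)
print("log-likelihood: %.3f" % gompertz_loglik(theta, gamma, times, events))
```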

Deng-Shan Shiau, University of Florida

Signal Identification and Forecasting in Nonstationary Time Series Data

Traditional time series analysis focuses on finding the optimal model to fit the data in a learning period and using this model to make predictions in a future period. However, many practical time series, such as earthquake series or epileptic brain electroencephalogram (EEG) series, may contain only a few meaningful, or predictable, patterns, which can be used to forecast the occurrence of specific events that follow similar patterns. In these cases, a traditional time series model such as the autoregressive ($AR$) model usually gives poor predictions, since the model is constructed to fit the entire learning period while the pattern useful for prediction may occur during only a small portion of that period. The purpose of this research is to provide a statistical algorithm to identify the most predictable pattern in a given time series and to apply this pattern to make predictions. In this dissertation, we propose the Pattern Match Signal Identification (PMSI) algorithm to identify the most predictable pattern in a given time series. In this algorithm, the concept of a pattern match is used instead of the commonly used value match criterion. The most predictable pattern is then identified by the significance of a test statistic. The feasibility of this algorithm is proved analytically and is confirmed by simulation studies. An epileptic brain EEG time series and the well-known Wolf monthly sunspot time series are used as applications of this algorithm. A forecasting method based on the pattern identified by the PMSI algorithm is introduced. Multivariate regression models are applied to subsequences in the learning period with the most predictable patterns, and these regression equations are used to make predictions in a future period. The performance of this method is compared with that of autoregressive ($AR$) models. The two applications (EEG and sunspot time series) show that the proposed forecasting method gives significantly better predictions than $AR$ models, especially for multi-step-ahead predictions.
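
The sketch below is a simplified illustration of the pattern-match idea: a template subsequence is slid over a series and each window is scored by its correlation with the template (a shape match) rather than by squared differences of raw values (a value match). The embedded pattern and scoring function are illustrative only and do not implement the PMSI test statistic developed in the dissertation.

```python
import numpy as np

def pattern_match_scores(series, template):
    w = len(template)
    t = (template - template.mean()) / template.std()
    scores = np.empty(len(series) - w + 1)
    for i in range(len(scores)):
        x = series[i:i + w]
        scores[i] = np.mean(((x - x.mean()) / x.std()) * t)   # correlation with template
    return scores

rng = np.random.default_rng(8)
pattern = np.sin(np.linspace(0, 2 * np.pi, 40))
series = rng.normal(scale=0.5, size=500)
series[200:240] += 3 * pattern                     # embed one occurrence of the pattern
scores = pattern_match_scores(series, pattern)
print("best match starts at t =", int(np.argmax(scores)))   # should be near 200
```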

Stephen Senn, University College London

Two Cheers for P-Values

P-values are a practical success but a critical failure. Scientists the world over use them, but scarcely a statistician can be found to defend them. Bayesians in particular find them ridiculous, but even the modern frequentist has little time for them. The invention of P-values is often mistakenly ascribed to R. A. Fisher, but in fact they are far older, dating back at least as far as Daniel Bernoulli's significance test of 1734 regarding the inclinations of the planetary orbits. The Bayesian Karl Pearson also used them in his famous paper of 1900 on the chi-square goodness-of-fit test, some 25 years before the publication of Fisher's influential Statistical Methods for Research Workers. Recently there has been a growing campaign against their use in medical statistics. The journal Epidemiology has even banned them. Bayesian critics have drawn attention to the fact that a just-significant result has a moderate replication probability, whilst failing to note that this is a desirable and necessary property shared by Bayesian statements. P-values have even been attacked in the popular press. In this talk I shall consider whether there are any grounds for continuing to use this ubiquitous but despised device.