Bani Mallick, Texas A&M University

Bayesian nonlinear regression and classification in "Large p, Small n" problems

We propose nonlinear Bayesian models for problems in which sample sizes are substantially smaller than the number of available predictor variables. Novel models for regression and classification are needed to handle these challenging problems. We consider Bayesian regression and classification methods based on reproducing kernel Hilbert spaces. We consider Gaussian and logistic likelihoods, as well as likelihoods related to Support Vector Machine (SVM) models. We develop MCMC-based computational methods and exploit them to perform posterior and predictive inference. The regression problem is motivated by calibration problems in near-infrared (NIR) spectroscopy. We consider the nonlinear regression setting in which many predictor variables arise from sampling an essentially continuous curve at equally spaced points and there may be multiple predictands. Precise classification of tumors is critical for cancer diagnosis and treatment. Diagnostic pathology has traditionally relied on macro- and microscopic histology and tumor morphology as the basis for tumor classification. In recent years, there has been a move towards the use of cDNA microarrays for tumor classification. These high-throughput assays provide relative mRNA expression measurements simultaneously for thousands of genes. We will use our nonlinear models to perform classification via different gene expression patterns.
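
To give a flavor of the reproducing kernel Hilbert space machinery underlying the talk, here is a minimal Python sketch of kernel ridge regression with a Gaussian kernel, which coincides with the posterior mean of a Gaussian-process regression under a Gaussian likelihood; the kernel, penalty, and simulated curves below are illustrative assumptions, not the speaker's model or data.

    import numpy as np

    def gaussian_kernel(X1, X2, length_scale=1.0):
        """Gaussian (RBF) kernel matrix between the rows of X1 and X2."""
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * length_scale ** 2))

    def kernel_ridge_fit(X, y, lam=0.1, length_scale=1.0):
        """Solve (K + lam I) alpha = y; with a Gaussian likelihood this is
        the Gaussian-process posterior mean in the RKHS induced by K."""
        K = gaussian_kernel(X, X, length_scale)
        return np.linalg.solve(K + lam * np.eye(len(y)), y)

    # "Large p, small n": n = 30 curves sampled at p = 200 equally spaced points.
    rng = np.random.default_rng(0)
    n, p = 30, 200
    X = rng.normal(size=(n, p)).cumsum(axis=1)        # smooth, spectrum-like curves
    y = np.sin(X[:, :10].mean(axis=1)) + 0.1 * rng.normal(size=n)

    alpha = kernel_ridge_fit(X, y, lam=0.1, length_scale=200.0)
    yhat = gaussian_kernel(X, X, 200.0) @ alpha       # fitted values
    print(np.round(yhat[:5], 2))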

Kenneth Portier, University of Florida

The 95% Upper Confidence Bound for the Mean of Lognormal Data

The lognormal distribution is the most commonly used probability density model for environmental contaminant data. In environmental exposure assessment, the average concentration of the contaminant at a site is of direct interest, but typically the 95% upper confidence bound (UCL) is used as a conservative estimate of the average concentration for subsequent risk calculations. In this talk, ten different estimators used to compute the UCL will be presented and discussed. Results from a Monte Carlo simulation, as well as comparisons using soil concentrations of arsenic from a Florida site, will be discussed.
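
As a concrete illustration, the sketch below (Python; the simulated data and the choice of estimators are assumptions, not necessarily among the ten compared in the talk) computes two of the simpler UCL estimators that commonly appear in such comparisons: a Student-t UCL for the arithmetic mean and a percentile-bootstrap UCL.

    import numpy as np
    from scipy import stats

    def t_ucl(x, conf=0.95):
        """Student-t upper confidence limit for the arithmetic mean (ignores skewness)."""
        n = len(x)
        return x.mean() + stats.t.ppf(conf, n - 1) * x.std(ddof=1) / np.sqrt(n)

    def boot_ucl(x, conf=0.95, B=5000, seed=1):
        """Percentile-bootstrap upper confidence limit for the arithmetic mean."""
        rng = np.random.default_rng(seed)
        means = rng.choice(x, size=(B, len(x)), replace=True).mean(axis=1)
        return np.quantile(means, conf)

    # Simulated lognormal "contaminant concentrations" (illustrative only).
    rng = np.random.default_rng(1)
    x = rng.lognormal(mean=1.0, sigma=1.2, size=20)
    true_mean = np.exp(1.0 + 1.2 ** 2 / 2)
    print(f"true mean {true_mean:.2f}, t-UCL {t_ucl(x):.2f}, bootstrap UCL {boot_ucl(x):.2f}")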

Clyde Schoolfield, University of Florida

Generating a random signed permutation with random reversals

Signed permutations form a group known as the hyperoctahedral group. We determine the mixing time for a certain random walk on the hyperoctahedral group that is generated by random reversals. Specifically, we show that $O(n \log n)$ steps are both necessary and sufficient for total variation distance to uniformity to become small. This research was motivated by an effort in mathematical biology to model the evolutionary changes in gene order.
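
For readers unfamiliar with the generator: a signed reversal picks a segment of the permutation, reverses its order, and flips the sign of every entry in it. The short Python sketch below simulates such steps; the uniform choice of segment endpoints is an illustrative assumption, not a claim about the exact chain analyzed in the talk.

    import random

    def random_reversal(pi):
        """One step of a random-reversal walk on signed permutations:
        choose endpoints i <= j uniformly, reverse pi[i..j], and flip its signs."""
        n = len(pi)
        i, j = sorted((random.randrange(n), random.randrange(n)))
        pi[i:j + 1] = [-x for x in reversed(pi[i:j + 1])]
        return pi

    # Start from the identity signed permutation on n = 8 elements.
    pi = list(range(1, 9))
    for _ in range(20):          # apply a few random reversals
        pi = random_reversal(pi)
    print(pi)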

Dennis Lin, Penn State University

Recent Advances in Design of Experiments

Design of experiments seeks efficient ways to collect useful information. In the past decade, we have witnessed the information technology revolution. Many classical methods in design of experiments that originated from agricultural problems may not be appropriate for the information era. This talk attempts to explore some important issues in design of experiments that deserve immediate attention in future research. Topics to be discussed include supersaturated designs, uniform designs, computer experiments, optimal foldover plans, dispersion analysis, and multi-response surface methodology. For each subject the problem will be introduced, some initial results will be presented, and future research problems will be suggested.

Amit Bhattacharyya and Bernie Dharan, GlaxoSmithKline

Statistical Work Experiences in a Major Pharmaceutical Company

This talk will highlight the different phases of pharmaceutical drug discovery and development and the potential impact a statistician can have in collaboration with other scientific colleagues. The talk will outline the statistical techniques used most regularly in pre-clinical, early-phase, and late-phase human trials. Internship opportunities for students will also be discussed.

Brett Presnell, University of Florida

The In-and-Out-of-Sample (IOS) Likelihood Ratio Test for Model Misspecification

A new test of model misspecification is proposed, based on the ratio of in-sample and out-of-sample likelihoods. The test is broadly applicable, and in simple problems approximates well-known, intuitive methods. Using jackknife influence curve approximations, it is shown that the test statistic can be viewed asymptotically as a multiplicative contrast between two estimates of the information matrix that are equal under correct model specification. This approximation is used to show that the statistic is asymptotically normally distributed, though it is suggested that p-values be computed using the parametric bootstrap. The resulting methodology is demonstrated with a variety of examples and simulations involving both discrete and continuous data. This is joint work with Dennis Boos of North Carolina State University.
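
As a rough illustration of the in-sample versus out-of-sample idea (not the exact statistic or calibration of Presnell and Boos), the Python sketch below contrasts the full-data log-likelihood with a leave-one-out log-likelihood under a fitted normal working model and calibrates the difference by a parametric bootstrap; the working model, test direction, and bootstrap size are all assumptions.

    import numpy as np
    from scipy import stats

    def ios_stat(x):
        """In-sample minus leave-one-out ("out-of-sample") normal log-likelihood."""
        mu, sd = x.mean(), x.std(ddof=0)
        logl_in = stats.norm.logpdf(x, mu, sd).sum()
        logl_out = sum(
            stats.norm.logpdf(x[i], np.delete(x, i).mean(), np.delete(x, i).std(ddof=0))
            for i in range(len(x))
        )
        return logl_in - logl_out

    def ios_pvalue(x, B=500, seed=2):
        """Parametric-bootstrap p-value under the fitted normal model."""
        rng = np.random.default_rng(seed)
        obs = ios_stat(x)
        boot = [ios_stat(rng.normal(x.mean(), x.std(ddof=0), len(x))) for _ in range(B)]
        return np.mean(np.array(boot) >= obs)

    rng = np.random.default_rng(3)
    x = rng.exponential(size=50)      # misspecified: the data are not normal
    print(ios_stat(x), ios_pvalue(x))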

Debajyoti Sinha, Medical University of South Carolina

Analysis of Recurrent Event Time Data with Dependent Termination Time

We consider modeling and Bayesian analysis for recurrent events data when the termination time for each subject may depend on the history of the recurrent events. We propose a fully specified stochastic model for the joint distribution of the recurrent events and the termination time. For this model, we provide a natural motivation, derive several novel properties, and develop a Bayesian analysis based on a Markov chain Monte Carlo algorithm. Comparisons are made to existing models and methods for recurrent events and panel count data. We demonstrate the usefulness of our new models and methodologies through the reanalysis of a dataset from a clinical trial.
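
To fix ideas (this is a generic shared-frailty formulation, not necessarily the speaker's specification), one common way to let termination depend on the recurrent-event process is to let a subject-level frailty $w_i$ scale both the recurrent-event intensity and the termination hazard,
\[
\lambda^{R}_i(t \mid w_i) = w_i \, \lambda_0(t) \exp(\beta^\top x_i), \qquad
\lambda^{D}_i(t \mid w_i) = w_i^{\alpha} \, h_0(t) \exp(\gamma^\top x_i),
\]
with $\alpha \neq 0$ inducing dependence between the two processes.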

John Nelder, Imperial College London

Extended Likelihood Inference Applied to a New Class of Models - A Two-Part Talk

Talk I
Random-effect models require an extension of Fisher likelihood. Extended likelihood (Pawitan) or, equivalently, h-likelihood (Lee & Nelder) provides a basis for likelihood inference applicable to random-effect models. The model class, called hierarchical generalized linear models (HGLMs), is derived from generalized linear models (GLMs). It supports (1) joint modelling of mean and dispersion; (2) GLM errors for the response; (3) random effects in the linear predictor for the mean, with distributions conjugate to a GLM distribution; (4) structured dispersion components depending on covariates. Fitting of fixed and random effects, given dispersion components, reduces to fitting an augmented GLM, while fitting dispersion components, given fixed and random effects, uses an adjusted profile h-likelihood and reduces to a second interlinked GLM, which generalizes REML to all the GLM distributions. A single algorithm can fit all members of the class and requires neither the multiple quadrature necessary for methods using marginal likelihood nor the prior distributions used in Bayesian methods. Model checking also generalizes from GLMs and allows visual checking of all aspects of the model. Software in the form of Genstat procedures will be used for illustrative examples.
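
For reference, in generic notation the h-likelihood for response $y$, random effects $v$, fixed effects $\beta$, and dispersion components $(\phi, \lambda)$ is
\[
h(\beta, v; \phi, \lambda) = \log f(y \mid v; \beta, \phi) + \log f(v; \lambda),
\]
and joint maximization over $(\beta, v)$ given the dispersion components corresponds to the augmented GLM fit described above.
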
Talk II
The model class of Talk I is extended to cover correlated data expressed by random effects in the model, thus allowing the fitting of spatial and temporal models with GLM errors. Correlations can be expressed by transformations of white noise, by structured covariance matrices, or by structured precision matrices. An important subclass can be expressed in terms of dispersion components only, allowing the generalization of the analysis of balanced designs with normal errors to the GLM class of distributions. Finally, the class can be extended to double HGLMs, which allow random effects in the dispersion model as well as in the mean. Analysis is still by means of interlinked extended GLMs and GLMs. This leads, among other things, to a potentially large expansion of the classes of models used in finance, the properties of which have yet to be investigated.

Ramon Littell, University of Florida

Issues in Analysis of Unbalanced Mixed Model Data

A major transition has occurred in recent years in statistical methods for the analysis of linear mixed model data, from analysis of variance (ANOVA) to likelihood-based methods. Prior to the early 1990's, most applications used ANOVA because computer software for likelihood-based methods was either not available or not easy to use. ANOVA is based on ordinary least squares computations, with adaptations for mixed models. Computer programs for such methodology were plagued with technical problems of estimability, weighting, and handling missing data. Likelihood-based methods mainly utilize a combination of residual maximum likelihood (REML) estimation of covariance parameters and generalized least squares (GLS) estimation of mean parameters. Software for REML/GLS methods became readily available early in the 1990's, but the methodology still is not universally embraced. Although many of the computational inadequacies have been overcome, conceptual problems remain. This talk will identify certain problems with ANOVA, and describe which remain and which are resolved with REML/GLS.
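
In generic notation, the likelihood-based approach works with the linear mixed model and its GLS estimator,
\[
y = X\beta + Zu + e, \qquad \mathrm{Var}(y) = V = ZGZ^\top + R, \qquad
\hat\beta = (X^\top \hat V^{-1} X)^{-1} X^\top \hat V^{-1} y,
\]
with the covariance parameters in $\hat V$ estimated by REML.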

Marie Davidian, North Carolina State University

"Semiparametric" Approaches for Inference in Joint Models for Longitudinal and Time-to-Event Data

A common objective in longitudinal studies is to characterize the relationship between a longitudinal response process and a time-to-event. Considerable recent interest has focused on so-called joint models, where models for the event time distribution (typically proportional hazards) and longitudinal data are taken to depend on a common set of latent random effects, which are usually assumed to follow a multivariate normal distribution. A natural concern is sensitivity to violation of this assumption. I will review the rationale for and development of joint models and discuss two modeling and inference approaches that require no or only mild assumptions on the random effects distribution. In this sense, the models and methods are "semiparametric." The methods will be demonstrated by application to data from an HIV clinical trial.
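
A typical parametric joint model of the kind whose normality assumption is being relaxed links a linear mixed model for the longitudinal response to a proportional hazards model through shared random effects $b_i$ (generic notation, not the exact models of the talk):
\[
Y_{ij} = X_{ij}^\top \beta + Z_{ij}^\top b_i + e_{ij}, \qquad
\lambda_i(t) = \lambda_0(t) \exp\{\gamma^\top w_i + \alpha (X_i(t)^\top \beta + Z_i(t)^\top b_i)\}, \qquad
b_i \sim N(0, D).
\]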

Marie Davidian, North Carolina State University

Challis Lecture: As Time Goes By -- An Introduction to Methods for Analysis of Longitudinal Data

Studies in which data are collected repeatedly over time on each of a number of individuals are ubiquitous in health sciences research. Over the past few decades, fundamental advances in the development of statistical methods to analyze such longitudinal data have been made. However, although these methods are widely accepted by research statisticians, they are less well known among biomedical researchers, who thus may be reluctant to embrace them. In this talk, I will discuss the types of scientific questions that may be of interest in longitudinal studies, the rationale for specialized methods of longitudinal data analysis, and the basic elements of popular longitudinal data methods and their advantages over cross-sectional and ad hoc approaches. There will be a few equations, but mostly pictures and examples drawn from collaborations in cardiology, pharmacology, HIV research, and other substantive fields.

Saurabh Ghosh, Washington University

Non-parametric Alternatives For Mapping Quantitative Trait Loci: Some Statistical Comparisons And Applications To Alcohol-related Phenotypes in COGA

Unlike qualitative or binary traits, which can be characterized completely by allele frequencies and genotypic penetrances, quantitative traits require an additional level of modeling: the probability distribution of the underlying trait. Hence, likelihood-based methods like variance components, which require assumptions such as multivariate normality of trait values within a family, may yield misleading linkage inferences when the underlying model assumptions are violated. The Haseman-Elston regression method (1972) and its extensions do not assume any specific probability distribution for the trait values, but assume a linear relationship between the squared sib-pair trait differences (or mean-corrected cross products of sib-pair trait values) and the estimated identity-by-descent scores at a marker locus. Since it is often difficult to test the validity of these assumptions, it is of interest to explore non-parametric alternatives. Ghosh and Majumder (2000) have therefore proposed that it may be strategically more judicious to estimate empirically the nature of dependence between the two above-mentioned variables using non-parametric diagnostics like rank correlation or kernel-smoothing regression. In this study, we adapt our earlier methodologies to multipoint mapping and compare their performance to the linear regression procedures of Elston et al. (2000). We find that while the non-parametric regression method is marginally less powerful than the linear regression methods in the absence of dominance, it performs increasingly better as dominance increases. The non-parametric method also outperforms the linear regression procedures with increasing deviation of the distribution of trait values from normality. We have used the non-parametric regression method to analyze data on Beta 2 EEG waves and externalizing scores collected in the Collaborative Study on the Genetics of Alcoholism (COGA) project. We have obtained statistically significant signals of linkage on Chromosomes 1, 4, 5 and 15. We have also investigated the presence of epistatic interactions between regions exhibiting significant linkage. Evidence of epistasis was found between regions on Chromosomes 1 and 4 and those on Chromosome 15.
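
To make the contrast concrete, the sketch below (Python, with simulated sib-pair data as an illustrative assumption) fits the classical Haseman-Elston linear regression of squared sib-pair trait differences on estimated IBD scores alongside a kernel-smoothing (Nadaraya-Watson) regression of the kind the non-parametric approach relies on.

    import numpy as np

    def nadaraya_watson(x, y, grid, bandwidth=0.15):
        """Kernel-smoothing (Nadaraya-Watson) regression estimate evaluated on 'grid'."""
        w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2)
        return (w * y).sum(axis=1) / w.sum(axis=1)

    # Simulated sib pairs: squared trait difference decreases with IBD sharing under linkage.
    rng = np.random.default_rng(4)
    m = 400
    ibd_true = rng.choice([0.0, 0.5, 1.0], size=m)
    ibd_hat = np.clip(ibd_true + rng.normal(0.0, 0.1, size=m), 0.0, 1.0)   # noisy IBD estimates
    sq_diff = 2.0 - 1.2 * ibd_true + rng.exponential(0.8, size=m)

    # Haseman-Elston: a significantly negative slope suggests linkage.
    slope, intercept = np.polyfit(ibd_hat, sq_diff, 1)

    grid = np.linspace(0.0, 1.0, 11)
    smooth = nadaraya_watson(ibd_hat, sq_diff, grid)
    print(f"HE slope: {slope:.2f}")
    print(np.round(smooth, 2))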