## Jeffrey Thompson, University of Florida

### Estimation of Generalized Simple Measurement Error Models with Instrumental Variables

Measurement error (ME) models are used in situations where at least one independent variable in the model is imprecisely measured. Having at least one independent variable measured with error leads to an unidentified model and a bias in the naive estimate of the effect of the variable that is measured with error. One way to correct these problems is through the use of an instrumental variable (IV). An IV is one that is correlated with the unknown, or latent, true variable, but uncorrelated with the measurement error of the unknown truth and the model error. An IV provides the identifying information in our method of estimating the parameters for generalized simple measurement error (GSME) models. The GSME model is developed and it is shown how many well-studied ME models with one predictor fit into its framework. Included in these are linear, generalized linear, nonlinear, multinomial, multivariate regression, and other ME models. The GSME model, by design, can handle situations with continuous, discrete, and categorical observable, or manifest, variables. We provide theorems that give conditions under which the GSME model is identified. The initial step in our estimation method is to "categorize" all continuous and discrete variables. Categorical variables remain unchanged. Assuming conditional independence given the latent variable, the joint distribution of the categorized manifest variables and any that were already categorical is the product of the conditional cell probabilities and conditional distributions of the categorized continuous and discrete manifest variables summing over the categorical values of the latent variable. Maximum likelihood estimates of the joint categorical distribution are used to solve nonlinear equations for the parameters of interest which enter through the conditional probabilities. Estimated generalized nonlinear least squares is used to solve the equations for the parameters of interest.
We show that our estimators have favorable asymptotic properties and develop methods of inference for them. We show how many commonly studied ME model problems fit into the general framework developed and how they can be solved using our method.
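
The bias of the naive estimate and the identifying role of an instrument can be illustrated in the simplest linear case. The sketch below is not the GSME estimator described above; it uses the classical ratio-of-covariances IV estimator on simulated data, and the model, variable names, and parameter values are all assumptions chosen for illustration. With these variances the naive slope is attenuated by the factor var(x) / (var(x) + var(u)) = 1/2.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def simulate(n=20000, beta=2.0, seed=0):
    """Hypothetical linear ME model: y = beta*x + e, w = x + u, instrument z."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]         # instrument
    x = [zi + rng.gauss(0, 1) for zi in z]          # latent truth, correlated with z
    w = [xi + rng.gauss(0, 2 ** 0.5) for xi in x]   # error-prone measurement of x
    y = [beta * xi + rng.gauss(0, 1) for xi in x]   # response depends on the truth
    return z, w, y

z, w, y = simulate()
naive_slope = cov(w, y) / cov(w, w)   # regress y on w: attenuated toward zero
iv_slope = cov(z, y) / cov(z, w)      # classical IV (ratio) estimator
print(f"naive: {naive_slope:.3f}  IV: {iv_slope:.3f}")
```

The instrument works here because `z` is correlated with the latent `x` but independent of both the measurement error and the model error, which matches the definition of an IV given in the abstract.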

## Brett Presnell, University of Florida

### The In-and-Out-of-Sample (IOS) Likelihood Ratio Test for Model Misspecification

A new test of model misspecification is proposed, based on the ratio of in-sample and out-of-sample likelihoods. The test is broadly applicable, and in simple problems approximates well known and intuitively appealing methods. Using jackknife influence curve approximations, it is shown that the test statistic can be viewed asymptotically as a multiplicative contrast between two estimates of the information matrix, both of which are consistent under correct model specification. This approximation is used to show that the statistic is asymptotically normally distributed, though it is suggested that p-values be computed using the parametric bootstrap. The resulting methodology is demonstrated with a variety of examples and simulations involving both discrete and continuous data. This is joint work with Dennis Boos.
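
As a rough illustration of the idea, the sketch below computes the log ratio of the in-sample likelihood to a leave-one-out likelihood for a one-parameter exponential model. Reading "out-of-sample" as leave-one-out, the choice of model, and the simulated data are assumptions made for this example; the function name `ios_exponential` is hypothetical. A first-order expansion suggests the statistic should be near the number of parameters (one here) when the model is correct.

```python
import math
import random

def ios_exponential(y):
    """Log ratio of in-sample to leave-one-out likelihoods for an Exp(rate) model."""
    n, total = len(y), sum(y)
    rate_full = n / total                      # MLE from all observations
    in_sample = sum(math.log(rate_full) - rate_full * yi for yi in y)
    out_sample = 0.0
    for yi in y:
        rate_loo = (n - 1) / (total - yi)      # MLE with observation i deleted
        out_sample += math.log(rate_loo) - rate_loo * yi
    return in_sample - out_sample

rng = random.Random(1)
y_exp = [rng.expovariate(1.0) for _ in range(2000)]         # model is correct
y_gam = [rng.gammavariate(5.0, 1.0) for _ in range(2000)]   # model is misspecified
ios_correct = ios_exponential(y_exp)
ios_wrong = ios_exponential(y_gam)
print(f"correct model: {ios_correct:.3f}  misspecified: {ios_wrong:.3f}")
```

In practice the abstract recommends calibrating the statistic with a parametric bootstrap rather than relying on the normal approximation; that step is omitted here.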

## Robert M. Dorazio, United States Geological Survey

### Estimating Size and Composition of Biological Communities by Modeling the Occurrence of Species

We develop a model that uses repeated observations of a biological community to estimate the number and composition of species in the community. Estimators of community-level attributes are constructed from model-based estimators of occurrence of individual species that also incorporate imperfect detection of individuals. Data from the North American Breeding Bird Survey are analyzed to illustrate the variety of ecologically important quantities that are easily constructed and estimated using our model-based estimators of species occurrence. In particular, we compute site-specific estimates of species richness that honor classical notions of species-area relationships. Extensions of our model may be used to estimate maps of occurrence of individual species and to compute inferences related to the temporal and spatial dynamics of biological communities.

## C.R. Rao, Pennsylvania State University

### Anti-eigen and anti-singular values of a matrix and applications to problems in statistics

Let $A$ be a $p\times p$ positive definite matrix. A nonzero $p$-vector $x$ such that $Ax=\lambda x$ is called an eigenvector with associated eigenvalue $\lambda$. Equivalent characterizations are: (i) $\cos \theta=1$, where $\theta$ is the angle between $x$ and $Ax$; (ii) $(x^\prime Ax)^{-1}=x^\prime A^{-1}x$; (iii) $\cos \phi=1$, where $\phi$ is the angle between $A^{1/2}x$ and $A^{-1/2}x$. We ask which $x$ minimizes $\cos\theta$ as defined in (i), or equivalently maximizes the angle of separation between $x$ and $Ax$. Such a vector is called an anti-eigenvector and $\cos\theta$ an anti-eigenvalue of $A$. This is the basis of operator trigonometry developed by K. Gustafson and P.D.K.M. Rao (1997), *Numerical Range: The Field of Values of Linear Operators and Matrices*, Springer. We may define a measure of departure from condition (ii) as $\min[(x^\prime Ax)(x^\prime A^{-1}x)]^{-1}$, which gives the same anti-eigenvalue. The same result holds if the maximum of the angle $\phi$ between $A^{1/2}x$ and $A^{-1/2}x$ as in condition (iii) is sought. We define a hierarchical series of anti-eigenvalues, and also consider optimization problems associated with measures of separation between an $r$-dimensional ($r<p$) subspace $S$ and its transform $AS$. Similar problems are considered for a general matrix $A$ and its singular values, leading to anti-singular values. Other possible definitions of anti-eigen and anti-singular values, and applications to problems in statistics, will be presented.
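
For characterization (i), the minimum of $\cos\theta$ has a well-known closed form in terms of the extreme eigenvalues, $2\sqrt{\lambda_{\max}\lambda_{\min}}/(\lambda_{\max}+\lambda_{\min})$, attained by a vector lying in the span of the two extreme eigenvectors. The sketch below checks this numerically. It assumes $A$ is diagonal (any positive definite matrix can be brought to this form in its eigenbasis), and the eigenvalues are hypothetical values chosen for the example.

```python
import math
import random

def cos_angle(eigs, x):
    """cos(theta) between x and Ax for the diagonal matrix A = diag(eigs)."""
    ax = [lam * xi for lam, xi in zip(eigs, x)]
    dot = sum(a * b for a, b in zip(x, ax))
    return dot / (math.hypot(*x) * math.hypot(*ax))

eigs = [9.0, 4.0, 1.0]                       # hypothetical eigenvalues of A
lam_max, lam_min = max(eigs), min(eigs)

# Closed form for the first anti-eigenvalue (minimum of cos theta)
anti_eigenvalue = 2 * math.sqrt(lam_max * lam_min) / (lam_max + lam_min)

# Anti-eigenvector: weight only on the two extreme eigen-directions
x_anti = [0.0] * len(eigs)
x_anti[eigs.index(lam_max)] = math.sqrt(lam_min / (lam_max + lam_min))
x_anti[eigs.index(lam_min)] = math.sqrt(lam_max / (lam_max + lam_min))

# Crude random search as a sanity check that no sampled direction does better
rng = random.Random(0)
best_random = min(
    cos_angle(eigs, [rng.gauss(0, 1) for _ in eigs]) for _ in range(20000)
)
print(anti_eigenvalue, cos_angle(eigs, x_anti), best_random)
```

The random search is only a check on the closed form, not a recommended way to compute anti-eigenvalues.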


## Myron Chang, University of Florida

### Score Statistics for Mapping Quantitative Trait Loci

We propose a method to detect the existence of a quantitative trait locus (QTL) in a backcross population using a score test. Under the null hypothesis of no QTL, all phenotype random variables are independent and identically distributed, and the maximum likelihood estimates (MLEs) of parameters in the model are usually easy to obtain. Since the score test only uses the MLEs of parameters under the null hypothesis, it is computationally simpler than the likelihood ratio test (LRT). Moreover, because the location parameter of the QTL is unidentifiable under the null hypothesis, the distribution of the maximum of the LRT statistics, typically the statistic of choice for testing $H_0:$ no QTL, does not have the standard chi-square distribution asymptotically under the null hypothesis. From the simple structure of the score test statistics, the asymptotic null distribution can be derived for the maximum of the square of score test statistics. Numerical methods are proposed to compute the asymptotic null distribution and the critical thresholds can be obtained accordingly. A simple backcross design is used to demonstrate the application of the score test to QTL mapping. The proposed method can be readily extended to more complex situations. This is joint work with Rongling Wu, Samuel S. Wu and George Casella.

## Jim Hobert, University of Florida

### Honest MCMC via Drift and Minorization

I call an MCMC algorithm "honest" if it is possible to calculate standard errors using a valid central limit theorem along with a consistent estimate of the (unknown) asymptotic variance. In this talk, I will explain how drift and minorization conditions on the underlying Markov chain make honest MCMC possible. (This is joint work with Galin Jones, Brett Presnell and Jeffrey Rosenthal.)
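
To make the notion of a standard error for an MCMC average concrete, the sketch below uses batch means, one common estimator of the asymptotic variance. The abstract does not say which estimator is used in the talk, so this choice is an assumption, as are the AR(1) chain standing in for MCMC output and the batch size of 100.

```python
import math
import random

def batch_means_se(chain, batch_size):
    """Batch-means standard error of the chain average."""
    n_batches = len(chain) // batch_size
    used = chain[: n_batches * batch_size]
    overall = sum(used) / len(used)
    batch_avgs = [
        sum(used[k * batch_size:(k + 1) * batch_size]) / batch_size
        for k in range(n_batches)
    ]
    # batch_size * (sample variance of batch averages) estimates the
    # asymptotic variance appearing in the Markov chain CLT
    var_hat = batch_size * sum((b - overall) ** 2 for b in batch_avgs) / (n_batches - 1)
    return math.sqrt(var_hat / len(used))

# Hypothetical stand-in for MCMC output: an AR(1) chain with strong autocorrelation
rng = random.Random(42)
rho, chain, state = 0.9, [], 0.0
for _ in range(10000):
    state = rho * state + rng.gauss(0, 1)
    chain.append(state)

mean = sum(chain) / len(chain)
naive_se = math.sqrt(sum((c - mean) ** 2 for c in chain) / (len(chain) - 1) / len(chain))
mcmc_se = batch_means_se(chain, batch_size=100)
print(f"estimate {mean:.3f}, naive SE {naive_se:.4f}, batch-means SE {mcmc_se:.4f}")
```

The naive standard error treats the draws as independent and is too small for a positively autocorrelated chain. Whether the batch-means figure can be trusted depends on a CLT actually holding for the chain, which is where the drift and minorization conditions of the talk come in.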

## William Bell, United States Bureau of Census

### Accounting for Uncertainty About Variances and Correlations in Small Area Estimation

Small area estimation often uses a linear model applied to direct survey small area estimates. The model includes regression effects, small area random effects (model error), and sampling error. Often the magnitude of the sampling error greatly exceeds that of the model error, resulting in considerable uncertainty about properties (variances, correlations) of the model error based on statistical inferences from the direct estimates data. In this situation, Bayesian techniques have important advantages over classical approaches for making inferences about the true quantities of interest. This is illustrated for the estimation of state age-group poverty rates in the Census Bureau's SAIPE (Small Area Income and Poverty Estimates) project. SAIPE applies a small area model to direct state poverty estimates from the Current Population Survey (CPS). Particular issues that arise for the SAIPE state model include (i) estimating the mean squared error of the model-based estimates, (ii) assessing whether borrowing strength from CPS estimates for the previous year or for a different age group can improve the estimates, and (iii) assessing whether borrowing strength from estimates from another survey (American Community Survey) will improve the poverty estimates.
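
The role of the two variance components can be seen in the standard composite estimator for a model of this type, which averages the direct estimate and the regression prediction with a weight determined by the model-error and sampling-error variances. The sketch below treats both variances as known, which is precisely the simplification the talk argues against; the function name and all numbers are hypothetical.

```python
def shrinkage_estimate(direct, synthetic, model_var, sampling_var):
    """Weighted average of the direct estimate and the regression prediction."""
    weight = model_var / (model_var + sampling_var)   # weight on the direct estimate
    return weight * direct + (1 - weight) * synthetic, weight

# Hypothetical numbers: a direct poverty-rate estimate of 18% with large
# sampling variance, and a regression prediction of 14%.
estimate, weight = shrinkage_estimate(
    direct=0.18, synthetic=0.14, model_var=0.0001, sampling_var=0.0009
)
print(f"weight on direct estimate: {weight:.2f}, composite estimate: {estimate:.3f}")
```

When sampling error dominates, the weight on the direct estimate is small and the result is sensitive to the assumed model-error variance, which the data estimate poorly. That sensitivity is the motivation for accounting for uncertainty in the variance, as the Bayesian approach does.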

## Trevor Park, University of Florida

### A Penalized Likelihood Approach to Principal Component Rotation

Principal component analysis, widely used to explore multidimensional data sets, is susceptible to high sampling variability. Sampling variability contributes to the difficulty frequently encountered in interpreting individual principal components. Rotation techniques (borrowed from factor analysis) and some recent alternatives have been proposed for producing more interpretable principal components, but none has a definite relationship to sampling variability. Principal components can be made simultaneously more stable and interpretable if estimated via penalized likelihood, with a penalty term that encourages the components to assume a simpler form. In contrast to other methods for principal component rotation, penalized likelihood can be more precisely controlled, naturally targets components that are ill-defined, and allows assessment of the statistical plausibility of the modified components. Simple variants of the penalty term can be used to preferentially rotate the major components or to respect special structure. The estimation requires an optimization subject to orthogonality constraints, for which efficient specialized algorithms have recently been developed. A Stiefel manifold variant of the Newton algorithm appears to work well for this problem and has potential applications to related statistical problems.

## Dongchu Sun, University of Missouri

### Characterizing and Eradicating Autocorrelation in MCMC Algorithms

In spite of increasing computing power, many statistical problems give rise to Markov chain Monte Carlo algorithms that mix too slowly to be useful. Most often, this poor mixing is due to high posterior correlations between parameters. In the illustrative special case of Gaussian linear models with known variances, we characterize which functions of parameters mix poorly and demonstrate a practical way of removing all autocorrelations from the algorithms by adding Gibbs steps in suitable directions. These additional steps are also very effective in practical problems with nonnormal posteriors, and are particularly easy to implement.

## Bikas K. Sinha, Indian Statistical Institute

### On Some Probabilistic and Statistical Aspects of Discovery of New Species

Following Chao (Ann Stat., 1984), Chao and Lee (Jour Amer Stat Assoc, 1992) and Sinha and Sengupta (Cal Stat Assoc Bull, 1993), we consider the problem of discovery of ALL species in a species search problem under an infinite population set-up. Assuming that there are $N$ species with abundance rates $p_1 \ge p_2 \ge \cdots \ge p_N$, we work out the probability $\Phi(t, N, \underline{P})$ that more than $t$ successive attempts are needed to discover all the species. Denoting by $\phi(p,t,N)$ the probability of discovery of ALL species in $t$ successive attempts with the species having abundance rate $p$ being observed last, we establish the intuitive claim: $\phi(p,t,N) \ge \phi(q,t,N)$ whenever $p \le q$, uniformly in $t > 0$. Next we establish that $\Phi(t,N,\underline{P})$ is the least (uniformly in $t > 0$) when the species are equally abundant. This demonstrates further that the average number of successive attempts needed to observe all species is the least when they are equally abundant. Results for a finite closed population will also be stated. Finally, data analysis aspects will be addressed.
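
The probability that more than a given number of attempts is needed can be computed directly for small examples by inclusion-exclusion over the set of species that are missed. The sketch below assumes each attempt independently discovers species $i$ with probability $p_i$ (one reading of the infinite population set-up); the function name and the abundance vectors are hypothetical.

```python
from itertools import combinations

def prob_more_than_t(p, t):
    """P(more than t i.i.d. draws are needed to observe every species)."""
    n = len(p)
    all_seen = 0.0
    # Inclusion-exclusion over the set of species missed in all t draws
    for size in range(n + 1):
        for missed in combinations(range(n), size):
            miss_prob = sum(p[i] for i in missed)
            all_seen += (-1) ** size * (1.0 - miss_prob) ** t
    return 1.0 - all_seen

equal = prob_more_than_t([1 / 3, 1 / 3, 1 / 3], t=5)
unequal = prob_more_than_t([0.5, 0.3, 0.2], t=5)
print(f"equal abundance: {equal:.4f}, unequal abundance: {unequal:.4f}")
```

For these two abundance vectors the equally abundant case gives the smaller probability, in line with the minimality result stated in the abstract. The enumeration is exponential in the number of species and is meant only for small illustrations.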


## Alex Trindade, University of Florida

### Practical Small-Sample Inference for Order One Subset Autoregressive Models via Saddlepoint Approximations

We propose some approaches for saddlepoint approximating distributions of estimators in single lag subset autoregressive models of order one. The roots of estimating equations method is the most promising, not requiring that an explicit expression for the estimator be available. We find the distributions of the Burg estimators to be almost coincidental with those of maximum likelihood. By inverting a hypothesis test, we show how confidence intervals for the autoregressive coefficient can be constructed from saddlepoint approximations. A simulation study reveals the resulting coverage probabilities to be very close to nominal levels, and to generally outperform asymptotics-based confidence intervals. The reason is shown to be linked to near parameter orthogonality between the autoregressive coefficient and the white noise variance. Our findings are substantiated by percent relative error calculations that show the saddlepoint approximations to be very accurate.

## Sudipto Banerjee, University of Minnesota

### On Bayesian Wombling: Modelling Spatial Gradients

Spatial process models are now widely used for inference in many areas of application. In such contexts interest often lies in estimating the rate of change of a spatial surface at a given location in a given direction. This problem, known as "wombling" after a foundational paper by William Womble (Science, 1951), is encountered in several scientific disciplines. Examples include temperature or rainfall gradients in meteorology, pollution gradients for environmental data and surface roughness assessment for digital elevation models. Since the spatial surface is viewed as a random realization, all such rates of change are random as well. We formalize the notions of directional finite difference processes and directional derivative processes, building upon the concept of mean square differentiability, and obtain complete distribution theory results for stationary Gaussian process models. We present inference under a Bayesian framework which, in this setting, presents several modelling advantages. Illustrations are provided with simple and complex spatial models.

## Dobrin Marchev, University of Florida

### Monte Carlo Methods for Posterior Distributions Associated with Multivariate Student's $t$ Data

Let $\pi$ denote the posterior distribution that results when a random sample of size $n$ from a $d$-dimensional location-scale Student's $t$ distribution (with $\nu$ degrees of freedom) is combined with the standard non-informative prior. We consider several Monte Carlo methods for sampling from the intractable posterior $\pi$, including rejection samplers and Gibbs samplers. Special attention is paid to the MCMC algorithm developed by van Dyk and Meng (JCGS, 2001), who provided considerable empirical evidence suggesting that their algorithm converges to stationarity much faster than the standard data augmentation Gibbs sampler. We formally analyze the relevant Markov chain underlying van Dyk and Meng's algorithm and show that for many $(d,\nu,n)$ triples it is geometrically ergodic.

## Subhashis Ghosal, North Carolina State University

### Bayesian Nonparametric Binary Regression using a Gaussian process prior

Binary response ($Y$) data corresponding to a continuous covariate is often encountered in practice. In the classical parametric approach, the response probability $p(z)=P(Y=1|z)$, as a function of the covariate $z$, is modeled as a known function $H(z,\theta)$, except for a finite dimensional parameter $\theta$. The link function $H$ is usually taken to be a cumulative probability distribution. The parametric approach may, however, fail to accommodate a variety of data. A nonparametric approach consists of estimating the whole link function. Assuming that the link function $H$ is a cumulative probability distribution, a Bayesian approach can be followed by putting a Dirichlet process prior (or a smoother version of it) on $H$. If $H$ is not a monotone function (which happens in toxicity studies and some other examples), the above approach may lead to serious bias. In this talk, we present a method of estimating the response probability based on local smoothness assumptions only, without any monotonicity restriction. We follow a Bayesian approach by putting a (transformed) Gaussian process prior on the response probabilities. Such a prior has large support in that no shape is ruled out by the prior. We describe a Markov chain Monte Carlo method to compute the posterior mean of the response probability function $p(z)$ and posterior credible bands. If the design points satisfy certain requirements (which are satisfied by the regularly spaced design, or with large probability, by random designs) and the true response probability $p_0(z)$ is twice continuously differentiable and never 0 or 1, we show that the posterior distribution of the response probability function $p$ is consistent for the $L_1$-distance $d(p_1,p_2)=\int |p_1(z)-p_2(z)| dz$. In order to establish this result, we also extend the classical theorem of Schwartz (1965, Z. Wahr. Verw. Gebiete 4, 10--26) to independent non-identically distributed observations.
It is shown that consistency holds if the prior assigns positive probabilities to some suitable neighborhoods, and certain uniformly exponentially powerful tests exist on some appropriate sieves. Some deep properties of Gaussian processes are exploited to show that the required conditions hold. A simulation study shows that our method performs well in comparison with other methods of estimation available in the literature. We also apply our method to some real data. The talk is based on joint work with Nidhan Choudhuri of Case Western Reserve University and Anindya Roy of the University of Maryland, Baltimore County.

## Norman Breslow, University of Washington

### Optimal Design and Efficient Analysis of Two-Phase Case-Control and Case-Cohort Studies

Two phase or double sampling was introduced by Neyman as a technique for drawing efficient stratified samples. We consider this design initially for the case of a binary outcome variable. When strata and phase two sampling fractions depend both on outcome and covariates, care must be taken with the analysis of data from the resulting biased sample. The standard survey sampling (Horvitz-Thompson) approach involves weighting the log-likelihood contributions by the inverse sampling fractions. "Pseudolikelihood" estimates maximize a product of biased sampling probabilities and are typically more efficient. Only recently have semiparametric efficient and maximum (profile) likelihood procedures become available.
We first develop asymptotic information bounds and the form of the efficient score and influence functions for coefficients in semiparametric regression models fitted to two phase stratified samples, and point out the relationship of this work to the information bound calculations of Robins and colleagues. By verifying conditions of Murphy and van der Vaart for a least favorable parametric submodel, we provide asymptotic justification for statistical inference based on profile likelihood.
Using data from the National Wilms Tumor Study, and simulations based on these data, we then demonstrate the advantages of careful selection of the phase two sample and use of an efficient analysis method. One can limit collection of data on an "expensive" covariate to a fraction of the phase one sample, yet almost exactly reproduce the results that would have been obtained with complete data for everyone. The basic principles include: (i) fine stratification of the phase one sample using outcome and available covariates; (ii) use of a near optimal "balanced" rather than a simple case-control sample at phase two; and (iii) fully efficient estimation. These same basic principles extend to the design and analysis of two phase exposure stratified case-cohort studies, which involve a censored "survival" outcome and on which much work is in progress.
Portions of this work are joint with Nilanjan Chatterjee, Brad McNeney and Jon Wellner.
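
The inverse-sampling-fraction weighting mentioned in the first paragraph of this abstract can be illustrated on a simpler target than a log-likelihood. The sketch below applies the same weights to the estimation of a mean from a stratified phase-two sample; the strata, sampling fractions, and values are hypothetical and chosen so the arithmetic is exact.

```python
# Hypothetical phase-one cohort of 1100 units in two strata. Phase two keeps
# 10% of stratum A (1000 units, x = 1) and all of stratum B (100 units, x = 10).
phase_two = (
    [{"x": 1.0, "fraction": 0.1}] * 100      # 100 sampled of 1000 in stratum A
    + [{"x": 10.0, "fraction": 1.0}] * 100   # all 100 units in stratum B
)
phase_one_size = 1100

# Ignoring the design over-represents stratum B
unweighted_mean = sum(u["x"] for u in phase_two) / len(phase_two)

# Horvitz-Thompson: weight each sampled unit by 1 / (its sampling fraction)
ht_total = sum(u["x"] / u["fraction"] for u in phase_two)
ht_mean = ht_total / phase_one_size

print(f"unweighted: {unweighted_mean:.3f}, inverse-probability weighted: {ht_mean:.3f}")
```

The weighted estimate recovers the phase-one mean of 2000/1100, while the unweighted one does not. The same weights applied to log-likelihood contributions give the Horvitz-Thompson analysis described above, which the abstract notes is typically less efficient than pseudolikelihood or profile likelihood.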

## Challis Lecture, Norman Breslow, University of Washington

### The Value of Long Term Follow-up: Lessons from the National Wilms Tumor Study

Wilms tumor (WT) is an embryonal tumor of the kidney that affects approximately one child in every 10,000. During the 20th century, cure rates increased from 10% to 90% as first radiation and then chemotherapy joined surgical removal of the diseased kidney as standard treatment. The National Wilms Tumor Study Group (NWTSG) conducted five protocol studies and registered nearly 10,000 patients during 1969-2002. For the past 15-20 years the study enrolled 70-80% of the 550 cases estimated to occur annually in North America. Its focus has been the identification of patient subgroups at high and low risk of relapse, and the substitution of combination chemotherapy for radiation therapy, with a primary goal to reduce long term complications while producing the maximum number of cures.
The NWTSG Data and Statistical Center, located in Seattle for the 33 year duration of the study, played a major role in this effort. Systematic follow-up of surviving patients documented the long term "costs of cure" and the wisdom of reserving the most toxic treatments for those who actually needed them. Secondary cancers, for example, which once affected 1.6% of Wilms tumor survivors by 15 years from diagnosis, have been much reduced since 60% of patients no longer receive radiation therapy.
Statistical study of the NWTSG database has challenged prevailing theories for the genetic origins of WT and led to new hypotheses for investigation by molecular biologists.
This talk considers three issues: (1) whether all bilateral and multicentric WT are hereditary; (2) whether Asians lack WT caused by loss of imprinting of the insulin-like growth factor gene IGF2; and (3) whether constitutional deletion of the WT gene WT1 in patients with the WT-aniridia (WAGR) syndrome has a less severe effect on renal function than a point mutation in WT1 in patients with the Denys-Drash syndrome. Key factors that facilitated these statistical contributions include a compulsive effort to maintain continuity in data collection and follow-up and a constant search for ways to use the clinical and epidemiologic data to answer questions of basic biological significance.