Su-Chun Cheng, University of California San Francisco

Semiparametric regression analysis of mean residual life with censored survival data

As a function of time t, mean residual life is the remaining life expectancy of a subject given survival up to t. The proportional mean residual life model, proposed by Oakes & Dasu (1990), provides an alternative to the Cox proportional hazards model to study the association between survival times and covariates. In the presence of censoring, we develop semiparametric inference procedures for the regression coefficients of the Oakes-Dasu model using martingale theory for counting processes. We also present simulation studies and an application to the Veterans' Administration lung cancer data. This is joint work with Y. Q. Chen.
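As a rough illustration of the quantity being modeled, the sketch below computes an empirical mean residual life, m(t) = E[T - t | T > t], from a fully observed sample. The function name and the toy data are hypothetical, and the sketch ignores censoring, which is precisely what the procedures in the talk are designed to handle.

```python
import numpy as np

def mean_residual_life(times, t):
    """Empirical mean residual life at time t for an uncensored sample.

    Averages (T - t) over subjects with T > t. Returns NaN when no
    subject survives past t.
    """
    times = np.asarray(times, dtype=float)
    survivors = times[times > t]
    if survivors.size == 0:
        return float("nan")
    return float(np.mean(survivors - t))

# Hypothetical survival times (no censoring assumed).
sample = [1.0, 2.0, 3.0, 4.0]
print(mean_residual_life(sample, 2.0))  # averages over the survivors 3 and 4
```

With censored data, a more realistic version would replace the raw average with an integral of an estimated survival curve.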

Mary Christman, University of Maryland

Methods For Estimating The Spatial Average Over An Irregularly-Shaped Study Region


Using data collected from fishery-independent surveys in the Chesapeake Bay (eastern U.S.), we compare several methods for estimating relative abundance using catch-per-unit-effort (CPUE) data for a study area that is irregular in shape. The methods are: an approximation to block kriging, approximate block kriging in the presence of trend, and design-based estimation based on stratified multistage cluster sampling. We describe a method for estimating the spatial average and its SE using an approximation for block kriging which incorporates a trend component. What distinguishes this work from universal block kriging is the potential use of covariates other than the spatial indices common in universal kriging and the use of block kriging over an irregular shape. We show that the kriging error for the spatial mean based on the new method is lower than the variance estimated by the design-based method. The method is general and can be applied in other similar situations.
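For orientation, the design-based comparator can be sketched in a much simplified form. The code below assumes simple random sampling within strata, with no cluster stages and no finite population correction, so it is only a stand-in for the stratified multistage estimator mentioned above. The function name and inputs are hypothetical.

```python
import math

def stratified_mean_and_se(strata_samples, strata_weights):
    """Stratified estimate of a population mean and its standard error.

    strata_samples: list of lists of observations, one list per stratum.
    strata_weights: stratum shares of the study area, summing to 1.
    Assumes simple random sampling within strata and ignores the finite
    population correction and any cluster stages.
    """
    mean = 0.0
    variance = 0.0
    for sample, weight in zip(strata_samples, strata_weights):
        n = len(sample)
        stratum_mean = sum(sample) / n
        # Unbiased within-stratum sample variance (requires n >= 2).
        s2 = sum((y - stratum_mean) ** 2 for y in sample) / (n - 1)
        mean += weight * stratum_mean
        variance += weight ** 2 * s2 / n
    return mean, math.sqrt(variance)

# Two hypothetical strata covering equal shares of the study region.
estimate, se = stratified_mean_and_se([[1.0, 3.0], [5.0, 7.0]], [0.5, 0.5])
```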

Erning Li, North Carolina State University

Estimation for Generalized Linear Models When Covariates Are Subject-specific Parameters in a Mixed Model for Longitudinal Measurements


The relationship between a primary endpoint and features of longitudinal profiles of a continuous response is often of interest. One challenge is that the features of the longitudinal profiles are observed only through the longitudinal measurements, which are subject to measurement error and other variation. A relevant framework assumes that the longitudinal data follow a linear mixed model whose random effects are covariates in a generalized linear model for the primary endpoint. Methods proposed in the literature require a parametric (normality) assumption on the random effects, which may be unrealistic. We propose a conditional likelihood approach, which requires no assumptions on the random effects, and a semiparametric full likelihood approach, which requires only the assumption that the random effects have a smooth density. The conditional likelihood approach is straightforward and fast to implement. The semiparametric full likelihood approach is generally implemented with the EM algorithm and carries an increased computational burden. Simulation results show that, in contrast to methods predicated on a parametric (normality) assumption for the random effects, the approaches yield valid inferences under departures from this assumption and are competitive when the assumption holds. The semiparametric full likelihood approach shows some efficiency gains over the other methods and provides estimates for the underlying random effects distribution. We also illustrate the performance of the approaches by application to a study of bone mineral density and longitudinal progesterone levels in 624 women transitioning to menopause, in which investigators wished to understand the association between osteopenia, characterized by bone mineral density at or below the 33rd percentile, and features of hormonal patterns over the menstrual cycle in peri-menopausal women. Data analysis results obtained from the approaches offer the analyst assurance of credible estimation of the relationship.

Reid Landes, Iowa State University.

A Bayesian Statistical Calibration Accounting for Measurement Error in the Reference Device

We use a Bayes hierarchical model to enable the statistical calibration of a set of mass-produced devices when the reference instrument measures the true state accurately, but with additive Gaussian error. We show how to arrive at the estimates of the calibration parameters for the tested devices, the prediction of the same for an untested device, and the predictions of the true state given an observation from both a tested and an untested device via MCMC. A real example of a calibration experiment involving 12 resistance thermocouple devices (RTDs) and a NIST-approved accurate thermometer is provided.

Dawei Xie, University of Michigan.

Combining Information from Multiple Surveys for Small Area Estimation: A Bayesian Approach

Cancer surveillance research requires accurate and precise estimates of the prevalence of cancer risk factors and screening for small areas such as counties. Two popular data sources are the Behavioral Risk Factor Surveillance System (BRFSS) and the National Health Interview Survey (NHIS). Both data sources have advantages and disadvantages. The BRFSS is a larger survey, and almost every county is included in it; but it has lower response rates and, being a telephone survey, it does not include subjects who live in households with no telephones. On the other hand, the NHIS is a smaller survey, with the majority of counties not included; but it includes both telephone and non-telephone households and has higher response rates. A preliminary analysis shows that the distributions of cancer screening and risk factors are different for telephone and non-telephone households. Thus, information from the two surveys may have to be combined to address both nonresponse and noncoverage errors. A hierarchical Bayesian approach is used to combine information from both surveys to construct county-level estimates. The proposed model incorporates potential noncoverage and nonresponse biases in the BRFSS as well as complex sample design features of both surveys. A Markov Chain Monte Carlo method is used to simulate draws from the joint posterior distribution of unknown quantities in the model based on the design-based direct estimates and county-level covariates. Yearly county-level prevalence estimates for 50 states (including D. C.), and the whole state of Alaska, are developed for ten outcomes using BRFSS and NHIS data from 1997-2000. The outcomes include smoking and common cancer screening procedures. The NHIS/BRFSS combined county estimates are substantially different from those based on BRFSS alone.

Xiaohong Huang, University of Minnesota.

Statistical methods for sample classification and prediction with microarray gene expression data


Using gene expression data to classify sample types or patient survivals has received much research attention recently. To accommodate special features of gene expression data, several new methods have been proposed, including a weighted voting scheme of Golub et al. (1999), a compound covariate method of Hedenfalk et al. (2001) (originally proposed by Tukey (1993)), and a shrunken centroids method of Tibshirani et al. (2002). These methods look different and are more or less ad hoc. Here we point out a close connection of the three methods with a linear regression model and partial least squares (PLS). Under the general framework of PLS, we propose a penalized PLS (PPLS) method that can handle both categorical (for classification) and continuous (e.g. survival times) responses. Using real data, we show the competitive performance of our proposal when compared with other methods. This is joint work with Wei Pan (Biostatistics, U of Minnesota) and Jennifer Hall (Medicine, U of Minnesota).

Robert Greevy, University of Pennsylvania.

Randomization Inference in Covariance Adjustment in a Randomized Controlled Trial with Incomplete Longitudinal Data


A recent proposal for randomization inference in covariance adjustment offers the option to control for baseline imbalances with various regression methods while preserving the framework of a randomization test. When the method is applied to a randomized controlled trial, adjusting for baseline differences yields narrower confidence intervals. The method will be illustrated in a clinical trial of treatments following childhood cancer, with incomplete longitudinal data from a thick-tailed multivariate distribution.
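The general idea can be sketched as follows, under assumptions not stated in the abstract: outcomes are regressed on a baseline covariate by ordinary least squares with treatment left out of the fit, and an exact randomization test is then applied to the residuals. The function names are hypothetical, the statistic (a difference in mean residuals) is one choice among many, and nothing here handles incomplete longitudinal data.

```python
from itertools import combinations
import numpy as np

def covariate_residuals(y, x):
    """Residuals from a least-squares fit of y on an intercept and x.

    Treatment is deliberately left out of the fit, so under the sharp null
    of no effect the residuals do not depend on the assignment.
    """
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

def randomization_p_value(residuals, treated):
    """Exact two-sided randomization p-value for a difference in mean residuals.

    Enumerates every assignment with the same number of treated subjects,
    so it is only practical for small samples.
    """
    residuals = np.asarray(residuals, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    n, m = len(residuals), int(treated.sum())

    def diff(mask):
        return residuals[mask].mean() - residuals[~mask].mean()

    observed = abs(diff(treated))
    count = total = 0
    for idx in combinations(range(n), m):
        mask = np.zeros(n, dtype=bool)
        mask[list(idx)] = True
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(diff(mask)) >= observed - 1e-12:
            count += 1
    return count / total
```

A confidence interval could in principle be obtained by inverting such a test over a range of hypothesized effects; that step is omitted here.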

Jonathan Schildcrout, University of Washington.

Marginal Regression Modeling of Longitudinal, Categorical Response Data


Longitudinal regression analysis is important in a variety of settings when the goal is to characterize changes that occur over time. The focus of this talk is on marginal regression models for longitudinal, categorical response data. I will first discuss a consistency-efficiency tradeoff with semi-parametric modeling when the goal is to estimate the cross-sectional relationship between the response and an exposure E[Y(t) | X(t)]. Next, I will describe the "marginalized" model class which permits likelihood-based estimation of marginal regression parameters. I will extend this class to accommodate response dependence that I have seen with long series of response data (the functional form of response dependence has both serial and long-range components). Finally, I will discuss prospective inference with outcome dependent sampling. One situation where such a sampling scheme might be important is in a study where interest is in estimating the relationship between a response and a time-varying exposure, the exposure is expensive to measure, and a number of subjects exhibited no response variation during the study period (e.g., never had symptoms). With this sampling design, under certain conditions, we are able to make valid inference that is efficient when we exclude subjects without response variation as long as we account for the covariate ascertainment mechanism.

Tanya Apanosovich, Texas A&M University.

Semiparametric Spatial Modeling of Binary Outcomes, With Application to Aberrant Crypt Foci in Colon Carcinogenesis Experiments


Our work is directed towards the analysis of aberrant crypt foci (ACF) in colon carcinogenesis. ACF are morphologically changed colonic crypts that are known to be precursors of colon cancer development. In our experiment, all animals were exposed to a carcinogen, and some were exposed to radiation. The colon is laid out as a rectangle, much longer than it is wide (hence the longitudinal aspect), the rectangle is gridded, and the occurrence of an ACF within the grid is noted. The biological question of interest is whether these binary responses occur at random through the colon: if not, this suggests that the effect of environmental exposures is localized in different regions. Assuming that there are correlations in the locations of the ACF, the questions are how great are these correlations, and whether the correlation structures differ when an animal is exposed to radiation. Initially, we test for the existence of correlation. We derive the score test for conditionally autoregressive (CAR) correlation models, and show that this test arises as well from a modification of the score test for Matérn correlation models. Robust methods are used to lower the sensitivity to regions where there are few ACF. To understand the extent of the correlation, we cast the problem as a spatial binary regression, where binary responses arise from an underlying Gaussian latent process. The use of such latent processes in spatial problems has found widespread acceptance in public health, ecological research and environmental monitoring. Our data are clearly nonstationary, with marginal probabilities of disease depending strongly on the location within the colon: we model these marginal probabilities semiparametrically, using fixed-knot penalized regression splines and single-index models. We also believe that the underlying latent process is nonstationary, and we model this based on the convolution of latent local stationary processes. The dependency of the correlation function on location is also modeled semiparametrically. We fit the models using pairwise pseudolikelihood methods. Assuming that the underlying latent process is strongly mixing, known to be the case for many Gaussian processes, we prove asymptotic normality of the methods. The penalized regression splines have penalty parameters that must converge to zero asymptotically: we derive rates for these parameters that do and do not lead to an asymptotic bias, and we derive the optimal rate of convergence for them. Finally, we apply the methods to the data from our experiment.

Benjamin Bolstad, University of California, Berkeley.

Low-Level Analysis of High-Density Oligonucleotide Microarray Data

Microarray experiments are becoming widely used for many biomedical applications. After a brief introduction to the Affymetrix GeneChip microarray platform, I plan to describe how a gene expression measure might be constructed. Specifically, I will discuss a three-stage process: Background Adjustment, Normalization and Summarization. The focus will be on how each step affects the specificity and precision of the computed expression measure, as well as the ability to detect differential expression. A great deal of our discussion will be in the context of the Robust Multi-chip Average (RMA) expression measure. If time permits, I will briefly examine some extensions of the RMA method. In particular, I will present some preliminary results showing how probe-level models, rather than expression measures, may be used for differential expression detection.
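To make the normalization stage concrete, here is a minimal sketch of quantile normalization, a technique commonly associated with RMA. Whether the talk uses exactly this variant is an assumption; the function name and the toy matrix are hypothetical, and ties are handled in a simplified way.

```python
import numpy as np

def quantile_normalize(intensities):
    """Quantile-normalize the columns (arrays) of a probes-by-arrays matrix.

    Each column is replaced by the row means of the column-wise sorted
    matrix, placed back in that column's original rank order. Ties are
    broken by position rather than averaged, which is a simplification.
    """
    x = np.asarray(intensities, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")
    sorted_x = np.take_along_axis(x, order, axis=0)
    reference = sorted_x.mean(axis=1)  # common target distribution
    normalized = np.empty_like(x)
    # Put the k-th reference value where each column's k-th smallest was.
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized

# Hypothetical intensities: 3 probes measured on 2 arrays.
raw = [[5.0, 4.0], [2.0, 1.0], [3.0, 6.0]]
print(quantile_normalize(raw))
```

After this step every array shares the same empirical distribution, so remaining differences between arrays are in the ordering of probes rather than in overall scale.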

Xueli Liu, University of California, Los Angeles.

1) Modes and Clustering for Time-Warped Gene Expression Profile Data and 2) Random Forest-based Pre-validation Applied to Tissue Microarray Data


This talk consists of two parts. In the first part, I will talk about my Ph.D. thesis work. We propose a functional convex synchronization model, under the premise that each observed curve is the realization of a stochastic process. Monotonicity constraints on time evolution provide the motivation for a functional convex calculus with the goal of obtaining sample statistics such as a functional mean. We derive a functional limit theorem and asymptotic confidence intervals for functional convex means. This nonparametric time-synchronized algorithm is also combined with an iterative mean updating technique to find an overall representation that corresponds to a mode of a sample of gene expression profiles, viewed as a random sample in function space. In the second part, I will talk about novel statistical methods for the analysis of tissue microarray data. Tissue microarrays (TMAs) represent a high throughput tool for studying protein expression patterns in tissue specimens. In performing TMA analysis, the tissue is immunohistochemically stained and scored by a pathologist based on tumor marker staining scores. It is standard practice to select a single staining cutoff that stratifies the population based on an endpoint of interest. However, if the dichotomized staining score is included in a Cox model that uses the same outcome that was used to dichotomize the staining data, the significance of the biomarkers may be overstated. We introduce a new method (random forest pre-validation) that circumvents this bias problem. The idea is to summarize all staining scores into a single scalar M which can be used as a covariate in a Cox regression model. We demonstrate the use of this method to assess the prognostic significance of eight biomarkers for predicting survival in patients with renal cell carcinoma. Our proposed method avoids problems associated with multi-collinearity and over-fitting. We also carry out a cross-validation scheme to compare the predictive power of different prognostic models.

Annie Qu, Oregon State University

Semiparametric and nonparametric models for longitudinal data


Estimating equation approaches are useful for correlated data because the likelihood function is often unknown or intractable. However, estimating equation approaches lack (1) objective functions for selecting the correct root in multiple root problems, and (2) likelihood-type functions to produce inference functions. In this talk, a general description is given of the quadratic inference function approach, a semiparametric framework defined by a set of mean zero estimating functions, but differing from the standard estimating function approach in that there are more equations than the number of unknown parameters. The quadratic inference function method provides efficient and robust estimation of parameters in longitudinal data settings, and inference functions for testing. Further, an efficient estimator using a nonparametric regression spline is developed, and a goodness-of-fit test is introduced. The asymptotic chi-squared test is also useful for testing whether coefficients in nonparametric regression are time-varying or time invariant.

Matt Gregas, University of Minnesota

Nonparametric Intensity Curve Estimation with Applications to Neuroscience


Motivated by a neuroscience experiment which observes spike trains from the primary motor cortex of Macaca mulatta (rhesus monkey), we develop methods for estimating the intensity function of a Poisson point process corresponding to a single spike train, and for estimating families of intensity functions that have a common (unknown) shape or amplitude. Additionally, we provide tests for a breakpoint in an intensity function at a given location. These methods are based on local likelihood smoothing. Asymptotic properties of the intensity estimate and test statistics for breakpoints are discussed. We also present results from simulation studies which describe the power and actual significance levels of our tests. Estimates for families of intensity functions build on Functional Data Analysis methodology, but extend beyond the current procedures. In particular, our methods do not require that the point process be observed on the full support of the intensity function for each member of the family. We show that for this case, local likelihood methodology corresponds to using a local polynomial fit with adjusted kernel weights.
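As a simpler stand-in for the local likelihood smoother described above, the sketch below estimates an intensity function by summing Gaussian kernels centered at the spike times. This is not the author's method: the function name, bandwidth, and spike times are hypothetical, and no edge correction or breakpoint handling is attempted.

```python
import numpy as np

def kernel_intensity(spike_times, grid, bandwidth):
    """Gaussian-kernel estimate of a point-process intensity on a grid.

    Sums one Gaussian bump per spike, so the estimate integrates to
    roughly the number of spikes (edge effects are not corrected).
    """
    spikes = np.asarray(spike_times, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - spikes[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.sum(axis=1)

# Hypothetical spike train on the unit interval.
grid = np.linspace(0.0, 1.0, 1001)
intensity = kernel_intensity([0.4, 0.5, 0.6], grid, bandwidth=0.02)
```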

Eugene Huang, Fred Hutchinson Cancer Research Center

Analysis of Censored Lifetime Medical Cost

Cost assessment is an important component in comprehensive medical treatment evaluation. It has now become an accepted and often required adjunct to standard safety and efficacy assessment. However, its statistical analysis is challenged by irregular cost distribution and by incomplete follow-up, particularly the latter with limited study duration as is typical in practice. In this talk, I address the regression analysis by modeling not only lifetime medical cost but also survival time, in a semi-parametric fashion. Both outcomes, on possibly transformed scales, linearly relate to the covariates; however, the bivariate model error distribution is unspecified. With this bivariate generalization of the accelerated failure time model, I propose an inference procedure by extending the weighted log-rank estimating function to the marked point process framework. In addition, I suggest a novel sample-based variance estimation procedure for estimators based on non-smooth estimating functions. This proposal is applied to a recent lung cancer trial. Further developments as well as on-going research with alternative semi-parametric modeling will also be described.

Daowen Zhang, North Carolina State University

Assessing the effect of reproductive hormone profile on bone mineral density using functional two-stage mixed models

In the Study of Women's Health Across the Nation (SWAN), total hip bone mineral density (BMD) was measured together with repeated measures of the levels of creatinine-adjusted follicle stimulating hormone (FSH) collected daily in urine over one menstrual cycle on more than 600 pre- and perimenopausal women. It was of scientific interest to investigate the effect of the FSH time profile in a menstrual cycle on the total hip BMD, adjusting for age and body mass index. The statistical analysis is challenged by several features of the data. (1) The covariate FSH is measured longitudinally and its effect on the scalar outcome BMD is complex. (2) Due to varying menstrual cycle lengths, women have unbalanced longitudinal measures of FSH. (3) The longitudinal measures of FSH are subject to considerable among- and within-woman variations and measurement errors. We propose a measurement error partial functional linear model, where repeated measures of FSH are modeled using functional mixed effects models and the effect of the FSH time profile on BMD is modeled using a partial functional linear model by treating the unobserved true woman-specific FSH time profile as a functional covariate. We develop a two-stage estimation procedure using periodic smoothing splines. Using the connection between smoothing splines and mixed models, a key feature of our approach is that estimation at both stages can be conveniently cast into a unified mixed model framework. A simple test for constant functional covariate effect is also proposed. The proposed method is evaluated using simulation studies and applied to the SWAN data. This is joint work with Xihong Lin and Mary Fran Sowers of The University of Michigan.

David Dahl, University of Wisconsin

Conjugate Dirichlet Process Mixture Models: Gene Expression, Efficient Sampling, and Clustering

This talk proposes a novel conjugate Dirichlet process mixture (DPM) model for the analysis of gene expression data, introduces a new MCMC sampling algorithm for fitting general conjugate DPM models, and describes a quick mode-finding algorithm for clustering in a particular class of conjugate DPM models. Since biologists are typically interested in expression patterns over a variety of treatment conditions, the proposed model clusters genes having similar patterns of expression (instead of similar levels of expression) and naturally incorporates any number of treatment conditions. Further, hypotheses are easily tested and false discovery rates are readily estimated. The second part of the talk addresses formidable computational issues arising in the use of DPM models by introducing a new MCMC sampling algorithm for any (not just the gene expression model) conjugate DPM model. Simulations indicate that the proposed sampler can be significantly faster than existing methods. The new algorithm is a merge-split sampler which uses ideas similar to those in sequential importance sampling. Finally, in the case of two treatment conditions, a very quick clustering algorithm is introduced which is guaranteed to find the mode of the posterior clustering distribution in a class of conjugate DPM models. Pre-prints are available at http://www.stat.wisc.edu/~dbdahl.

Karen Kafadar, University of Colorado, Denver.

Effect of Length Biased Sampled Sojourn Times on the Survival Distribution from Screen-Detected Diseases

Screen-detected cases form a length-biased sample among all cases of disease, since longer sojourn times afford more opportunity to be screen-detected. In contrast to the usual length-biased sampling situation, however, the length-biased sojourns (preclinical durations) are never observed, but their subsequent clinical durations are. We investigate the effect of length-biased sampling of sojourn times on the distribution (mainly, the mean and variance) of the observed clinical durations. We show that, when preclinical and clinical durations are positively correlated, the mean clinical duration can be substantially inflated, even in the absence of any benefit on survival from the screening procedure. Consequently, the mean survival among cases in the screen-detected arm of a randomized screening trial will be longer than that among interval cases or cases that arise in the control arm, simply because of the length-bias phenomenon. We briefly discuss issues related to estimating the inflation. (This work was performed in collaboration with Philip C. Prorok while Dr. Kafadar was a Guest Researcher in the Biometry Branch.)
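The inflation can be seen in a small arithmetic sketch. It assumes that a case's chance of being screen-detected is exactly proportional to its sojourn time, which is an idealization. The population of two case types and the function name are hypothetical.

```python
def length_biased_mean(clinical, sojourn):
    """Mean clinical duration when cases are sampled with probability
    proportional to their (unobserved) sojourn time."""
    total_weight = sum(sojourn)
    return sum(s * c for s, c in zip(sojourn, clinical)) / total_weight

# Hypothetical population of two equally common case types in which
# clinical duration increases with sojourn time.
sojourn = [1.0, 3.0]
clinical = [2.0, 6.0]

plain_mean = sum(clinical) / len(clinical)           # all cases weighted equally
biased_mean = length_biased_mean(clinical, sojourn)  # screen-detected cases
print(plain_mean, biased_mean)
```

Here the long-sojourn type is three times as likely to be screen-detected, so its longer clinical duration dominates the screen-detected average even though screening changes nobody's survival.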

Ron Randles, University of Florida.

Multivariate Nonparametric Tests

Affine invariant spatial sign and rank tests and related location estimates are described for the one-sample and several-samples multivariate location problems and for testing for independence. These methods generalize the classical methods for univariate problem settings. They are easy to compute for data in common dimensions and have excellent asymptotic relative efficiencies.
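For readers unfamiliar with spatial signs, the sketch below computes them and a basic one-sample statistic for testing location zero. It is a simplified variant that is not affine invariant; the tests in the talk would additionally standardize the data, and that step is omitted. The function names are hypothetical. Under a spherically symmetric null, the statistic is approximately chi-squared with p degrees of freedom.

```python
import numpy as np

def spatial_signs(x):
    """Map each row to its unit vector x / ||x||; zero rows stay zero."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return np.divide(x, norms, out=np.zeros_like(x), where=norms > 0)

def spatial_sign_statistic(x):
    """Return n * p * ||mean spatial sign||^2 for an n-by-p data matrix.

    Large values suggest the data are not centered at the origin. This
    version is not affine invariant.
    """
    signs = spatial_signs(x)
    n, p = signs.shape
    mean_sign = signs.mean(axis=0)
    return float(n * p * (mean_sign @ mean_sign))

# Hypothetical bivariate sample lying entirely on one side of the origin.
print(spatial_sign_statistic([[2.0, 0.0], [3.0, 0.0]]))
```

In one dimension the spatial sign reduces to the ordinary sign, which is the sense in which such methods generalize the classical univariate sign test.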

Somesh Chattopadhyay, Florida State University.

Survival models for cardiovascular diseases

In the survival analysis of cardiovascular diseases the cause of death may be one or more of several reasons. Also there are different states of a patient from which the patient can move only to a few other states. For example, congestive heart failure surely leads to death and the patient cannot go back to normal state. Some of the conditions of the patient can lead to other conditions. Some conditions of the patient may recur as well. In this situation times to different events are correlated. It is important to investigate the relationship between different causes for death and different types of events and their joint impact on the patient. In this setup, a competing risks and multi-state model with correlated failure times is appropriate. In this talk I will discuss modeling of cardiovascular disease data and investigate the relationship between different causes of death. Data from the Framingham study will be used.

Robert Hogg, University of Iowa.

From Independence to Adaptive Nonparametric Methods

We first consider some "old" methods of spotting independent statistics. One of these, Basu's Theorem, depends upon complete sufficient statistics for parameters but can be used in a nonparametric setting. It is from this use that it is easy to construct adaptive nonparametric tests that have much better power than do the traditional tests over a wide range of underlying distributions. Adaptive distribution-free classification is also considered. [The author reminds the audience that he is old and may not know what he is talking about; but if he did, he would be right!]

Cheolwoo Park, Statistical and Applied Mathematical Sciences Institute.

Wavelet and SiZer Analyses of Internet Traffic Data: An Overview of SAMSI Research

It is important to characterize the burstiness of Internet traffic and to find its causes in order to build models that can mimic real traffic. To achieve this goal, exploratory analysis tools and statistical tests are needed, along with new models for aggregated traffic. This talk introduces statistical tools based on wavelets and SiZer (SIgnificance of ZERo crossings of the derivative). The intricate fluctuations of Internet traffic are explored in various respects and lessons from real data analyses are summarized.