Semiparametric regression analysis of mean residual life with censored survival data
As a function of time t, mean residual life is the remaining life
expectancy of a subject given survival up to t. The proportional mean
residual life model, proposed by Oakes & Dasu (1990), provides an
alternative to the Cox proportional hazards model to study the
association between survival times and covariates. In the presence of
censoring, we develop semiparametric inference procedures for the
regression coefficients of the Oakes-Dasu model using martingale
theory for counting processes. We also present simulation studies and
an application to the Veterans' Administration lung cancer data. This
is joint work with Y. Q. Chen.
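For reference, in standard notation (not given in the abstract), the mean residual life function and the Oakes-Dasu proportional mean residual life model are

```latex
m(t) = E(T - t \mid T > t), \qquad
m(t \mid Z) = m_0(t)\exp(\beta^{\top} Z),
```

where $m_0$ is an unspecified baseline mean residual life function and $Z$ is the covariate vector.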
Seminar page
Methods For Estimating The Spatial Average Over An Irregularly-Shaped Study Region
Using data collected from fishery-independent surveys in the Chesapeake Bay (eastern U.S.), we compare several methods for estimating relative abundance from catch-per-unit-effort (CPUE) data over a study area that is irregular in shape. The methods are: an approximation to block kriging, approximate block kriging in the presence of trend, and design-based estimation based on stratified multistage cluster sampling. We describe a method for estimating the spatial average and its standard error using an approximation to block kriging that incorporates a trend component. What distinguishes this work from universal block kriging is the potential use of covariates other than the spatial indices common in universal kriging, and the use of block kriging over an irregular shape. We show that the kriging error for the spatial mean based on the new method is lower than the variance of the design-based estimator. The method is general and can be applied in other similar situations.
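The averaging idea behind approximate block kriging can be sketched as follows; this is a minimal illustration with an assumed exponential covariance and simulated data, not the authors' estimator (which also handles trend and survey design):

```python
import numpy as np

def exp_cov(d, sill=1.0, scale=2.0):
    """Assumed exponential covariance model (the talk's model is not stated)."""
    return sill * np.exp(-d / scale)

def krige_spatial_average(obs_xy, obs_z, grid_xy):
    """Approximate block kriging: predict the block (spatial) mean by
    simple kriging at each grid point inside the region, then averaging."""
    mu = obs_z.mean()  # plug-in mean for simple kriging
    d_oo = np.linalg.norm(obs_xy[:, None] - obs_xy[None], axis=-1)
    C = exp_cov(d_oo) + 1e-8 * np.eye(len(obs_z))    # obs-obs covariance
    d_og = np.linalg.norm(obs_xy[:, None] - grid_xy[None], axis=-1)
    c = exp_cov(d_og)                                # obs-grid covariances
    w = np.linalg.solve(C, c)                        # kriging weights
    preds = mu + w.T @ (obs_z - mu)                  # pointwise predictions
    return preds.mean()  # discrete approximation to the block average

rng = np.random.default_rng(0)
obs_xy = rng.uniform(0, 10, size=(40, 2))               # survey stations
obs_z = np.sin(obs_xy[:, 0]) + rng.normal(0, 0.1, 40)   # CPUE-like response
grid_xy = rng.uniform(0, 10, size=(200, 2))             # grid over the region
print(krige_spatial_average(obs_xy, obs_z, grid_xy))
```

An irregular study region enters only through which grid points are included in `grid_xy`, which is what makes this averaging approach workable for a non-rectangular area such as the Chesapeake Bay.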
Estimation for Generalized Linear Models When Covariates
Are Subject-specific
Parameters in a Mixed Model for Longitudinal Measurements
The relationship between a primary endpoint and features of longitudinal
profiles of a continuous response is often of interest. One challenge is that
the features of the longitudinal profiles are observed only through the
longitudinal measurements, which are subject to measurement error and other
variation. A relevant framework assumes that the longitudinal data follow a
linear mixed model whose random effects are covariates in a generalized linear
model for the primary endpoint. Methods proposed in literature require a
parametric (normality) assumption on the random effects, which may be
unrealistic. We propose a conditional likelihood approach which requires no
assumptions on the random effects and a semiparametric full likelihood
approach which requires only the assumption that the random effects have a
smooth density. The conditional likelihood approach is straightforward and
fast to implement. An EM algorithm is generally used to implement the
semiparametric full likelihood approach, which involves a greater
computational burden. Simulation results show that, in contrast to methods
predicated on a parametric (normality) assumption for the random effects, the
approaches yield valid
inferences under departures from this assumption and are competitive when the
assumption holds. The semiparametric full likelihood approach shows some
efficiency gains over the other methods and provides estimates for the
underlying random effects distribution. We also illustrate the performance of
the approaches by application to a study of bone mineral density and
longitudinal progesterone levels in 624 women transitioning to menopause in
which investigators wished to understand the association between osteopenia,
characterized by bone mineral density at or below the
33rd percentile, and features of hormonal patterns over the menstrual cycle in
peri-menopausal women. Data analysis results obtained from the approaches
offer the analyst assurance of credible estimation of the relationship.
A Bayesian Statistical Calibration Accounting for Measurement
Error in the Reference Device
We use a Bayes hierarchical model to enable the statistical calibration of a set of mass-produced devices when the reference instrument does not measure the true state exactly, but rather with additive Gaussian error. We show how to arrive, via MCMC, at estimates of the calibration parameters for the tested devices, at the prediction of the same for an untested device, and at predictions of the true state given an observation from a tested and from an untested device. A real example of a calibration experiment involving 12 resistance thermocouple devices (RTDs) and a NIST-approved accurate thermometer is provided.
Combining Information from Multiple Surveys for Small Area Estimation: A
Bayesian Approach
Cancer surveillance research requires accurate and precise estimates of
the prevalence of cancer risk factors and screening for small areas such
as counties. Two popular data sources are the Behavioral Risk Factor
Surveillance System (BRFSS) and the National Health Interview Survey
(NHIS). Both data sources have advantages and disadvantages. The BRFSS is
the larger survey, and almost every county is included in it; but it has
lower response rates and, being a telephone survey, it does not include
subjects who live in households with no telephones. On the other hand, the
NHIS is a smaller survey, with the majority of counties not included; but
it includes both telephone and non-telephone households and has higher
response rates. A preliminary analysis shows that the distributions of
cancer screening and risk factors are different for telephone and
non-telephone households. Thus, information from the two surveys may have
to be combined to address both nonresponse and noncoverage errors. A
hierarchical Bayesian approach is used to combine information from both
surveys to construct county-level estimates. The proposed model
incorporates potential noncoverage and nonresponse biases in the BRFSS as
well as complex sample design features of both surveys. A Markov Chain
Monte Carlo method is used to simulate draws from the joint posterior
distribution of unknown quantities in the model based on the design-based
direct estimates and county-level covariates. Yearly county-level
prevalence estimates for 50 states (including D. C.), and the whole state
of Alaska, are developed for ten outcomes using BRFSS and NHIS data from
1997-2000. The outcomes include smoking and common cancer screening
procedures. The NHIS/BRFSS combined county estimates are substantially
different from those based on BRFSS alone.
Statistical methods for sample classification and prediction with
microarray gene expression data
Using gene expression data to classify sample types or predict patient
survival has received much research attention recently. To accommodate special
features of gene expression data, several new methods have been
proposed, including a weighted voting scheme of Golub et al
(1999), a compound covariate method of Hedenfalk et al (2001)
(originally proposed by Tukey (1993)), and a shrunken centroids
method of Tibshirani et al (2002). These methods look different
and are more or less ad hoc. Here we point out a close
connection of the three methods with a linear regression model
and partial least squares (PLS). Under the general framework of
PLS, we propose a penalized PLS (PPLS) method that can handle
both categorical (for classification) and continuous (e.g. survival
times) responses. Using real data, we show the competitive performance
of our proposal when compared with other methods.
This is joint work with Wei Pan (Biostatistics, U of
Minnesota) and Jennifer Hall (Medicine, U of Minnesota).
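The PLS connection can be made concrete with a one-component fit on simulated data; this is plain PLS via the covariance direction, a sketch of the shared framework rather than the proposed penalized PPLS:

```python
import numpy as np

def pls1_fit(X, y):
    """One-component PLS: the weight vector is the direction of maximal
    covariance between (centered) X and y, then y is regressed on the score."""
    xm, ym = X.mean(0), y.mean()
    Xc, yc = X - xm, y - ym
    w = Xc.T @ yc
    w /= np.linalg.norm(w)        # PLS weight vector
    t = Xc @ w                    # latent score
    b = (t @ yc) / (t @ t)        # regression of response on the score
    return xm, ym, w, b

def pls1_predict(model, Xnew):
    xm, ym, w, b = model
    return ym + b * ((Xnew - xm) @ w)

rng = np.random.default_rng(1)
n, p = 60, 100                    # "large p, small n" as in expression data
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(float)
model = pls1_fit(X, y)
yhat = (pls1_predict(model, X) > 0.5).astype(float)
print("training accuracy:", (yhat == y).mean())
```

A 0/1-coded class label stands in for the categorical response here; for survival times the same machinery applies to a continuous (possibly censored-data-adjusted) response, which is the unification the talk exploits.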
Randomization Inference in Covariance Adjustment in a Randomized Controlled
Trial with Incomplete Longitudinal Data
A recent proposal for randomization inference in covariance adjustment offers
the option to control for baseline imbalances with various regression methods
while preserving the framework of a randomization test. When the method is
applied to a randomized controlled trial, adjusting for baseline differences
yields narrower confidence intervals. The method will be illustrated in a
clinical trial of treatments following childhood cancer, with incomplete
longitudinal data from a heavy-tailed multivariate distribution.
Marginal Regression Modeling of Longitudinal, Categorical Response Data
Longitudinal regression analysis is important in a variety of settings when
the goal is to characterize changes that occur over time. The focus of this
talk is on marginal regression models for longitudinal, categorical response
data. I will first discuss a consistency-efficiency tradeoff with
semi-parametric modeling when the goal is to estimate the cross-sectional
relationship between the response and an exposure E[Y(t) | X(t)]. Next, I
will describe the "marginalized" model class which permits likelihood-based
estimation of marginal regression parameters. I will extend this class to
accommodate response dependence that I have seen with long series of response
data (the functional form of response dependence has both serial and
long-range components). Finally, I will discuss prospective inference with
outcome dependent sampling. One situation where such a sampling scheme might
be important is in a study where interest is in estimating the relationship
between a response and a time-varying exposure, the exposure is expensive to
measure, and a number of subjects exhibited no response variation during the
study period (e.g., never had symptoms). With this sampling design, under
certain conditions, we are able to make valid inference that is efficient when
we exclude subjects without response variation as long as we account for
the covariate ascertainment mechanism.
Semiparametric Spatial Modeling of Binary Outcomes,
With Application to Aberrant Crypt Foci in Colon Carcinogenesis Experiments
Our work is directed towards the analysis of aberrant crypt
foci (ACF) in colon carcinogenesis.
ACF are morphologically changed colonic crypts that are known to be
precursors of colon cancer development. In our experiment, all animals
were exposed to a carcinogen, and some were exposed to radiation.
The colon is laid out as a rectangle, much longer than it is wide
(hence the longitudinal aspect),
the rectangle is gridded, and the occurrence of an ACF within the grid
is noted. The biological question of interest is whether these binary
responses occur at random through the colon: if not, this suggests
that the effect of environmental exposures is localized in
different regions. Assuming that there are correlations in the
locations of the ACF, the questions are how great are these correlations,
and whether the correlation structures differ when an animal is exposed
to radiation.
Initially, we test for the existence of correlation. We derive the
score test for conditionally autoregressive (CAR) correlation models, and
show that this test arises as well from a modification of the
score test for Matérn correlation models. Robust methods are used
to lower the sensitivity to regions where there are few ACF.
To understand the extent of the correlation, we cast the
problem as a spatial binary regression,
where binary responses arise from an underlying Gaussian latent
process. The use of such latent processes in spatial
problems has found widespread acceptance in
public health, ecological research and
environmental monitoring. Our data are clearly nonstationary,
with marginal probabilities of disease depending strongly on the
location within the colon: we model these marginal probabilities
semiparametrically, using fixed-knot penalized regression splines
and single-index models.
We also believe that the underlying latent process is nonstationary,
and we model this based on the convolution of latent local
stationary processes. The dependency of the correlation
function on location is also modeled semiparametrically.
We fit the models using pairwise pseudolikelihood methods.
Assuming that the underlying latent process is strongly mixing, known
to be the case for many Gaussian processes,
we prove asymptotic normality of the methods. The penalized regression
splines have penalty parameters that must converge to zero asymptotically:
we derive rates for these parameters that do and do not lead to
an asymptotic bias, and we derive the optimal rate
of convergence for them. Finally, we apply the methods to the data
from our experiment.
Low-Level Analysis of High-Density Oligonucleotide Microarray
Data
Microarray experiments are becoming widely used for many biomedical
applications. After a brief introduction to the Affymetrix GeneChip
microarray platform, I plan to describe how a gene expression measure
might be constructed. Specifically, I will discuss a three-stage
process: Background Adjustment, Normalization and Summarization.
The focus will be on how each step affects the specificity and
precision of the computed expression measure, as well as the ability
to detect differential expression. A great deal of our discussion will
be in the context of the Robust Multi-chip Average (RMA) expression
measure.
If time permits, I will briefly examine some extensions of the RMA
method. In particular, I will present some preliminary results showing
how
probe-level models, rather than expression measures, may be used for
differential expression detection.
1) Modes and Clustering for Time-Warped Gene Expression Profile Data
and 2) Random Forest-based Pre-validation Applied to Tissue
Microarray Data
This talk comprises two parts.
In the first part, I will talk about my Ph.D. thesis work. We propose a
functional convex synchronization model, under the premise that each
observed curve is the realization of a stochastic process. Monotonicity
constraints on time evolution provide the motivation for a functional convex
calculus with the goal of obtaining sample statistics such as a functional
mean. We derive a functional limit theorem and asymptotic confidence
intervals for functional convex means. This nonparametric time-synchronized
algorithm is also combined with an iterative mean updating technique to find
an overall representation that corresponds to a mode of a sample of gene
expression profiles, viewed as a random sample in function space.
In the second part, I will talk about novel statistical methods for the
analysis of tissue microarray data. Tissue microarrays (TMAs) represent a
high throughput tool for studying protein expression patterns in tissue
specimens. In performing TMA analysis, the tissue is immunohistochemically
stained and scored by a pathologist based on tumor marker staining scores.
It is standard practice to select a single staining cutoff that stratifies
the population based on an endpoint of interest. However, if the
dichotomized staining score is included in a Cox model that uses the same
outcome that was used to dichotomize the staining data, the significance of
the biomarkers may be overstated.
We introduce a new method (random forest pre-validation) that circumvents
this bias problem. The idea is to summarize all staining scores into a
single scalar M which can be used as covariate in a Cox regression model. We
demonstrate the use of this method to assess the prognostic significance of
eight biomarkers for predicting survival in patients with renal cell
carcinoma. Our proposed method avoids problems associated with
multi-collinearity and over-fitting. We also carry out a cross-validation
scheme to compare the predictive power of different prognostic models.
Semiparametric and nonparametric models for longitudinal data
Estimating equation approaches are useful for correlated data because
the likelihood function is often unknown or intractable. However,
estimating equation approaches lack (1) objective functions for
selecting the correct root in multiple root problems,
and (2) likelihood-type functions to produce inference functions.
In this talk, a general description is given of the quadratic inference
function approach, a semiparametric framework defined
by a set of mean zero estimating functions, but differing from the
standard estimating function approach in that there are more equations
than the number of unknown parameters. The quadratic inference
function method provides efficient and robust estimation of parameters
in longitudinal data settings, and inference functions for testing.
Further, an efficient estimator using a nonparametric regression
spline is developed, and a goodness-of-fit test is introduced. The
asymptotic chi-squared test is also useful for testing whether
coefficients in nonparametric regression are time-varying
or time invariant.
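In standard notation (not from the abstract), with $g_i(\beta)$ stacking the extended set of estimating functions for subject $i$, the quadratic inference function is

```latex
\bar{g}_N(\beta) = \frac{1}{N}\sum_{i=1}^{N} g_i(\beta), \qquad
Q_N(\beta) = N\,\bar{g}_N(\beta)^{\top}
\Big(\frac{1}{N}\sum_{i=1}^{N} g_i(\beta)\,g_i(\beta)^{\top}\Big)^{-1}
\bar{g}_N(\beta),
```

and $\hat\beta = \arg\min_\beta Q_N(\beta)$. Because $Q_N$ behaves like minus twice a log-likelihood, it supplies both an objective function for root selection and an inference function for testing.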
Nonparametric Intensity Curve Estimation with Applications to
Neuroscience
Motivated by a neuroscience experiment which observes spike trains
from the primary motor cortex of Macaca mulatta (rhesus monkey), we
develop methods for estimating the intensity function of a Poisson
point process corresponding to a single spike train, and for
estimating families of intensity functions that have a common
(unknown) shape or amplitude. Additionally, we provide tests for a
breakpoint in an intensity function at a given location. These
methods are based on local likelihood smoothing. Asymptotic
properties of the intensity estimate and test statistics for
breakpoints are discussed. We also present results from simulation
studies which describe the power and actual significance levels of our
tests. Estimates for families of intensity functions build on
Functional Data Analysis methodology, but extend beyond the current
procedures. In particular, our methods do not require that the point
process be observed on the full support of the intensity function for
each member of the family. We show that for this case, local
likelihood methodology corresponds to using a local polynomial fit
with adjusted kernel weights.
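A degree-zero version of the intensity estimator can be sketched with a Gaussian kernel; this is a plain kernel estimate on simulated data, the simplest special case of the local-likelihood (local polynomial) smoothing used in the talk:

```python
import numpy as np

def intensity_estimate(spikes, grid, h=0.05):
    """Kernel (local-constant) estimate of a Poisson process intensity:
    lambda_hat(t) = sum_i K_h(t - t_i), with a Gaussian kernel K_h."""
    u = (grid[:, None] - spikes[None, :]) / h
    K = np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h)
    return K.sum(axis=1)

rng = np.random.default_rng(3)
lam = lambda t: 50.0 + 40.0 * np.sin(2.0 * np.pi * t)  # true intensity on [0,1]
# Simulate one spike train by thinning a rate-90 homogeneous process.
cand = rng.uniform(0.0, 1.0, rng.poisson(90))
spikes = cand[rng.uniform(0.0, 90.0, cand.size) < lam(cand)]
grid = np.linspace(0.0, 1.0, 101)
est = intensity_estimate(spikes, grid)
print(est.max())
```

Replacing the kernel sum with a locally weighted Poisson log-likelihood, maximized by a low-degree polynomial at each grid point, gives the local-likelihood estimator the talk analyzes.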
Analysis of Censored Lifetime Medical Cost
Cost assessment is an important component in comprehensive medical treatment
evaluation. It has now become an accepted and often required adjunct to
standard safety and efficacy assessment. However, its statistical analysis is
challenged by irregular cost distributions and by incomplete follow-up,
particularly the latter, given the limited study durations typical in practice.
In this talk, I address the regression analysis by modeling not only lifetime
medical cost but also survival time, in a semi-parametric fashion. Both
outcomes, on possibly transformed scales, linearly relate to the covariates;
however, the bivariate model error distribution is unspecified. With this
bivariate generalization of the accelerated failure time model, I propose an
inference procedure by extending the weighted log-rank estimating function to
the marked point process framework. In addition, I suggest a novel
sample-based variance estimation procedure for estimators based on non-smooth
estimating functions. This proposal is applied to a recent lung cancer trial.
Further developments as well as on-going research with alternative
semi-parametric modeling will also be described.
Assessing the effect of reproductive hormone profile on bone mineral
density using functional two-stage mixed models
In the Study of Women's Health Across the Nation (SWAN), total hip bone
mineral density (BMD) was measured together with repeated measures of
the levels of creatinine-adjusted follicle stimulating hormone (FSH)
collected daily in urine over one menstrual cycle on more than 600 pre-
and perimenopausal women. It was of scientific interest to investigate
the effect of the FSH time profile in a menstrual cycle on the total
hip BMD, adjusting for age and body mass index. The statistical analysis
is challenged by several features of the data. (1) The covariate FSH is
measured longitudinally and its effect on the scalar outcome BMD is
complex.
(2) Due to varying menstrual cycle lengths, women have unbalanced
longitudinal measures of FSH. (3) The longitudinal measures of FSH are
subject to considerable among- and within-woman variations and measurement
errors. We propose a measurement error partial functional linear model,
where
repeated measures of FSH are modeled using functional mixed effects models
and the effect of the FSH time profile on BMD is modeled using a partial
functional linear model by treating the unobserved true woman-specific FSH
time profile as a functional covariate. We develop a two-stage estimation
procedure using periodic smoothing splines. Using the connection between
smoothing splines and mixed models, a key feature of our approach is
that estimation at both stages can be conveniently cast into
a unified mixed model framework. A simple test for constant functional
covariate effect is also proposed. The proposed method is evaluated using
simulation studies and applied to the SWAN data.
This is joint work with Xihong Lin and Mary Fran Sowers of The
University of Michigan.
Conjugate Dirichlet Process Mixture Models: Gene Expression,
Efficient Sampling, and Clustering
This talk proposes a novel conjugate Dirichlet process mixture (DPM)
model for the analysis of gene expression data, introduces a new MCMC
sampling algorithm for fitting general conjugate DPM models, and
describes a quick mode-finding algorithm for clustering in a particular
class of conjugate DPM models. Since biologists are typically interested
in expression patterns over a variety of treatment conditions, the
proposed model clusters genes having similar patterns of expression
(instead of similar levels of expression) and naturally incorporates any
number of treatment conditions. Further, hypotheses are easily tested
and false discovery rates are readily estimated. The second part of the
talk addresses formidable computational issues arising in the use of DPM
models by introducing a new MCMC sampling algorithm for any (not just
the gene expression model) conjugate DPM model. Simulations indicate
that the proposed sampler can be significantly faster than existing
methods. The new algorithm is a merge-split sampler which uses ideas
similar to those in sequential importance sampling. Finally, in the case
of two treatment conditions, a very quick clustering algorithm is
introduced which is guaranteed to find the mode of the posterior
clustering distribution in a class of conjugate DPM models. Pre-prints
are available at
http://www.stat.wisc.edu/~dbdahl.
Effect of Length Biased Sampled Sojourn Times on the Survival Distribution
from Screen-Detected Diseases
Screen-detected cases form a length-biased sample among all cases of disease,
since longer sojourn times afford more opportunity to be screen-detected. In
contrast to the usual length-biased sampling situation, however, the
length-biased sojourns (preclinical durations) are never observed, but their
subsequent clinical durations are. We investigate the effect of length-biased
sampling of sojourn times on the distribution (mainly, the mean and variance)
of the observed clinical durations. We show that, when preclinical and
clinical durations are positively correlated, the mean clinical duration can
be substantially inflated --- even in the absence of any benefit on survival
from the screening procedure. Consequently, the mean survival among cases in
the screen-detected arm of a randomized screening trial will be longer than
that among interval cases or cases that arise in the control arm, simply
because of the length-bias phenomenon. We briefly discuss issues related to
estimating the inflation.
(This work was performed in collaboration with Philip C. Prorok while Dr.
Kafadar was a Guest Researcher in the Biometry Branch.)
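For a sense of the magnitude of length bias in the standard (fully observed) setting: sampling a duration $X$ with density $f$, mean $\mu$, and variance $\sigma^2$ with probability proportional to its length gives

```latex
f^{*}(x) = \frac{x\,f(x)}{\mu}, \qquad
E^{*}(X) = \frac{E(X^{2})}{E(X)} = \mu\left(1 + \frac{\sigma^{2}}{\mu^{2}}\right),
```

so the sampled mean is inflated by one plus the squared coefficient of variation. In the screening setting it is the unobserved sojourn times that are length-biased, and the observed clinical durations inherit the inflation through their positive correlation with them.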
Multivariate Nonparametric Tests
Affine invariant spatial sign and rank tests and related location estimates are described for the
one-sample and several-samples multivariate location problems and for testing for independence. These methods generalize the classical methods for univariate problem settings. They are easy to compute for data in common dimensions and have excellent asymptotic relative efficiencies.
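The building block of these tests is the spatial sign $U(x) = x/\lVert x\rVert$; a minimal one-sample sketch on simulated heavy-tailed data follows. This version omits the inner standardization that makes the talk's tests affine invariant:

```python
import numpy as np

def spatial_signs(X):
    """Spatial sign U(x) = x / ||x|| of each row."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def sign_test_stat(X):
    """One-sample spatial sign statistic for testing location = 0;
    approximately chi^2 with p degrees of freedom under the null."""
    n, _ = X.shape
    U = spatial_signs(X)
    ubar = U.mean(axis=0)
    A = (U.T @ U) / n                 # estimated sign covariance matrix
    return float(n * ubar @ np.linalg.solve(A, ubar))

rng = np.random.default_rng(4)
X0 = rng.standard_t(df=3, size=(200, 3))     # heavy tails, null is true
X1 = X0 + np.array([1.0, 0.0, 0.0])          # shifted alternative
print(sign_test_stat(X0), sign_test_stat(X1))
```

Because only the directions of the observations enter, the statistic is insensitive to heavy tails, which is the source of the favorable asymptotic relative efficiencies mentioned above.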
Survival models for cardiovascular diseases
In the survival analysis of cardiovascular diseases, death may result from
one or more of several causes. In addition, a patient occupies one of several
states and can move from it to only a few other states. For example,
congestive heart failure ultimately leads to death, and the patient cannot
return to the normal state. Some conditions of the patient can lead to other
conditions, and some conditions may recur. In this situation, times to
different events are
correlated. It is important to investigate the relationship between
different causes for death and different types of events and their joint
impact on the patient. In this setup, a competing risks and multi-state
model with correlated failure times is appropriate. In this talk I will
discuss modeling of cardiovascular disease data and investigate the
relationship between different causes of death. Data from the Framingham
study will be used.
From Independence to Adaptive Nonparametric Methods
We first consider some "old" methods of spotting independent statistics.
One of these, Basu's Theorem, depends upon complete sufficient statistics
for parameters but can be used in a nonparametric setting. It is from this
use that it is easy to construct adaptive nonparametric tests that have much
better power than the traditional tests over a wide range of underlying
distributions. Adaptive distribution-free classification is also considered.
[The author reminds the audience that he is old and may not know what he is
talking about; but if he did, he would be right!]
Wavelet and SiZer Analyses of Internet Traffic Data: An
Overview of SAMSI Research
It is important to characterize the burstiness of Internet
traffic and find its causes in order to build models that can mimic real
traffic. To achieve this goal, exploratory analysis tools and
statistical tests are needed, along with new models for aggregated
traffic. This talk introduces statistical tools based on wavelets and
SiZer (SIgnificance of ZERo crossings of the derivative). The
intricate fluctuations of Internet traffic are explored in various
respects and lessons from real data analyses are summarized.
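A wavelet diagnostic of the kind such analyses build on can be sketched with the Haar transform: plotting log2 detail energy against scale (the "logscale diagram") reveals long-range dependence through a sloped profile, while white noise gives a flat one. This simulated sketch uses white noise, not real traffic, and is not SiZer itself:

```python
import numpy as np

def haar_energies(x, levels=6):
    """Mean squared Haar detail coefficients at each dyadic scale."""
    x = np.asarray(x, float)
    energies = []
    for _ in range(levels):
        n = len(x) // 2 * 2
        pairs = x[:n].reshape(-1, 2)
        d = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)   # detail coefficients
        x = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)   # approximation
        energies.append(np.mean(d**2))
    return np.array(energies)

rng = np.random.default_rng(5)
white = rng.normal(size=4096)
print(np.log2(haar_energies(white)))   # roughly flat for white noise
```

For traffic with Hurst parameter H > 1/2, the log2 energies instead grow approximately linearly in scale with slope 2H - 1, which is how wavelet methods quantify burstiness.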