Explained Variation in Survival Analysis and Hypothesis Testing for Current Leukemia-Free Survival
My dissertation investigates two topics. The first is a comparison of the
performance of existing explained variation measures in survival analysis
when they are used to choose between two classification scores. Explained
variation in survival analysis is the counterpart of R^2 in the general linear
model. Classification scores are commonly used to assign patients to different
prognostic groups based on survival outcome. The scores are typically based on
a set of patient characteristics, and these characteristics can often be
combined in different ways to give competing methods of scoring. We examine
how well a number of measures of explained variation suggested in the survival
literature perform in selecting the better classification score. A Monte Carlo
study is designed and implemented, and practical recommendations are provided.
The second topic concerns hypothesis testing for current leukemia-free
survival. Relapse after bone marrow transplant for patients with leukemia is
regarded as a failure of the treatment. In recent years, donor lymphocyte
infusion (DLI) has been suggested as a new approach for treating patients who
relapse. Of clinical interest, then, is the probability that a patient is
alive and leukemia-free at a given time point after the transplant, with or
without DLI. This probability is called the current leukemia-free survival
(CLFS) probability. Klein (2000, British Journal of Haematology) introduced a
linear combination of three survival curves as an estimator of CLFS. Based on
this estimator, we construct a series of test statistics to compare CLFS
between two groups. We present simulation results on the type I error rates
and power of these testing methods under different scenarios, analyze real
data as an illustration of these testing methods, and provide practical
recommendations.
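As a rough, self-contained illustration of how an estimator of this kind can be
computed, the sketch below (in Python) combines three Kaplan-Meier curves into a
single curve of the form S1(t) - S2(t) + S3(t). The synthetic data, the choice of
component curves, and the signs in the combination are placeholders made for
illustration only; the precise definitions of the three curves are those of
Klein (2000).

```python
import numpy as np

rng = np.random.default_rng(0)

def kaplan_meier(time, event):
    """Kaplan-Meier estimate: distinct event times and S(t) just after each."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    uniq = np.unique(time[event == 1])
    at_risk = np.array([(time >= t).sum() for t in uniq])
    deaths = np.array([((time == t) & (event == 1)).sum() for t in uniq])
    return uniq, np.cumprod(1.0 - deaths / at_risk)

def km_at(grid, times, surv):
    """Evaluate the step function S(t) on a grid (S(t) = 1 before the first event)."""
    idx = np.searchsorted(times, grid, side="right") - 1
    return np.where(idx >= 0, surv[np.clip(idx, 0, None)], 1.0)

def synthetic_arm(n):
    """Placeholder (time, event) data so that the sketch runs end to end."""
    t, c = rng.exponential(24, n), rng.exponential(36, n)
    return np.minimum(t, c), (t <= c).astype(int)

grid = np.linspace(0, 60, 121)                        # months after transplant
s1 = km_at(grid, *kaplan_meier(*synthetic_arm(200)))  # e.g. leukemia-free survival
s2 = km_at(grid, *kaplan_meier(*synthetic_arm(80)))   # curve entering with a minus sign
s3 = km_at(grid, *kaplan_meier(*synthetic_arm(60)))   # e.g. post-DLI disease-free survival
clfs = s1 - s2 + s3                                   # linear combination of three curves
print(clfs[::20])
```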
Seminar page
Search for Level Sets of Functions by Computer Experiments
In engineering and other fields, it is common to use a
computer simulation to model a real-world process. The inputs
to a function f represent factors that influence the outcome,
and the output is a measure to be optimized. L is a specified
level; the objective is to find the inputs for which the output
is above L. L may be a tolerance level, in which case the inputs
for which the response exceeds L form a tolerance region. We
might estimate this region by evaluating f on a grid, but even a
coarse grid may contain many points, and f may be costly to
evaluate. The objective is to estimate the tolerance region
with as few evaluations as possible.
We approach this problem with a sequential search. The data
in hand are used to fit a spatial process that approximates f.
This process gives an estimate of the L-contour and can be used
to estimate how much information would be gained if f were
evaluated at a point p. We choose points where the estimated
value of f is near L but the uncertainty is high, evaluate f at
the chosen points, and augment the set of data points and the
set of data values. The procedure is repeated with the augmented
data; convergence criteria are calculated after each iteration,
and the search stops when the criteria reach preset goals.
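To make the loop concrete, here is a minimal sketch using a Gaussian-process
surrogate from scikit-learn. The test function, the level L, the candidate set
standing in for a grid, the acquisition rule, and the stopping threshold are all
illustrative assumptions rather than the exact criteria of the talk.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def f(x):
    """Stand-in for an expensive simulator with two inputs and one output."""
    return np.sin(3 * x[:, 0]) + np.cos(2 * x[:, 1])

L = 0.5                                      # level defining the region {x : f(x) > L}
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(8, 2))          # small initial design
y = f(X)
cand = rng.uniform(-1, 1, size=(2000, 2))    # cheap candidate set standing in for a grid

for it in range(20):
    gp = GaussianProcessRegressor(ConstantKernel() * RBF([0.3, 0.3]),
                                  normalize_y=True).fit(X, y)
    mu, sd = gp.predict(cand, return_std=True)
    # Acquisition: estimated probability that a candidate is misclassified
    # relative to L; it is largest where mu is near L and the uncertainty is high.
    p_above = norm.sf((L - mu) / np.maximum(sd, 1e-9))
    acq = np.minimum(p_above, 1.0 - p_above)
    if acq.max() < 0.05:                     # convergence: the L-contour is well resolved
        break
    x_new = cand[[int(np.argmax(acq))]]      # point where an evaluation is most informative
    X = np.vstack([X, x_new])                # evaluate f there and augment the data
    y = np.append(y, f(x_new))

region = cand[gp.predict(cand) > L]          # estimated tolerance region
```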
The search process is applied to several functions
defined in low-dimensional space. Finally, it is applied to
an actual simulation function.
Seminar page
Strong and Weak Laws of Large Numbers for Weighted Sums of I.I.D. Banach
Space Valued Random Elements with Slowly Varying Weights
For a sequence of i.i.d. random elements {V(n),n=1,2,...} in a
real Banach space X and a slowly varying function L, two findings
concerning the normed weighted sum U(n) = (L(1)V(1)+...+L(n)V(n))/(L(n)n)
will be presented. The first asserts that U(n)
converges almost surely to an element v in X if and only if (i)
||V(1)|| is integrable and (ii) EV(1)=v. Moreover, when X is of
Rademacher type p where p is in (1,2], the second result presents
conditions (which are weaker than (i) and (ii) above) under which U(n)
converges in probability to some element of X. This result can fail if
the Rademacher type p proviso is dispensed with. Both of these limit
theorems are new results even when X is the real line. This work is
joint with Robert L. Taylor (Clemson University, Clemson, South Carolina).
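As a quick numerical check of the first result in the simplest setting, where X
is the real line, the sketch below takes L(n) = log(n + 1) as the slowly varying
weight and standard exponential summands; both choices are assumptions made
purely for illustration. U(n) should then settle near EV(1) = 1.

```python
import numpy as np

rng = np.random.default_rng(2025)
n = 100_000
V = rng.exponential(1.0, n)          # i.i.d. summands with E V(1) = 1 (real-line case)
k = np.arange(1, n + 1)
L = np.log(k + 1)                    # a slowly varying weight function
U = np.cumsum(L * V) / (L * k)       # U(n) = (L(1)V(1)+...+L(n)V(n)) / (L(n) n)
print(U[[99, 999, 9_999, 99_999]])   # values at n = 100, 1000, 10000, 100000
```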
Seminar page
On Zeros in the Sign and Signed-Rank Tests
When zeros (or ties within pairs) occur in data being analyzed with a
sign or a signed-rank test, nonparametric methods textbooks and software
consistently recommend that the zeros be deleted and the data analyzed as
though the ties did not exist. This advice is not consistent with the
objectives of the majority of applications. In most settings a better
approach would be to view the tests as testing hypotheses about a
population median. There are relatively simple p-values available that
are consistent with this viewpoint of the tests. These methods produce
tests with good power properties for testing a different (often more
appropriate) set of hypotheses than those addressed by tests that delete
the zeros.
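One simple p-value consistent with the median viewpoint, shown below for the
one-sided sign test of H0: median <= m0 against H1: median > m0, keeps the zeros
in the sample and counts them as observations that do not exceed m0. This
particular construction is an illustrative choice in that spirit, not
necessarily the specific proposal discussed in the talk.

```python
from scipy.stats import binom

def sign_test_keep_zeros(x, m0=0.0):
    """One-sided sign test of H0: median <= m0 vs H1: median > m0.
    Observations equal to m0 (the 'zeros') are retained and treated as
    observations that do not exceed m0, rather than being deleted."""
    n = len(x)                                 # zeros stay in the denominator
    k = sum(xi > m0 for xi in x)               # successes: strictly above m0
    return float(binom.sf(k - 1, n, 0.5))      # P(Bin(n, 1/2) >= k)

x = [0.0, 0.0, 0.0, 1.2, 0.8, 2.1, -0.4, 0.9, 1.5, 0.0]
print(sign_test_keep_zeros(x))                                # zeros kept
print(sign_test_keep_zeros([xi for xi in x if xi != 0.0]))    # conventional: zeros deleted
```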
Seminar page
An Alternate Version of the Conceptual Predictive Statistic
The conceptual predictive statistic, Cp, is a widely used criterion for model
selection in linear regression. Cp serves as an approximately unbiased
estimator of a discrepancy, a measure that reflects the disparity between the
generating model and a fitted candidate model. This discrepancy, based on
scaled squared error loss, is asymmetric: an alternate measure is obtained by
reversing the roles of the two models in the definition of the measure. We
propose a variant of the Cp statistic based on estimating a symmetrized version
of the discrepancy targeted by Cp. We claim that the resulting criterion
provides better protection against overfitting than Cp, since the symmetric
discrepancy is more sensitive to overspecification than its asymmetric
counterpart. We illustrate our claim by presenting simulation results.
Finally, we demonstrate the practical utility of the new criterion by
discussing a modeling application based on data collected in a cardiac
rehabilitation program at the University of Iowa Hospitals and Clinics.
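For reference, here is a minimal sketch of the standard Cp computation that the
proposed criterion modifies; the symmetrized discrepancy and its estimator from
the talk are not reproduced here, and the design matrix below is synthetic.

```python
import numpy as np

def cp(y, X_candidate, X_full):
    """Cp = SSE_p / s^2 - n + 2p, with s^2 the error-variance estimate
    taken from the largest (full) model under consideration."""
    n = len(y)
    def sse(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return float(r @ r), X.shape[1]
    sse_full, p_full = sse(X_full)
    s2 = sse_full / (n - p_full)
    sse_p, p = sse(X_candidate)
    return sse_p / s2 - n + 2 * p

# Synthetic example: the candidate model drops the last column of the full design.
rng = np.random.default_rng(3)
n = 100
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X_full[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
print(cp(y, X_full[:, :3], X_full), cp(y, X_full, X_full))
```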
Seminar page
Adaptive MCMC: A Java Applet's Perspective
MCMC algorithms are a very popular method of approximately sampling
from complicated probability distributions. A wide variety of MCMC
schemes are available, and it is tempting to have the computer
automatically "adapt" the algorithm while it runs, to improve and tune
on the fly. However, natural-seeming adaptive schemes often fail to
preserve the stationary distribution, thus destroying the fundamental
ergodicity properties necessary for MCMC algorithms to be valid. In
this talk, we review adaptive MCMC, and present simple conditions
which ensure ergodicity (proved using intuitive coupling
constructions, jointly with G.O. Roberts). The ideas are illustrated
using the very simple adaptive Metropolis algorithm animated by the Java
applet at: probability.ca/jeff/java/adapt.html
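As an illustration of the kind of scheme under discussion, here is a minimal
sketch of an adaptive random-walk Metropolis sampler whose proposal scale is
tuned toward a target acceptance rate with diminishing adaptation. The target
density, the 0.44 acceptance target, and the adaptation rate are illustrative
assumptions, not necessarily the applet's settings.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_target(x):
    """Log-density of the target, here a standard Normal (illustrative choice)."""
    return -0.5 * x * x

x, log_sigma = 0.0, 0.0              # current state and log proposal scale
chain = np.empty(50_000)
for n in range(chain.size):
    prop = x + np.exp(log_sigma) * rng.normal()
    accept = rng.uniform() < np.exp(min(0.0, log_target(prop) - log_target(x)))
    if accept:
        x = prop
    # Diminishing adaptation: push the acceptance rate toward 0.44 with steps that
    # shrink like 1/sqrt(n), so the adaptation eventually settles down; this is one
    # of the simple conditions under which adaptive schemes remain ergodic.
    log_sigma += (float(accept) - 0.44) / np.sqrt(n + 1.0)
    chain[n] = x

print(chain[10_000:].mean(), chain[10_000:].std())   # should be close to 0 and 1
```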
Seminar page
Exploratory Data Analysis with Posterior Plots
The use of techniques of exploratory data analysis represents an important
stage in many statistical investigations. One of the attractive features of a
Bayesian analysis is that it sometimes lends itself well to graphical summary.
To do this it is generally necessary to restrict attention to a small number
of key parameters. In this paper we describe how some of the principal
computational problems associated with implementing a graphical Bayesian
analysis based on posterior plots can be solved whenever an appropriate
likelihood function can be specified. We provide access to all relevant
software for intending users through a website. We show, via a prototypical
example, how the posterior plots delivered by our software are better behaved
than estimates of those posterior distributions generated by a Markov chain
Monte Carlo approach. Among other things, we provide an algorithm for
estimating efficient starting values for the numerical integration required
for the Bayesian analysis. Nuisance parameters are handled in two ways: by
incorporating them directly into the computation of exact posterior
distributions; and by concentrating them out of a conditional analysis at an
early stage when the former approach is infeasible. The latter proposal
facilitates the handling of higher dimensional nuisance parameter vectors.
Examples using simulated and real economic data are presented to illustrate
the efficacy of the approach.
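The basic computation behind a posterior plot for a single key parameter can be
sketched as follows: evaluate the likelihood times the prior on a grid and
normalize by numerical integration. The Poisson-rate example and the flat prior
below are assumptions made for illustration; they are not the software described
in the talk.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

y = np.array([3, 5, 2, 4, 6, 3])                  # illustrative count data
theta = np.linspace(0.5, 10, 400)                 # grid for the key parameter (Poisson rate)
log_post = np.array([poisson.logpmf(y, t).sum() for t in theta])   # flat prior assumed
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, theta)                     # normalize by numerical integration

plt.plot(theta, post)                             # the posterior plot itself
plt.xlabel("theta")
plt.ylabel("posterior density")
plt.show()
```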
Based on joint work with Andy Tremayne and John Naylor.
Seminar page
The NCI Biostatistics Grant Portfolio and the NIH Funding
Mechanism
The talk consists of two parts. In Part I, I will briefly talk about the
website: www.statfund.cancer.gov. This website contains information about a
large proportion of the NIH funded grants in Biostatistics. These grants are
housed in the Division of Cancer Control and Population Sciences (DCCPS) at
the National Cancer Institute (NCI). I will also discuss various funding
opportunities in (Bio-)statistics at the NCI. In Part II, I will go over the
NIH funding mechanisms and discuss the grant review process at NIH in great
detail.
Seminar page
Models with Robustness to Outliers
In his seminal 1964 paper on robust analysis, Huber introduced a distribution
that is centrally Normal, with Exponential tails beyond some pre-specified
changepoint. When fitting a location parameter, the MLE under this distribution
was shown to be "most robust" in a certain sense. More generally, interpreted
in terms of
influence functions, it has provided a simple method for downweighting
extreme, outlying data points in linear regression analyses which aim to fit
the bulk of the data well.
Extending the influence function approach to non-linear or hierarchical
models is however far less simple, leading to an absence of this form of
robustness in many areas. We therefore propose a simple location-scale family
based around the heavy-tailed 'Huber distribution', which provides a
model-based analogue of Huber's estimation methods. For simultaneously robust
inference on both location and scale, standard likelihood methods applied to
this family give results extremely closely related to Huber's well-known but
more ad-hoc "Proposal 2".
Further justification for our empirical approach is provided by examining
this fully-specified model in terms of constituent 'signal' and 'contaminant'
parts. These have several attractive operating characteristics which are both
simply understood and of broad practical appeal. The full specification of a
likelihood for the data allows simple extensions to be made for robust
inference in many complex models; a selection of examples will be given.
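A minimal sketch of the model-based idea, assuming a Huber-type density that is
Normal in the center and Exponential in the tails beyond a changepoint k, with
location and scale fitted jointly by maximum likelihood. The changepoint value,
the parameterization, and the optimizer settings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

K = 1.345                                       # changepoint (a conventional choice)

def rho(z, k=K):
    """Huber's rho: quadratic in the center, linear in the tails."""
    return np.where(np.abs(z) <= k, 0.5 * z**2, k * np.abs(z) - 0.5 * k**2)

# Normalizing constant of the density proportional to exp(-rho(z)).
LOG_C = np.log(np.sqrt(2 * np.pi) * (2 * norm.cdf(K) - 1) + 2 * np.exp(-0.5 * K**2) / K)

def neg_loglik(par, x):
    mu, log_s = par
    s = np.exp(log_s)
    z = (x - mu) / s
    return np.sum(rho(z)) + x.size * (np.log(s) + LOG_C)   # minus the log-likelihood

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(8, 1, 5)])   # bulk plus outliers
fit = minimize(neg_loglik, x0=[np.median(x), 0.0], args=(x,))
print(fit.x[0], np.exp(fit.x[1]))               # robust location and scale estimates
```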
Seminar page
Analysis of Microarray Gene Expression Data with Nonparametric
Hypothesis Testing
This talk will present two nonparametric methods suitable for
clustering of time course gene expression and replicated microarray
data respectively. Time course gene expression data (curves) are
modeled as dynamic alpha-mixing processes that allow complex
correlations in gene expression time series. A nonparametric goodness
of fit test is developed to screen out flat curves before clustering
and an agglomerative procedure is used to search for the highly
probable set of clusters of nonflat curves. Replicated microarray data
are commonly seen in disease/cancer profiling, in which matched
tumor/normal expression arrays from individual patients are available.
These data are represented as a mixture of unknown distributions, with
gene expressions in each cluster generated from the same distribution.
Some of the clusters consist of genes that are overexpressed while
others contain genes that are underexpressed. A divisive procedure is
developed for clustering such data. In both procedures, the similarity
measure between clusters is defined as the p values from nonparametric
multivariate hypothesis testing for corresponding hypotheses. The
number of distinct clusters is determined automatically by specified
significance levels. The test statistics use overall ranks of
expressions so that the result is invariant to monotone
transformations of the data. Simulations and two microarray data sets are
used to examine the properties of the methods.
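The agglomerative idea can be caricatured as follows, using a univariate
rank-based (Mann-Whitney) p-value on pooled expression values as the
between-cluster similarity and a fixed significance level as the stopping rule.
The multivariate nonparametric tests actually used in the talk are different;
this is only a schematic stand-in.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def pval(a, b):
    """Rank-based p-value used as similarity: a large p-value means the two
    clusters look alike (a univariate stand-in for the multivariate tests)."""
    return mannwhitneyu(np.ravel(a), np.ravel(b)).pvalue

def agglomerate(genes, alpha=0.01):
    """Repeatedly merge the most similar pair of clusters until every remaining
    pair differs at level alpha, so the number of clusters is set by alpha."""
    clusters = [np.atleast_2d(g) for g in genes]
    while len(clusters) > 1:
        pairs = [(pval(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        p, i, j = max(pairs)
        if p < alpha:                          # all pairs significantly different: stop
            break
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

rng = np.random.default_rng(6)
genes = [rng.normal(0, 1, 8) for _ in range(10)] + [rng.normal(3, 1, 8) for _ in range(10)]
print(len(agglomerate(genes)))                 # ideally recovers the two simulated groups
```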
Seminar page
Powerful Strategies for Linkage Disequilibrium Mapping by Exploiting
Gene-Gene and Gene-Environment Interactions
In the "indirect" approach for fine mapping of disease genes, the association
between the disease and a genomic region is studied using a set of marker SNPs
that are likely to be in linkage disequilibrium with the underlying causal
loci/haplotypes, if any exist. The SNPs themselves may not be causal. In
this study, we propose novel strategies for testing associations in
marker-based genetic association studies incorporating gene-gene and
gene-environment interactions, two sources of heterogeneity expected to be
present for complex diseases like cancers. We propose a parsimonious approach
to modeling interactions by exploiting the fact that each individual marker
within a gene is unlikely to have its own distinct biologic functions, but,
instead, the markers are likely to be "surrogates" for a common
underlying "biologic phenotype" for the gene, which, by itself, or
by interacting with other genetic and/or environmental products, causes the
disease. We use this approach to develop powerful tests of association in
terms of observable genetic markers, assuming that the biologic phenotypes
themselves are latent (not directly observable). We studied the performance of
the proposed methods under different models for gene-gene interactions using
simulated data following the design of a case-control study that we have
recently undertaken to investigate the association between prostate cancer
and candidate genes encoding for selenoenzymes. We also illustrate the
utility of the proposed methodology using real data from a case-control study
aimed at discovering an association between colorectal adenoma and DNA variants
in the NAT2 genomic region, accounting for the smoking-NAT2 interaction. Both
applications clearly demonstrate a major power advantage of the proposed
methodology over two standard tests for association, one ignoring
gene-gene/gene-environment interactions and the other based on a saturated
model for interactions.
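For orientation, here is a sketch in the spirit of the two standard comparator
analyses mentioned above: a logistic model for disease status with marker and
exposure main effects only, and one that adds a marker-by-exposure interaction
term. The data are simulated, and the latent-phenotype methodology of the talk
itself is not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 2000
g = rng.binomial(2, 0.3, n)                     # marker genotype coded 0/1/2
e = rng.binomial(1, 0.5, n)                     # binary exposure (e.g. smoking)
logit = -1.0 + 0.1 * g + 0.2 * e + 0.4 * g * e  # simulated interaction effect
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # disease status

X_main = sm.add_constant(np.column_stack([g, e]))
X_int = sm.add_constant(np.column_stack([g, e, g * e]))
fit_main = sm.Logit(y, X_main).fit(disp=0)      # ignores the gene-environment interaction
fit_int = sm.Logit(y, X_int).fit(disp=0)        # includes the interaction term
print(2 * (fit_int.llf - fit_main.llf))         # likelihood-ratio statistic for the interaction
```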
Seminar page
Simple Bayesian Estimators for Molecular Genealogy
Genealogists have rapidly been turning to DNA markers to connect
relatives lacking a paper trail. As we will detail, the markers of choice are
blocks of loci on non-recombining regions, in particular the Y chromosome.
While simple likelihood methods can be used to estimate a time to most
recent common ancestor based on marker information, these have several
problems (which will be discussed). Bayesian estimators provide a nice
solution, and we will examine several estimators based upon increasingly
realistic models of mutation. Much of this work is discussed in Walsh, B.
2001. Estimating the time to the MRCA for the Y chromosome or mtDNA for a
pair of individuals, Genetics 158: 897-912.
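A minimal sketch of the Bayesian calculation in its simplest form, assuming an
infinite-alleles mutation model per locus (a locus matches only if no mutation
occurred on the 2t meioses separating the pair) and an exponential-type prior on
the time to the most recent common ancestor. The mutation rate, the prior, and
the data are illustrative assumptions; Walsh (2001) develops the more realistic
mutation models discussed in the talk.

```python
import numpy as np

mu = 0.002                      # per-locus, per-generation mutation rate (assumed)
n_loci, n_match = 25, 23        # markers compared and matches observed (illustrative)
t = np.arange(1, 2001)          # candidate TMRCA values, in generations

p_match = np.exp(-2 * mu * t)   # infinite alleles: no mutation over 2t meioses
loglik = n_match * np.log(p_match) + (n_loci - n_match) * np.log1p(-p_match)
logprior = -t / 500.0           # exponential-type prior with mean 500 generations (assumed)
post = np.exp(loglik + logprior - (loglik + logprior).max())
post /= post.sum()

print(t[np.argmax(post)])       # posterior mode of the TMRCA
print(np.sum(t * post))         # posterior mean
```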
Seminar page
Nonparametric and Semiparametric Regression for Longitudinal/Clustered
Data
We consider nonparametric and semiparametric regression estimation for
longitudinal/clustered data using kernel and spline methods. We show that,
unlike for independent data, commonly used kernel and spline estimators are
not asymptotically equivalent for clustered/longitudinal data. Conventional
kernel extensions of GEEs fail to account for the within-cluster correlation,
while spline methods are able to account for this correlation. We identify a
kernel that is asymptotically equivalent to the smoothing spline for
clustered/longitudinal data, and we extend the results to likelihood settings.
We next consider semiparametric regression models, where some covariate
effects are modeled parametrically while others are modeled nonparametrically.
We derive the semiparametric efficient score and show that the profile/kernel
or profile/spline estimator is semiparametric efficient. The proposed methods
are applicable to a wide range of longitudinal/clustered data problems,
including missing data and measurement error problems. The methods are
illustrated using simulation studies and data examples.
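In its simplest form, the semiparametric model considered here is the partially
linear model for clustered data; with notation chosen here for concreteness, it
can be written as

    Y_{ij} = X_{ij}^{\top} \beta + \theta(T_{ij}) + \varepsilon_{ij},
    \qquad i = 1, \dots, n, \quad j = 1, \dots, m_i,

where \beta is the parametric effect, \theta is an unspecified smooth function
estimated by kernel or spline methods, and the errors within a cluster are
correlated.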
Seminar page
Functions, Curves and Images: Modeling Shape and Variability
I will present an overview of self-modeling for functional data. Functional
data are obtained when the ideal observation for each experimental unit is a
function (a curve or outline). Since it is not possible to observe the entire
function, the data for each experimental unit consist of a number of noisy
observations of the function at various points. I am interested in methods for
analyzing such data which are based on an underlying statistical model. This
allows conclusions to be drawn about the likelihood of real differences in the
observed curves from, say, two different groups of subjects.
Lawton et al. (1972) propose the self-modeling approach for functional data.
Their method is based on the assumption that all individuals' response curves
have a common shape and that a particular individual's curve is some simple
transformation of the common shape curve. This basic model can be expanded by
adding the assumption that the transformation parameters are random in the
population (Lindstrom, 1995). This allows a more natural approach to complex
data.
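In one common formulation (the notation here is assumed for concreteness), the
shape-invariant model writes each individual's observed curve as a shifted and
scaled version of a common shape function f,

    y_{ij} = \alpha_i + \beta_i \, f(\gamma_i + \delta_i \, t_{ij}) + \varepsilon_{ij},

and the expanded model treats the transformation parameters
(\alpha_i, \beta_i, \gamma_i, \delta_i) as random effects across individuals.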
This methodology can also be generalized to two-dimensional response curves
such as those that arise in speech kinematics and other areas of research
on motion (Ladd and Lindstrom, 2000). These parameterized curves are usually
obtained by recording the two-dimensional location of an object over time. In
this setting, time is the independent variable, and the (two-dimensional)
location in space is the response. Collections of such parameterized curves
can be obtained either from one subject or from a number of different
subjects, each producing one or more repetitions of the response curve.
Finally, the methods for two-dimensional, time-parameterized curves can be
extended to model outlines (closed curves) collected from medical or other
images. For example, in a study of autistic and normally developing children,
the outlines of the corpus callosum were collected from brain MRIs.
Self-modeling allows us to model the outlines, describe the variability within
each group and also assess the existence of meaningful differences between the
groups.
Seminar page