Leiyan Lu, Medical College of Wisconsin

Explained Variation in Survival Analysis and Hypothesis Testing for Current Leukemia-Free Survival

Two topics were investigated in my dissertation. The first is a comparison of the performance of existing explained-variation measures in survival analysis when choosing between two classification scores. Explained variation in survival analysis is the counterpart of R² in the general linear model. Classification scores are commonly used to assign patients to different prognostic groups based on survival outcome. The scores are typically based on a set of patient characteristics, and these characteristics can often be combined in different ways to give competing methods of scoring. We examine how well a number of measures of explained variation suggested in the survival literature perform in selecting the best classification score. A Monte Carlo study is designed and implemented, and practical recommendations are provided.
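The design of such a comparison can be illustrated with a censoring-free toy analogue. The measure below is an ANOVA-style R² on raw survival times, not one of the survival-specific measures compared in the talk; the data are made up for illustration:

```python
import statistics

def explained_variation(times, groups):
    """ANOVA-style explained variation: one minus the pooled within-group
    variance over the total variance of the survival times (no censoring)."""
    total = statistics.pvariance(times)
    by_group = {}
    for t, g in zip(times, groups):
        by_group.setdefault(g, []).append(t)
    within = sum(len(v) * statistics.pvariance(v)
                 for v in by_group.values()) / len(times)
    return 1.0 - within / total

# A score that separates short from long survivors explains everything;
# a crossed grouping explains nothing.
perfect = explained_variation([1.0, 1.0, 5.0, 5.0], [0, 0, 1, 1])
crossed = explained_variation([1.0, 1.0, 5.0, 5.0], [0, 1, 0, 1])
```

In a Monte Carlo study of the kind described, such a measure would be computed for each competing score over many simulated data sets and the scores ranked by it.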

The second topic concerns hypothesis testing for current leukemia-free survival. Relapse after bone marrow transplant in patients with leukemia is regarded as a failure of the treatment. In recent years, donor lymphocyte infusion (DLI) has been suggested as a new approach for treating patients who relapse. Of clinical interest, then, is the probability that a patient is alive and leukemia-free at a given time point after the transplant, with or without DLI. This probability is called the current leukemia-free survival (CLFS) probability. Klein (2000, British Journal of Haematology) introduced a linear combination of three survival curves as an estimator of CLFS. Based on this estimator, we construct a series of test statistics to compare CLFS between two groups. We present simulation results estimating the type I error rates and power of these testing methods under different scenarios, analyze real data as an illustration of the methods, and provide practical recommendations.
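The basic ingredient of such an estimator is the Kaplan-Meier curve. The signed combination below is schematic only; the precise definition of the three curves entering the combination is given in Klein (2000):

```python
def kaplan_meier(times, events, t):
    """Kaplan-Meier survival estimate S(t).
    times: follow-up times; events: 1 = event observed, 0 = censored."""
    at_risk = len(times)
    s = 1.0
    for time, ev in sorted(zip(times, events)):
        if time > t:
            break
        if ev:
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return s

def clfs(t, curve1, curve2, curve3):
    """Schematic CLFS estimate as a signed combination of three
    Kaplan-Meier curves (each curve* is a (times, events) pair);
    the actual curves and signs are defined in Klein (2000)."""
    s1, s2, s3 = (kaplan_meier(tt, ee, t) for tt, ee in (curve1, curve2, curve3))
    return s1 - s2 + s3
```

A two-group test statistic would then compare such combined curves at one or more time points, which is what the simulation study evaluates.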

Curtis Miller, University of California, Riverside

Search for Level Sets of Functions by Computer Experiments

In engineering and other fields, it is common to use a computer simulation to model a real-world process. The inputs to a function f represent factors that influence the outcome, and the output is a measure to be optimized. L is a specified level; the objective is to find the inputs for which the output is above L. L may be a tolerance level, in which case the inputs for which the response > L form a tolerance region. We might estimate this region by evaluating f on a grid, but even a coarse grid may have many points, and f may be costly to evaluate. The objective is to estimate the tolerance region with as few evaluations as possible.

We approach this problem with a sequential search. The data are used to fit a spatial process that approximates f. This process gives an estimate of the L-contour and can be used to estimate how much information would be gained if f were evaluated at a point p. Points are chosen where the estimated value of f is L but the uncertainty is high; f is evaluated at the chosen points, and the sets of data points and data values are augmented. The procedure is repeated with the augmented data. Convergence criteria are calculated after each iteration, and the search stops when the criteria reach set goals.
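One iteration of this idea can be sketched in one dimension, assuming a Gaussian-process surrogate with a squared-exponential kernel and an ad hoc "near the contour, high uncertainty" score (the actual surrogate and selection criterion used by the speaker may differ):

```python
import math

def rbf(a, b, ls=0.3):
    """Squared-exponential covariance between inputs a and b."""
    return math.exp(-((a - b) ** 2) / (2.0 * ls * ls))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, x, noise=1e-8):
    """Surrogate posterior mean and variance at x given data (xs, ys)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    alpha = solve(K, ys)
    k_star = [rbf(a, x) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = rbf(x, x) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)

def next_point(xs, ys, L, grid, w=1.0):
    """Pick the grid point where the surrogate mean is near L and sd is large."""
    def score(x):
        m, v = gp_posterior(xs, ys, x)
        return -abs(m - L) + w * math.sqrt(v)
    return max(grid, key=score)
```

The chosen point would be evaluated with the expensive simulator, appended to `xs`/`ys`, and the loop repeated until a convergence criterion is met.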

The search process is applied to several functions defined in low dimensional space. Finally, it is applied to an actual simulation function.

Andrew Rosalsky, University of Florida

Strong and Weak Laws of Large Numbers for Weighted Sums of I.I.D. Banach Space Valued Random Elements with Slowly Varying Weights

For a sequence of i.i.d. random elements {V(n), n=1,2,...} in a real Banach space X and a slowly varying function L, two results concerning the normed weighted sum U(n) = (L(1)V(1)+...+L(n)V(n))/(nL(n)) will be presented. The first asserts that U(n) converges almost surely to an element v of X if and only if (i) ||V(1)|| is integrable and (ii) EV(1) = v. Moreover, when X is of Rademacher type p for some p in (1,2], the second result presents conditions (weaker than (i) and (ii) above) under which U(n) converges in probability to some element of X. This result can fail if the Rademacher type p proviso is dispensed with. Both of these limit theorems are new results even when X is the real line. This work is joint with Robert L. Taylor (Clemson University, Clemson, South Carolina).
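The real-line case is easy to simulate. The choices below (L(n) = log(n+1), exponential V's) are illustrative; note that with slowly varying weights the approach of U(n) to EV(1) is slow, of order 1/log n, so at finite n the statistic tracks the deterministic weight ratio:

```python
import math
import random

def u_n(n, draw, L=lambda k: math.log(k + 1.0)):
    """U(n) = (L(1)V(1) + ... + L(n)V(n)) / (n L(n)) for i.i.d. draws V(k)."""
    return sum(L(k) * draw() for k in range(1, n + 1)) / (n * L(n))

random.seed(7)
n = 200_000
u = u_n(n, lambda: random.expovariate(1.0))   # E V(1) = 1, so U(n) -> 1 a.s.
# Deterministic weight ratio sum L(k) / (n L(n)); it tends to 1 only
# like 1 - 1/log n, which is why u is still visibly below 1 here.
ratio = sum(math.log(k + 1.0) for k in range(1, n + 1)) / (n * math.log(n + 1.0))
```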

Ronald Randles, University of Florida

On Zeros in the Sign and Signed-Rank Tests

When zeros (or ties within pairs) occur in data being analyzed with a sign or a signed-rank test, nonparametric methods textbooks and software consistently recommend that the zeros be deleted and the data analyzed as though the ties did not exist. This advice is not consistent with the objectives of the majority of applications. In most settings a better approach would be to view the tests as testing hypotheses about a population median. Relatively simple p-values are available that are consistent with this view of the tests. These methods produce tests with good power properties for testing a different (often more appropriate) set of hypotheses than those addressed by tests that delete the zeros.
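The contrast can be sketched for the sign test. The zero-keeping variant below is illustrative of the median-oriented viewpoint, not the speaker's exact proposal:

```python
import math

def binom_tail(n, k):
    """P(Bin(n, 1/2) >= k), computed exactly."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2.0 ** n

def sign_test_delete_zeros(diffs):
    """Conventional two-sided sign test: drop the zeros, then test
    whether positive and negative differences are equally likely."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    return min(1.0, 2.0 * binom_tail(pos + neg, max(pos, neg)))

def sign_test_keep_zeros(diffs):
    """Median-oriented variant (illustrative, not the exact proposal in
    the talk): keep the zeros in n and count them against rejection,
    so ties make it harder to declare the median nonzero."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    return min(1.0, 2.0 * binom_tail(len(diffs), max(pos, neg)))
```

On data with many zeros the second p-value is never smaller than the first, reflecting that observed ties are evidence consistent with a zero median.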

Joseph E. Cavanaugh, Department of Biostatistics, The University of Iowa

An Alternate Version of the Conceptual Predictive Statistic

The conceptual predictive statistic, Cp, is a widely used criterion for model selection in linear regression. Cp serves as an approximately unbiased estimator of a discrepancy, a measure that reflects the disparity between the generating model and a fitted candidate model. This discrepancy, based on scaled squared error loss, is asymmetric: an alternate measure is obtained by reversing the roles of the two models in the definition of the measure. We propose a variant of the Cp statistic based on estimating a symmetrized version of the discrepancy targeted by Cp. We claim that the resulting criterion provides better protection against overfitting than Cp, since the symmetric discrepancy is more sensitive to overspecification than its asymmetric counterpart. We illustrate our claim by presenting simulation results. Finally, we demonstrate the practical utility of the new criterion by discussing a modeling application based on data collected in a cardiac rehabilitation program at the University of Iowa Hospitals and Clinics.
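For reference, the classical (asymmetric) criterion, Cp = SSE_p/s² − n + 2p with s² from the full model, can be sketched as follows; the symmetrized variant proposed in the talk is not reproduced here. A known identity, Cp = p for the full model, serves as a check:

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def sse(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * xi for b, xi in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def mallows_cp(X, y, cols):
    """C_p = SSE_p / s^2 - n + 2p, with s^2 estimated from the full model."""
    n, p_full = len(X), len(X[0])
    s2 = sse(X, y) / (n - p_full)
    Xs = [[row[c] for c in cols] for row in X]
    return sse(Xs, y) / s2 - n + 2 * len(cols)

random.seed(3)
X = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
y = [1.0 + 2.0 * r[1] + random.gauss(0, 1) for r in X]   # column 2 is pure noise
```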

Jeffrey S. Rosenthal, University of Toronto

Adaptive MCMC: A Java Applet's Perspective

MCMC algorithms are a very popular method of approximately sampling from complicated probability distributions. A wide variety of MCMC schemes are available, and it is tempting to have the computer automatically "adapt" the algorithm while it runs, to improve and tune on the fly. However, natural-seeming adaptive schemes often fail to preserve the stationary distribution, thus destroying the fundamental ergodicity properties necessary for MCMC algorithms to be valid. In this talk, we review adaptive MCMC, and present simple conditions which ensure ergodicity (proved using intuitive coupling constructions, jointly with G.O. Roberts). The ideas are illustrated using the very simple adaptive Metropolis algorithm animated by the java applet at: probability.ca/jeff/java/adapt.html
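A minimal sketch of the flavor of algorithm in the applet: a 1-D random-walk Metropolis sampler whose proposal scale adapts on the fly. The i**-0.6 schedule and 0.44 acceptance target are conventional choices, assumed here rather than taken from the talk; the shrinking adjustment is the "diminishing adaptation" ingredient of the ergodicity conditions:

```python
import math
import random

def adaptive_metropolis(logpi, x0, n, target=0.44):
    """Random-walk Metropolis with an adapting proposal scale.
    The adaptation size i**-0.6 shrinks with the iteration count
    (diminishing adaptation), nudging acceptance toward `target`."""
    x, log_sigma, out = x0, 0.0, []
    for i in range(1, n + 1):
        prop = x + math.exp(log_sigma) * random.gauss(0.0, 1.0)
        if math.log(random.random()) < logpi(prop) - logpi(x):
            x = prop
            accepted = 1.0
        else:
            accepted = 0.0
        log_sigma += i ** -0.6 * (accepted - target)   # shrinking adjustment
        out.append(x)
    return out

random.seed(42)
draws = adaptive_metropolis(lambda z: -0.5 * z * z, x0=5.0, n=30_000)  # N(0,1) target
tail = draws[10_000:]                                  # discard burn-in
mean = sum(tail) / len(tail)
var = sum((d - mean) ** 2 for d in tail) / len(tail)
```

A naive scheme that kept adapting by a fixed amount forever could violate the conditions and fail to target the stationary distribution.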

John Marriott, Nottingham Trent University

Exploratory Data Analysis with Posterior Plots

The use of techniques of exploratory data analysis represents an important stage in many statistical investigations. One of the attractive features of a Bayesian analysis is that it sometimes lends itself well to graphical summary. To do this it is generally necessary to restrict attention to a small number of key parameters. In this paper we describe how some of the principal computational problems associated with implementing a graphical Bayesian analysis based on posterior plots can be solved whenever an appropriate likelihood function can be specified. We provide access to all relevant software for prospective users through a website. We show, via a prototypical example, how the posterior plots delivered by our software are better behaved than estimates of those posterior distributions generated from a Markov chain Monte Carlo approach. Among other things, we provide an algorithm for estimating efficient starting values for the numerical integration required for the Bayesian analysis. Nuisance parameters are handled in two ways: by incorporating them directly into the computation of exact posterior distributions; and by concentrating them out of a conditional analysis at an early stage when the former approach is infeasible. The latter proposal facilitates the handling of higher dimensional nuisance parameter vectors. Examples using simulated and real economic data are presented to illustrate the efficacy of the approach.
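The core computation behind a posterior plot for a single key parameter can be sketched as grid-based numerical integration (a flat prior and a normal likelihood are assumed here for brevity):

```python
import math

def posterior_on_grid(loglik, grid):
    """Posterior density values for one key parameter on an equally spaced
    grid: exponentiate a stabilized log-likelihood (plus log-prior, here
    flat) and normalize with the trapezoid rule."""
    m = max(loglik(t) for t in grid)                 # stabilize the exponentials
    dens = [math.exp(loglik(t) - m) for t in grid]
    h = grid[1] - grid[0]
    area = h * (sum(dens) - 0.5 * (dens[0] + dens[-1]))
    return [d / area for d in dens]

data = [1.0, 2.0, 3.0]
loglik = lambda th: -0.5 * sum((x - th) ** 2 for x in data)   # N(th, 1) model
grid = [-2.0 + 0.01 * i for i in range(801)]
post = posterior_on_grid(loglik, grid)
```

Plotting `post` against `grid` gives the posterior plot; good starting values and grid limits for this integration are exactly the kind of computational detail the paper addresses.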

Based on joint work with Andy Tremayne and John Naylor

Ram Tiwari, National Institutes of Health

The NCI Biostatistics Grant Portfolio and the NIH Funding Mechanism

The talk consists of two parts. In Part I, I will briefly talk about the website: www.statfund.cancer.gov. This website contains information about a large proportion of the NIH funded grants in Biostatistics. These grants are housed in the Division of Cancer Control and Population Sciences (DCCPS) at the National Cancer Institute (NCI). I will also discuss various funding opportunities in (Bio-)statistics at the NCI. In Part II, I will go over the NIH funding mechanisms and discuss the grant review process at NIH in great detail.

Kenneth Rice, Department of Biostatistics, University of Washington

Models with Robustness to Outliers

In his seminal 1964 paper on robust analysis, Huber introduced a distribution that is centrally Normal, with Exponential tails beyond some pre-specified changepoint. On fitting for a location parameter, the MLE was shown to be "most robust" in a certain sense. More generally, interpreted in terms of influence functions, it has provided a simple method for downweighting extreme, outlying data points in linear regression analyses which aim to fit the bulk of the data well.
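The distribution and the resulting bounded influence function can be written down directly (k = 1.345 is a conventional tuning constant, an assumption here rather than a value from the talk):

```python
def huber_logdensity(x, k=1.345):
    """Unnormalized log-density of Huber's least favourable distribution:
    Gaussian in the centre, exponential tails beyond the changepoint k.
    The two branches match in value and slope at |x| = k."""
    ax = abs(x)
    if ax <= k:
        return -0.5 * ax * ax
    return 0.5 * k * k - k * ax

def huber_psi(x, k=1.345):
    """Influence function of the location MLE: linear in the centre,
    clipped (bounded) in the tails -- the bounding is what downweights
    extreme, outlying observations."""
    return max(-k, min(k, x))
```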

Extending the influence function approach to non-linear or hierarchical models is however far less simple, leading to an absence of this form of robustness in many areas. We therefore propose a simple location-scale family based around the heavy-tailed 'Huber distribution', which provides a model-based analogue of Huber's estimation methods. For simultaneously robust inference on both location and scale, standard likelihood methods applied to this family give results extremely closely related to Huber's well-known but more ad-hoc "Proposal 2".

Further justification for our empirical approach is provided by examining this fully-specified model in terms of constituent 'signal' and 'contaminant' parts. These have several attractive operating characteristics which are both simply understood and of broad practical appeal. The full specification of a likelihood for the data allows simple extensions to be made for robust inference in many complex models; a selection of examples will be given.

Haiyan Wang, Kansas State University

Analysis of Microarray Gene Expression Data with Nonparametric Hypothesis Testing

This talk will present two nonparametric methods suitable for clustering of time-course gene expression data and replicated microarray data, respectively. Time-course gene expression data (curves) are modeled as dynamic alpha-mixing processes that allow complex correlations in gene expression time series. A nonparametric goodness-of-fit test is developed to screen out flat curves before clustering, and an agglomerative procedure is used to search for the highly probable set of clusters of nonflat curves. Replicated microarray data are commonly seen in disease/cancer profiling, in which matched tumor/normal expression arrays from individual patients are available. These data are represented as a mixture of unknown distributions, with gene expressions in each cluster generated from the same distribution. Some of the clusters consist of genes that are overexpressed, while others contain genes that are underexpressed. A divisive procedure is developed for clustering such data. In both procedures, the similarity measure between clusters is defined as the p-value from a nonparametric multivariate hypothesis test of the corresponding hypotheses. The number of distinct clusters is determined automatically by specified significance levels. The test statistics use overall ranks of expressions, so the result is invariant to monotone transformations of the data. Simulations and two microarray data sets are used to illustrate the properties of the method.
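The p-value-as-similarity idea can be sketched with a univariate rank-based two-sample test (normal approximation, no tie correction; a deliberate simplification of the multivariate tests used in the talk):

```python
import math

def rank_sum_pvalue(a, b):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation
    (assumes no ties). As a cluster-similarity measure: a large p-value
    means the two clusters look alike, so they are candidates to merge."""
    n1, n2 = len(a), len(b)
    pooled = sorted(a + b)
    rank = {v: i + 1 for i, v in enumerate(pooled)}   # 1-based ranks, no ties
    w = sum(rank[v] for v in a)                       # rank sum of sample a
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))         # 2 * (1 - Phi(|z|))
```

Rank-based statistics like this are what make the clustering invariant to monotone transformations of the expression values.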

Nilanjan Chatterjee, National Cancer Institute

Powerful Strategies for Linkage Disequilibrium Mapping by Exploiting Gene-Gene and Gene-Environment Interactions

In the "indirect" approach to fine mapping of disease genes, the association between the disease and a genomic region is studied using a set of marker SNPs that are likely to be in linkage disequilibrium with the underlying causal loci/haplotypes, if any exist. The SNPs themselves may not be causal. In this study, we propose novel strategies for testing associations in marker-based genetic association studies incorporating gene-gene and gene-environment interactions, two sources of heterogeneity expected to be present for complex diseases like cancers. We propose a parsimonious approach to modeling interactions by exploiting the fact that each individual marker within a gene is unlikely to have its own distinct biologic function; instead, the markers are likely to be "surrogates" for a common underlying "biologic phenotype" of the gene, which, by itself or by interacting with other genetic and/or environmental products, causes the disease. We use this approach to develop powerful tests of association in terms of observable genetic markers, assuming that the biologic phenotypes themselves are latent (not directly observable). We study the performance of the proposed methods under different models for gene-gene interaction using simulated data following the design of a case-control study that we have recently undertaken to investigate the association between prostate cancer and candidate genes encoding selenoenzymes. We also illustrate the utility of the proposed methodology using real data from a case-control study for discovering association between colorectal adenoma and DNA variants in the NAT2 genomic region, accounting for the smoking-NAT2 interaction. Both applications clearly demonstrate the major power advantage of the proposed methodology over two standard tests for association, one ignoring gene-gene/gene-environment interactions and the other based on a saturated model for interactions.

Bruce Walsh, Department of Ecology and Evolutionary Biology, University of Arizona

Simple Bayesian Estimators for Molecular Genealogy

Genealogists have rapidly been turning to DNA markers to connect relatives lacking a paper trail. As we will detail, the markers of choice are blocks of loci on non-recombining regions, in particular the Y chromosome. While simple likelihood methods can be used to estimate a time to most recent common ancestor based on marker information, these have several problems (which will be discussed). Bayesian estimators provide a nice solution, and we will examine several estimators based upon increasingly more realistic models of mutation. Much of this work is discussed in Walsh, B. 2001. Estimating the time to the MRCA for the Y chromosome or mtDNA for a pair of individuals, Genetics 158: 897--912.
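A grid-based sketch of such a Bayesian estimator, under a deliberately simplified mutation model (each marker independently still matches after t generations with probability exp(-2*mu*t); Walsh (2001) uses more realistic stepwise models, and the rates below are illustrative):

```python
import math

def tmrca_posterior(n_markers, n_mismatch, mu, t_grid, log_prior=lambda t: 0.0):
    """Grid posterior for the time to the MRCA (in generations) given the
    number of mismatching Y-chromosome markers between two men."""
    logs = []
    for t in t_grid:
        q = math.exp(-2.0 * mu * t)          # P(a marker still matches at t)
        ll = (n_markers - n_mismatch) * math.log(q)
        if n_mismatch:
            ll += n_mismatch * math.log(1.0 - q)
        logs.append(ll + log_prior(t))
    m = max(logs)
    w = [math.exp(v - m) for v in logs]
    s = sum(w)
    return [v / s for v in w]

t_grid = list(range(1, 401))                 # candidate TMRCAs, in generations
post = tmrca_posterior(n_markers=20, n_mismatch=4, mu=0.002, t_grid=t_grid)
```

Replacing the flat `log_prior` with one informed by genealogical knowledge is what distinguishes the Bayesian estimators from the plain likelihood approach.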

Xihong Lin, Harvard School of Public Health

Nonparametric and Semiparametric Regression for Longitudinal/Clustered Data

We consider nonparametric and semiparametric regression estimation for longitudinal/clustered data using kernel and spline methods. We show that unlike independent data, common kernels and splines are not asymptotically equivalent for clustered/longitudinal data. Conventional kernel extensions of GEEs fail to account for the within-cluster correlation, while spline methods are able to account for this correlation. We identify an asymptotically equivalent kernel for the smoothing spline for clustered/longitudinal data. The results are extended to likelihood settings. We next consider semiparametric regression models, where some covariate effects are modeled parametrically, while others are modeled nonparametrically. We derive the semiparametric efficient score and show the profile/kernel or profile/spline estimator is semiparametric efficient. The proposed methods are applicable to a wide range of longitudinal/clustered data, including missing data and measurement error problems. The methods are illustrated using simulation studies and data examples.
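The working-independence kernel behaviour that the talk contrasts with splines can be sketched as a pooled Nadaraya-Watson smoother (the Gaussian kernel and bandwidth are illustrative choices):

```python
import math

def kernel_smooth(t0, times, ys, h=0.3):
    """Working-independence Nadaraya-Watson estimate at t0: pool all
    (time, response) pairs across subjects and weight by kernel distance
    in time, ignoring which cluster each observation came from -- the
    conventional kernel-GEE behaviour discussed in the talk."""
    num = den = 0.0
    for t, y in zip(times, ys):
        w = math.exp(-0.5 * ((t - t0) / h) ** 2)   # Gaussian kernel weight
        num += w * y
        den += w
    return num / den
```

Because the weights depend only on time, any within-cluster correlation structure is left unexploited, which is precisely the source of the asymptotic inefficiency relative to spline methods.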

Mary Lindstrom, Department of Biostatistics and Medical Informatics, University of Wisconsin Madison

Functions, Curves and Images: Modeling Shape and Variability

I will present an overview of self-modeling for functional data. Functional data are obtained when the ideal observation for each experimental unit is a function (a curve or outline). Since it is not possible to observe the entire function, the data for each experimental unit consist of a number of noisy observations of the function at various points. I am interested in methods for analyzing such data that are based on an underlying statistical model. This allows conclusions to be drawn about the likelihood of real differences in the observed curves from, say, two different groups of subjects.

Lawton et al. (1972) proposed the self-modeling approach for functional data. Their method is based on the assumption that all individuals' response curves have a common shape and that a particular individual's curve is some simple transformation of the common shape curve. This basic model can be expanded by adding the assumption that the transformation parameters are random in the population (Lindstrom, 1995), which allows a more natural approach to complex data.
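One step of such a fit can be sketched for the simplest transformation, a pure time shift; real self-modeling also fits amplitude and scale parameters and estimates the common shape itself rather than assuming it known:

```python
import math

def fit_shift(t, y, shape, candidates):
    """Shift-only self-modeling step: pick the transformation parameter
    (a single time shift s) minimizing sum_i (y_i - shape(t_i - s))^2
    over a grid of candidate shifts."""
    def sse(s):
        return sum((yi - shape(ti - s)) ** 2 for ti, yi in zip(t, y))
    return min(candidates, key=sse)

t = [0.1 * i for i in range(63)]
y = [math.sin(ti - 0.4) for ti in t]          # one individual's curve, shifted
s_hat = fit_shift(t, y, math.sin, [0.01 * j for j in range(101)])
```

In the full model, estimating one such transformation per individual (and treating the parameters as random effects) is what aligns the curves around the common shape.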

This methodology can also be generalized to two-dimensional response curves such as those that arise in speech kinematics and other areas of research on motion (Ladd and Lindstrom, 2000). These parameterized curves are usually obtained by recording the two-dimensional location of an object over time. In this setting, time is the independent variable, and the (two-dimensional) location in space is the response. Collections of such parameterized curves can be obtained either from one subject or from a number of different subjects, each producing one or more repetitions of the response curve.

Finally, the methods for two-dimensional, time-parameterized curves can be extended to model outlines (closed curves) collected from medical or other images. For example, in a study of autistic and normally developing children, the outlines of the corpus callosum were collected from brain MRIs. Self-modeling allows us to model the outlines, describe the variability within each group, and assess the existence of meaningful differences between the groups.