Using Pilot Study to Increase Efficiency in Clinical Trials
A pilot study is often conducted to determine the sample size required
for a clinical trial. Because of differences in the sampling
environments, the pilot study data are usually discarded after the
sample size calculation; they are not combined with
the new data for the final analysis. This paper uses the pilot
information to modify the subsequent testing procedure when the
t-test is used to compare two treatments. The new test maintains the
required Type I error probability, but increases the
power and consequently reduces the sample size requirement. It loses
power only when the pilot study is a bad sample, i.e., one that bears little
resemblance to the real data. Because such a sample is unlikely,
the new approach is a viable alternative to current practice.
This is joint work with Sam Wu.
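As a rough illustration of the conventional practice described above (using the pilot only to size the trial, not the modified test proposed in the talk), the following Python sketch plugs a pilot estimate of the common standard deviation into the usual normal-approximation sample size formula for a two-sided two-sample t-test; all numbers are hypothetical.

    # Sketch: per-arm sample size for a two-sided two-sample t-test, using a
    # pilot estimate of sigma (normal approximation).  Illustrative values only;
    # this is the conventional "discard the pilot" calculation, not the new test.
    import numpy as np
    from scipy import stats

    def n_per_arm(delta, sd, alpha=0.05, power=0.80):
        z_a = stats.norm.ppf(1 - alpha / 2)
        z_b = stats.norm.ppf(power)
        return int(np.ceil(2 * ((z_a + z_b) * sd / delta) ** 2))

    rng = np.random.default_rng(0)
    pilot = rng.normal(loc=0.0, scale=1.2, size=20)   # hypothetical pilot sample
    sd_pilot = pilot.std(ddof=1)                      # pilot estimate of sigma
    print(n_per_arm(delta=0.5, sd=sd_pilot))          # required patients per arm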
A Shrinkage Estimator for Spectral Densities
We propose a shrinkage estimator for spectral densities based on a multilevel
normal hierarchical model. The first level captures the sampling
variability via a likelihood constructed using the asymptotic properties of
the periodogram. At the second level, the spectral density is shrunk towards
a parametric time series model. To avoid selecting a particular parametric
model for the second level, a third level is added which induces an estimator
that averages over a class of parsimonious time series models. The
estimator derived from this model, the model averaged shrinkage
estimator (MASE), is consistent, is shown to be highly competitive with
other spectral density estimators via simulations, and is computationally
inexpensive.
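To fix ideas, the sketch below assembles the ingredients named above in a deliberately simplified form: a raw periodogram, a small class of parsimonious AR models combined with AIC weights, and a fixed-weight shrinkage of the log-periodogram toward that model average. It is a schematic illustration only, not the hierarchical MASE estimator; the series, orders, and shrinkage weight are all assumptions.

    # Schematic sketch: shrink a log-periodogram toward an AIC-weighted average
    # of AR(p) spectra.  Not the MASE estimator; all tuning choices are made up.
    import numpy as np
    from scipy.signal import lfilter
    from statsmodels.tsa.ar_model import AutoReg

    def ar_spectrum(phi, sigma2, freqs):
        # Spectral density of an AR(p) model at frequencies given in cycles/unit.
        k = np.arange(1, len(phi) + 1)
        transfer = 1 - np.exp(-2j * np.pi * np.outer(freqs, k)) @ phi
        return sigma2 / (2 * np.pi * np.abs(transfer) ** 2)

    rng = np.random.default_rng(1)
    x = lfilter([1.0], [1.0, -0.7, 0.2], rng.normal(size=512))       # toy AR(2) series
    n = len(x)
    freqs = np.arange(1, n // 2) / n
    pgram = np.abs(np.fft.fft(x)[1:n // 2]) ** 2 / (2 * np.pi * n)   # periodogram

    fits = [AutoReg(x, lags=p, trend="n").fit() for p in (1, 2, 3, 4)]
    aic = np.array([f.aic for f in fits])
    w = np.exp(-(aic - aic.min()) / 2)
    w /= w.sum()                                                     # AIC model weights
    avg = sum(wi * ar_spectrum(f.params, f.sigma2, freqs) for wi, f in zip(w, fits))

    lam = 0.5                                                        # fixed shrinkage weight
    shrunk = np.exp(lam * np.log(avg) + (1 - lam) * np.log(pgram))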
Projected Multivariate Linear Models for Directional Data
We consider the spherically projected multivariate linear model for
directional data in the general d-dimensional case. This model treats
directional observations as projections onto the unit sphere of unobserved
responses from a multivariate linear model. A formula for the mean resultant
length of the projected normal distribution, which is the underlying
distribution of the model, is presented for the general d-dimensional
case. We show that maximum likelihood estimates for the model can be computed
using iterative methods. We develop a test for equal mean directions of
several populations assuming a common unknown concentration. An example with
three-dimensional anthropological data is presented to demonstrate the
application of the test. We also consider random effects models for
directional data, and show how maximum likelihood estimates can be computed.
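The data-generating view in the abstract can be sketched in a few lines of Python: directional observations arise by projecting unobserved multivariate normal responses onto the unit sphere, and the mean resultant length summarizes their concentration. The dimension and parameter values below are hypothetical.

    # Sketch: simulate projected-normal directional data in d dimensions and
    # compute the sample mean resultant length.  Parameter values are made up.
    import numpy as np

    rng = np.random.default_rng(2)
    d, n = 3, 500
    mu = np.array([1.5, 0.5, 1.0])                        # mean of the latent responses
    y = rng.multivariate_normal(mu, np.eye(d), size=n)    # unobserved linear-model responses
    u = y / np.linalg.norm(y, axis=1, keepdims=True)      # observed directions on the sphere

    rbar = np.linalg.norm(u.mean(axis=0))                 # mean resultant length, in [0, 1]
    print(rbar)                                           # near 1 = highly concentrated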
Multiclass Cancer Diagnosis Using Bayesian Kernel Machine Models
Precise classification of tumors is critical for cancer diagnosis
and treatment. In recent years, several studies have shown successful
classification of tumor types using gene expression patterns, so
gene expression data are proving to be a very promising tool in
cancer diagnosis. However, simultaneous classification across
a heterogeneous set of tumor types has not yet been well studied.
Usually, such multicategory classification problems are solved
using binary classifiers, which may fail in a variety of
circumstances. We tackle the problem of cancer classification in
the context of multiple tumor types. We develop a fully
probabilistic, model-based approach, specifically the probabilistic
relevance vector machine (RVM), as well as support vector machines
(SVM), for multicategory classification. A hierarchical model is
also proposed in which the unknown smoothing parameter is interpreted
as a shrinkage parameter. We assign a prior distribution to it and
obtain its posterior distribution using Bayesian computation. In
this way, we obtain not only the point predictors but also
the associated measures of uncertainty.
We also propose a Bayesian variable selection method for selecting
the differentially expressed genes, integrated with our RVM and
SVM models for improved classification. Our method makes use of
mixture priors and Markov chain Monte Carlo techniques to identify
the important predictors (genes) and classify samples
simultaneously. We have applied our methods to two different
microarray datasets to identify differentially expressed genes and
compared their classification performance with that of existing
methods.
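For readers who want a baseline to compare against, here is a minimal (non-Bayesian) multiclass SVM sketch in Python on simulated gene-expression-style data; the RVM, hierarchical priors, and MCMC gene selection described in the abstract are not reproduced, and every quantity below is synthetic.

    # Baseline sketch: multiclass SVM on synthetic "gene expression" data.
    # Only the plain SVM baseline; the Bayesian RVM / variable selection
    # machinery described in the abstract is not implemented here.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # 100 "tumor samples", 1000 "genes", 4 tumor classes -- all simulated
    X, y = make_classification(n_samples=100, n_features=1000, n_informative=20,
                               n_classes=4, n_clusters_per_class=1, random_state=0)

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    print(cross_val_score(clf, X, y, cv=5).mean())   # multiclass handled one-vs-one internally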
Testing for Misspecification of Parametric Models
Because model misspecification can lead to
inconsistent and inefficient estimators and invalid
tests of hypotheses, testing for misspecification is
critically important. The IOS test recently proposed
by Presnell & Boos (2004) is a general purpose
goodness-of-fit test which can be applied to assess
the adequacy
of a wide variety of parametric models without
specifying an alternative model. The test is based on
the ratio of in-sample and out-of-sample likelihoods,
and can be viewed asymptotically as a contrast between
two estimates of the information matrix that are equal
under correct model specification. The statistic is
asymptotically normally distributed, but parametric
bootstrap is recommended for computing the p-value of
the test. Using properties of locally asymptotically
normal parametric models we prove that the parametric
bootstrap provides a consistent estimate of the null
distribution of the IOS statistic under quite general
conditions. Finally, we compare the performance of the
IOS test with existing goodness-of-fit tests in
several applications and through simulations involving
models such as logistic regression, Cox
proportional-hazards regression, beta-binomial models,
zero-inflated Poisson models, and first-order autoregressive
models. While the IOS test is broadly
applicable and not restricted to a particular model,
the applications show that its performance is
comparable or superior to that of other
goodness-of-fit tests. Moreover, the IOS test is
automatic and easy to employ, and is useful in
situations where the Information Matrix test is
otherwise the only competitor.
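The in-sample versus out-of-sample contrast can be sketched directly. The Python snippet below computes a leave-one-out version of the log likelihood ratio for a simple exponential working model and calibrates it with a parametric bootstrap; the model, sample size, and bootstrap size are assumptions for illustration, not the Presnell & Boos implementation.

    # Toy sketch: in-sample vs. leave-one-out likelihood contrast for an
    # exponential working model, with a parametric bootstrap p-value.
    import numpy as np
    from scipy import stats

    def ios_stat(x):
        # sum_i [ log f(x_i; theta_hat) - log f(x_i; theta_hat_(-i)) ]
        rate = 1.0 / x.mean()                        # full-data MLE of the rate
        total = 0.0
        for i in range(len(x)):
            loo_rate = 1.0 / np.delete(x, i).mean()  # leave-one-out MLE
            total += (stats.expon.logpdf(x[i], scale=1 / rate)
                      - stats.expon.logpdf(x[i], scale=1 / loo_rate))
        return total

    rng = np.random.default_rng(3)
    x = rng.gamma(shape=2.0, scale=1.0, size=100)    # data that are not exponential
    obs = ios_stat(x)

    rate_hat = 1.0 / x.mean()
    boot = [ios_stat(rng.exponential(scale=1 / rate_hat, size=len(x)))
            for _ in range(200)]                     # parametric bootstrap under the fit
    print(obs, np.mean(np.array(boot) >= obs))       # statistic and bootstrap p-value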
Tips For Improving Your Public Speaking Skills
Public speaking is a fundamental skill required for a successful professional
career. Since a talk at a professional meeting or job interview can make
or break you, it is imperative that you take this task seriously. The goal
of this talk is to indicate positive steps for improving your public
speaking skills. The talk begins by recounting the 13 worst disasters I
have observed, including examples of miserable presentations by prominent
researchers. I then indicate positive steps you might take to improve your
own presentations by discussing the structure of a well-organized talk and by
offering tips for avoiding commonly made mistakes.
Estimation of Bayes Factors for Nonparametric Bayes Problems
via Radon-Nikodym Derivatives
We consider situations in Bayesian analysis where the prior is
indexed by a hyperparameter taking on a continuum of values. We
distinguish some arbitrary value of the hyperparameter, and consider
the problem of estimating the Bayes factor for the model indexed by
the hyperparameter vs. the model indexed by the distinguished
point, as the hyperparameter varies. We assume that we are able to
get iid or Markov chain Monte Carlo samples from the posterior
distribution for a finite number of the priors. We show that it is
possible to estimate the family of Bayes factors if we are able to
calculate the likelihood ratios for any pair of priors, and we show
how to obtain these likelihood ratios in some nonparametric Bayesian
problems. We illustrate the methodology through two detailed
examples involving model selection problems.
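One way to make the key identity concrete: when only the prior changes, the Bayes factor between hyperparameter values h and h0 equals the posterior expectation, under h0, of the prior ratio. The Python sketch below checks this in a toy conjugate normal example (known variance, prior $\theta \sim N(0, \tau^2)$); the data and hyperparameter values are assumptions.

    # Sketch of BF(h, h0) = E[ nu_h(theta) / nu_h0(theta) | data, h0 ] in a toy
    # conjugate normal model; all specific values are made up for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.normal(loc=0.7, scale=1.0, size=30)
    n, xbar = len(x), x.mean()

    tau0, tau = 1.0, 3.0                            # distinguished and varying hyperparameters
    post_var = 1.0 / (n + 1.0 / tau0 ** 2)          # posterior under tau0 (conjugate normal)
    post_mean = post_var * n * xbar
    theta = rng.normal(post_mean, np.sqrt(post_var), size=100_000)

    bf_mc = np.mean(stats.norm.pdf(theta, 0, tau) / stats.norm.pdf(theta, 0, tau0))
    bf_exact = (stats.norm.pdf(xbar, 0, np.sqrt(tau ** 2 + 1 / n))
                / stats.norm.pdf(xbar, 0, np.sqrt(tau0 ** 2 + 1 / n)))
    print(bf_mc, bf_exact)                          # the two should agree closely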
Sign and Signed-Rank Tests for Clustered Data
In this talk, I will present some recent results concerning
sign and signed-rank tests for the univariate and multivariate
one-sample problem with cluster correlated data. The major
part of the talk will focus on a new weighted multivariate
sign test. Asymptotic properties including Pitman ARE and
some simulation results will be presented.
Measures of Financial Risk: Connections With Quantile Regression and an
Application to the Discovery of Optimal Alloys
This will be a two-part talk focusing on newly discovered connections
between the seemingly disparate topics of measures of risk (finance),
selection differentials (genetics), and the now well-established field
of quantile regression first introduced by Koenker and Bassett (1978)
(statistics, econometrics). The first part will look at estimation and
discuss asymptotic properties of some common measures of risk. A
connection is made between nonparametric estimators of the upper tail
mean, the Conditional Value-at-Risk (CVaR), and the mean improvement
when selecting the top fraction of subjects in genetic breeding
experiments. We will then show that in a regression context, estimators
derived from a generalization of least absolute deviations (quantile
regression), coincide with estimators obtained by minimizing the
residual error distance of the upper tail mean from the overall mean.
The second part will be an application to modeling the mechanical
properties of steel alloys via quantile regression. Model predictions
are shown to give comparable results to manual selection when used to
rank steels. We will show how the models can thus be used in conjunction
with physical data to find better steels by solving a large constrained
optimization problem. The approach may greatly accelerate the discovery
of new optimized alloys, hitherto a slow and laborious process relying
exclusively on physical experimentation.
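As a small numerical companion to the first part, the Python sketch below computes a nonparametric upper-tail mean (a CVaR-style quantity) for a sample and fits a quantile regression at the same level with statsmodels; the data, model, and level are hypothetical.

    # Sketch: nonparametric upper-tail mean (CVaR-style) and a quantile regression
    # fit at the same level.  Data and the choice tau = 0.90 are illustrative only.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    y = rng.lognormal(mean=0.0, sigma=0.8, size=1000)   # hypothetical "loss" sample
    tau = 0.90
    q = np.quantile(y, tau)                             # quantile (Value-at-Risk level)
    print(q, y[y >= q].mean())                          # upper-tail mean (CVaR)

    x = rng.normal(size=1000)                           # covariate for a regression example
    resp = 1.0 + 2.0 * x + rng.standard_t(df=4, size=1000)
    fit = smf.quantreg("resp ~ x", pd.DataFrame({"resp": resp, "x": x})).fit(q=tau)
    print(fit.params)                                   # conditional tau-quantile line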
Generating a Uniformly Random Configuration from Various
Combinatorial Urn Models
In traditional combinatorial urn models, a number of balls are to be randomly
distributed into a number of urns. In some models the balls may or may not be
distinguishable; and likewise, the urns may or may not be distinguishable. We
present a very general algorithm that induces several Markov chains whose
stationary distributions are uniform on the set of configurations in many of
these models. We will also explore the mixing times of these Markov chains.
Specific examples that we will consider are a random walk on the hypercube,
the Ehrenfest urn model, a Markov chain on the partitions of a set of objects,
and a random walk among Young diagrams.
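As one concrete instance of the chains just listed, the Python sketch below runs a lazy random walk on the hypercube $\{0,1\}^d$, which is the Ehrenfest urn in disguise: at each step a ball is chosen uniformly and, with probability 1/2, moved to the other urn. The dimension and run length are arbitrary.

    # Sketch: lazy random walk on the hypercube (Ehrenfest-style urn).  The lazy
    # step keeps the chain aperiodic; its stationary law is uniform on {0,1}^d,
    # so the urn-1 count should look Binomial(d, 1/2).  Sizes are arbitrary.
    import numpy as np
    from scipy.stats import binom

    rng = np.random.default_rng(6)
    d, steps = 10, 50_000
    state = np.zeros(d, dtype=int)                 # all balls start in urn 0
    counts = np.zeros(d + 1, dtype=int)            # occupancy of urn 1 over time

    for _ in range(steps):
        i = rng.integers(d)                        # pick a ball uniformly at random
        if rng.random() < 0.5:                     # lazy coin flip
            state[i] ^= 1                          # move it to the other urn
        counts[state.sum()] += 1

    print(counts / steps)                          # empirical occupancy distribution
    print(binom.pmf(np.arange(d + 1), d, 0.5))     # uniform-stationary prediction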
Markov Chain Conditions for Admissibility
Suppose that $X$ is a random vector with density $f(x|\theta)$ and
that $\pi(\theta|x)$ is a proper posterior density corresponding to an
improper prior $\nu(\theta)$. The prior is called strongly admissible
if the formal Bayes estimator of every bounded function of $\theta$ is
admissible under squared error loss. Eaton (1992, Annals of
Statistics) showed that recurrence of a certain Markov chain, $W$,
defined in terms of $f$ and $\nu$, implies the strong admissibility of
$\nu$. Hobert and Robert (1999, Annals of Statistics) showed that $W$
is recurrent if and only if a related Markov chain, $\tilde{W}$, is
recurrent. I will show that when $X$ is an exponential random
variable, a fairly thorough analysis of the Markov chain $\tilde{W}$
is possible and this leads to a simple sufficient condition for strong
admissibility. I will also explain how the relationship between $W$
and $\tilde{W}$ can be used to establish that certain perturbations of
strongly admissible priors retain strong admissibility. (This is
joint work with M. Eaton, G. Jones, D. Marchev and J. Schweinsberg.)
The Analysis of Means
The Analysis of Means (ANOM) is a method for comparing means, rates, and
proportions. The basic idea is to create an ANOM decision chart (similar in
appearance to a Shewhart process control chart) on which means, rates, or
proportions are plotted along with decision lines associated with a particular
level of significance. Similarly, the Analysis of Means for Variances (ANOMV)
employs a decision chart on which sample variances are plotted. Since analysis-of-means-type
tests offer an easy-to-understand decision chart that is useful
for conveying results to non-statisticians, they are often a nice alternative
to the ANOVA-F test, Pearson's chi-squared test for an I by 2 table,
one-way Poisson regression, or Bartlett's test. There is also an ANOM
version of Levene's test. This talk will present some basics for
practitioners who might want to use ANOM in collaborative work. Some SAS
examples using PROC ANOM (now available in version 9.1) will be presented.
Results from a study comparing ANOM to Pearson's test will be presented
as well as a SAS program that produces accurate power estimates useful in
prospective or retrospective applications for both balanced and unbalanced
binomial data. The SAS program also estimates the probability that a
particular population "signals" by plotting outside the ANOM
decision limits. Those wanting more information can visit analysisofmeans.com.
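For readers without SAS, the Python sketch below mimics an ANOM decision chart calculation for balanced one-way data. The decision limits follow the usual balanced-case form, but the exact critical value h (as used by PROC ANOM or ANOM tables) is replaced here by a Bonferroni-adjusted t quantile, so this is only a rough approximation; all data are simulated.

    # Rough sketch of ANOM decision limits for balanced one-way data.  A Bonferroni
    # t quantile stands in for the exact ANOM critical value h; toy data only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    I, n, alpha = 5, 8, 0.05
    groups = [rng.normal(loc=m, scale=1.0, size=n) for m in (0.0, 0.0, 0.3, 0.0, 1.0)]

    means = np.array([g.mean() for g in groups])
    grand = means.mean()                                   # grand mean (balanced case)
    mse = np.mean([g.var(ddof=1) for g in groups])         # pooled variance, df = I*(n-1)
    df = I * (n - 1)

    h = stats.t.ppf(1 - alpha / (2 * I), df)               # approximate critical value
    half = h * np.sqrt(mse * (I - 1) / (I * n))
    lower, upper = grand - half, grand + half

    for i, m in enumerate(means):                          # a group "signals" if it plots outside
        print(i, round(m, 3), "signal" if (m < lower or m > upper) else "ok")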
Sampling From Massive Databases
The largest databases are now so large that answering complicated,
analytic queries can take hours or days. One alternative to providing
exact answers to queries is to make use of approximation techniques,
with sampling being a prime candidate. The application of sampling to
database queries provides a challenging set of problems, since
sampling in a database environment is quite different from traditional
finite-population or survey sampling. For example, in a database
nearly all queries compute complicated functions involving operations
such as set differences over multiple populations, making inference
particularly challenging. Another difference is that in a database,
it is possible to pre-compute and store a large number of exact
population totals offline, and then use those values to help with
subsequent inference problems. This talk will consider some of the
challenges involved in applying sampling techniques to the largest
databases.
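To make the flavor of the problem concrete, here is a minimal Python sketch of answering a SUM query approximately from a uniform row sample, with a rough normal-theory error bound. The "table" is simulated, the estimator is the simplest possible scale-up, and none of the harder multi-population or set-difference issues mentioned above are addressed.

    # Sketch: approximate SUM(column) from a uniform sample of rows, with a crude
    # normal-theory confidence interval (finite-population correction ignored).
    import numpy as np

    rng = np.random.default_rng(8)
    N = 1_000_000
    sales = rng.gamma(shape=2.0, scale=50.0, size=N)     # stand-in for a huge table column

    m = 10_000
    sample = rng.choice(sales, size=m, replace=False)    # uniform sample of rows
    est = N * sample.mean()                              # scale-up estimate of the SUM
    se = N * sample.std(ddof=1) / np.sqrt(m)
    print(est, est - 1.96 * se, est + 1.96 * se)         # approximate 95% interval
    print(sales.sum())                                   # exact answer, for comparison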
Applications of Dirac's Delta Function in Statistics
The Dirac delta function has been used for many years in mathematical
physics. The purpose of this talk is to bring attention to several
useful applications of this function in mathematical statistics. Some
of these applications include a unified representation of the
distribution of a function (or functions) of one or several random
variables, which may be discrete or continuous, a proof of a well-known
inequality, and a representation of a density function in terms of its
noncentral moments.
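As one instance of the representation referred to above (stated here for the continuous case, a standard identity rather than anything specific to the talk): if $Y = g(X)$ and $X$ has density $f_X$, then $f_Y(y) = \int \delta(y - g(x))\, f_X(x)\, dx$; taking $g(x_1, x_2) = x_1 + x_2$, for example, recovers the usual convolution formula for the density of a sum of independent random variables.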
Estimation of the Mediation Effect with a Binary Mediator
A mediator acts as a third variable in the causal pathway between a
risk factor and an outcome. We consider the estimation of the mediation
effect when the mediator is a binary variable. We give a precise definition of
the mediation effect and examine asymptotic properties of five different
estimators of the mediation effect. Our theoretical developments, which are
supported by a Monte Carlo study, show that the estimators that account for
the binary nature of the mediator are consistent for the mediation effect
while other estimators are inconsistent. We use these estimators to study
the mediation effect of chronic cerebral infarction in the causal relationship
between the apolipoprotein E epsilon-4 allele and cognitive function
among 233 deceased participants from the Religious Orders Study, a
longitudinal, clinical-pathologic study of aging and Alzheimer's disease.
Correlation and Large-Scale Significance Testing
Large-scale hypothesis testing problems, with hundreds or thousands
of test statistics "z[i]" to consider at once, have become commonplace in
current practice. Applications of popular analysis methods such as false
discovery rates do not require independence of the z[i]'s but their accuracy
can be compromised in high-correlation situations. This talk discusses
methods, both theoretical and computational, for assessing the size and
effect of correlation in large-scale testing problems. Two microarray
examples will be used to illustrate the ideas. The examples show surprisingly
large correlations that badly destabilize standard statistical analyses, but
newer methods can remedy at least part of the trouble.
Fifty Years of Empirical Bayes
Scientific inference is the process of reasoning from observed
data back to its underlying mechanism. The two great schools of statistical
inference, Bayesian and frequentist, have competed over the past two
centuries, often bitterly, for scientific supremacy. Empirical Bayes, a
novel hybrid, appeared in the early 1950s, showing promise of immense
possible gains in inferential accuracy. Nevertheless it has languished
in the statistics literature, with its gains viewed as suspicious and
even paradoxical by Bayesians and frequentists alike. New scientific
technology, exemplified by DNA microarrays, has suddenly revived interest
in empirical Bayes methods. This talk, which is aimed at a general scientific
audience, examines the ideas involved through a series of real examples,
and proceeds with a minimum of technical development.
Some Lower Bounds to the Variances of Estimating Functions
Godambe (1960) obtained a lower bound to the variance of an estimating function
under some regularity conditions. We shall obtain here a lower bound to the
variance which does not require Wald-Wolfowitz regularity conditions. When
there is only one nuisance parameter, we shall, using Bhattacharyya's (1946)
scores, obtain another lower bound to the variance of an estimating function
for the parameter of interest. The use of these results in obtaining an
optimal estimating function in the presence of nuisance parameters will be
indicated.
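For reference, the classical bound being extended can be stated as follows in the one-parameter case, under the usual regularity conditions: for an unbiased estimating function $g(x, \theta)$, $E_\theta[g(X, \theta)^2] / \{E_\theta[\partial g(X, \theta)/\partial \theta]\}^2 \ge 1 / E_\theta[\{\partial \log f(X; \theta)/\partial \theta\}^2]$, with equality attained by the score function.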
Comparison of Bayesian Objective Procedures for Variable
Selection in Linear Regression
In the objective Bayesian approach to variable selection in regression,
a crucial point is the encompassing of the underlying nonnested linear
models. Once the models have been encompassed one can define objective
priors for the multiple testing problem involved in the variable selection
problem.
There are two natural ways of encompassing: one way is to encompass
all models into the model containing all possible regressors, and the other
one is to encompass the model containing the intercept only into any other.
In this paper we compare the variable selection procedures that result
from each of the two mentioned ways of encompassing. Relations with
frequentist criteria for model selection, such as those
based on the adjusted R2, the lasso, and Mallows' Cp, are provided incidentally.