Estimation of Generalized Simple Measurement Error Models with
Instrumental Variables
Measurement error (ME) models are used in situations where
at least one independent variable in the model is imprecisely
measured. Having at least one independent variable measured
with error leads to an unidentified model and a bias in
the naive estimate of the effect of the variable that is
measured with error. One way to correct these problems is
through the use of an instrumental variable (IV). An IV is a variable that is
correlated with the unknown, or latent, true variable, but uncorrelated with
both its measurement error and the model error. An IV provides the
identifying information in our method of estimating the
parameters for generalized simple measurement error (GSME)
models. The GSME model is developed, and it is shown how
many well-studied ME models with one predictor fit into
its framework. Included among these are linear, generalized
linear, nonlinear, multinomial, multivariate regression,
and other ME models. The GSME model, by design, can handle
continuous, discrete, and categorical observable,
or manifest, variables. We provide theorems that give conditions
under which the GSME model is identified. The initial step
in our estimation method is to "categorize"
all continuous and discrete variables. Categorical variables
remain unchanged. Assuming conditional independence given
the latent variable, the joint distribution of the categorized
manifest variables, together with any that were already categorical,
is obtained by summing, over the categorical values of the latent
variable, the product of the conditional cell probabilities and the
conditional distributions of the categorized continuous and discrete
manifest variables. Maximum likelihood estimates of the joint
categorical distribution yield nonlinear equations in the parameters
of interest, which enter through the conditional probabilities;
these equations are solved by estimated generalized nonlinear
least squares. We show that our estimators
have favorable asymptotic properties and develop methods
of inference for them. We show how many commonly studied
ME model problems fit into the general framework developed
and how they can be solved using our method.
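For concreteness, a minimal sketch of a simple ME set-up with an instrument, in generic notation that is not necessarily that of the GSME paper: the latent predictor $X$ enters the outcome model, the manifest $W$ measures $X$ with error, and the instrument $Z$ satisfies the correlation conditions stated above,
$$ Y = g(X;\beta) + e, \qquad W = X + U, $$
$$ \operatorname{Cov}(Z,X) \neq 0, \qquad \operatorname{Cov}(Z,U) = 0, \qquad \operatorname{Cov}(Z,e) = 0. $$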
The In-and-Out-of-Sample (IOS) Likelihood Ratio Test
for Model Misspecification
A new test of model misspecification is proposed, based on the ratio
of in-sample and out-of-sample likelihoods. The test is broadly
applicable, and in simple problems approximates well known and
intuitively appealing methods. Using jackknife influence curve
approximations, it is shown that the test statistic can be viewed
asymptotically as a multiplicative contrast between two estimates of
the information matrix, both of which are consistent under correct
model specification. This approximation is used to show that the
statistic is asymptotically normally distributed, though it is
suggested that p-values be computed using the parametric bootstrap.
The resulting methodology is demonstrated with a variety of examples
and simulations involving both discrete and continuous data. This is
joint work with Dennis Boos.
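As a hedged illustration of the ingredients only (not the authors' exact statistic), the sketch below contrasts the in-sample log-likelihood at the full-data MLE with a leave-one-out, out-of-sample log-likelihood for an i.i.d. normal working model; the function name and the normal model are illustrative assumptions, and in practice the p-value would be obtained by parametric bootstrap as suggested above.

```python
import numpy as np
from scipy import stats

def ios_statistic(x):
    """Sketch: in-sample log-likelihood at the full-data MLE minus the sum of
    leave-one-out (out-of-sample) log-likelihood contributions, for an i.i.d.
    normal working model."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu_hat, sd_hat = x.mean(), x.std()                      # full-data MLEs
    loglik_in = stats.norm.logpdf(x, mu_hat, sd_hat).sum()
    loglik_out = 0.0
    for i in range(n):
        x_i = np.delete(x, i)                               # leave observation i out
        loglik_out += stats.norm.logpdf(x[i], x_i.mean(), x_i.std())
    return loglik_in - loglik_out
```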
Seminar page
Estimating Size and Composition of Biological Communities by
Modeling the Occurrence of Species
We develop a model that uses repeated observations
of a biological community to estimate the number and
composition of species in the community. Estimators of
community-level attributes are constructed from model-based estimators of
occurrence of individual species that also incorporate imperfect
detection of individuals. Data from the North
American Breeding Bird Survey are analyzed to illustrate the variety
of ecologically important quantities that are easily constructed and
estimated using our model-based estimators of species occurrence.
In particular, we compute site-specific estimates of species
richness that honor classical notions of
species-area relationships. Extensions of our model may be used
to estimate maps of occurrence of individual species and to
compute inferences related to the temporal and spatial dynamics of
biological communities.
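Schematically, and in notation that is assumed here rather than taken from the paper, a site-specific richness estimate of this kind adds up estimated occurrence probabilities across the species list,
$$ \widehat{N}_j = \sum_{i=1}^{N} \widehat{\Pr}(\text{species $i$ occurs at site $j$} \mid \text{data}), $$
with each occurrence probability estimated from a model that also accounts for imperfect detection.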
Seminar page
Anti-eigen and anti-singular values of a matrix and
applications to problems in statistics
Let $A$ be a $p\times p$ positive definite matrix. A nonzero
$p$-vector $x$ such that $Ax=\lambda x$ is called an eigenvector with
associated eigenvalue $\lambda$. Equivalent characterizations are:
(i) $\cos \theta=1$, where $\theta$ is the angle between $x$ and $Ax$.
(ii) $(x^\prime Ax)^{-1}=x^\prime A^{-1}x$ (for $x$ normalized so that $x^\prime x=1$).
(iii) $\cos \Phi=1$, where $\Phi$ is the angle between $A^{1/2}x$ and $A^{-1/2}x$.
We ask: for which $x$ is $\cos\theta$ as defined in (i) a minimum, or
equivalently, the angle of separation between $x$ and $Ax$ a
maximum? Such a vector is called an anti-eigenvector and $\cos\theta$
an anti-eigenvalue of $A$. This is the basis of operator trigonometry
developed by K. Gustafson and P.D.K.M. Rao (1997), {\it Numerical
Range: The Field of Values of Linear Operators and Matrices},
Springer. We may define a measure of departure from condition (ii) as
$\min[(x^\prime Ax)(x^\prime A^{-1}x)]^{-1}$ which gives the same
anti-eigenvalue. The same result holds if the maximum of the angle
$\Phi$ between $A^{1/2}x$ and $A^{-1/2}x$ as in condition (iii) is
sought. We define a hierarchical series of anti-eigenvalues, and also
consider optimization problems associated with measures of separation
between an $r\,(<p)$-dimensional subspace $S$ and its transform $AS$.
Similar problems are considered for a general matrix $A$ and its
singular values leading to anti-singular values.
Other possible definitions of anti-eigen and anti-singular values, and
applications to problems in statistics will be presented.
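As a hedged numerical check of the first anti-eigenvalue (the closed form below is the standard operator-trigonometry result for a positive definite matrix; the particular matrix is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)          # an arbitrary positive definite matrix

lam, V = np.linalg.eigh(A)           # eigenvalues in increasing order
lam1, lamp = lam[0], lam[-1]         # extreme eigenvalues
v1, vp = V[:, 0], V[:, -1]           # corresponding eigenvectors

# closed form for the first anti-eigenvalue (the minimum of cos(theta))
anti_eigenvalue = 2 * np.sqrt(lam1 * lamp) / (lam1 + lamp)

# an anti-eigenvector: a weighted combination of the two extreme eigenvectors
x = np.sqrt(lamp / (lam1 + lamp)) * v1 + np.sqrt(lam1 / (lam1 + lamp)) * vp
cos_theta = (x @ A @ x) / (np.linalg.norm(x) * np.linalg.norm(A @ x))

print(anti_eigenvalue, cos_theta)    # the two values agree
```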
Seminar page
Score Statistics for Mapping Quantitative
Trait Loci
We propose a method to detect the existence of a quantitative trait locus (QTL)
in a backcross population using a score test. Under the null hypothesis of no
QTL, all phenotype random variables are independent and identically
distributed, and the maximum likelihood estimates (MLEs) of the parameters in
the model are usually easy to obtain. Since the score test uses only the MLEs
of the parameters under the null hypothesis, it is computationally simpler than
the likelihood ratio test (LRT). Moreover, because the location parameter of
the QTL is unidentifiable under the null hypothesis, the distribution of the
maximum of the LRT statistics, typically the statistic of choice for testing
$H_0:$ no QTL, does not have the standard chi-square distribution
asymptotically under the null hypothesis. From the simple structure of the
score test statistics, the asymptotic null distribution can be derived for the
maximum of the squared score
test statistics. Numerical methods are proposed to compute the
asymptotic null distribution and the critical thresholds can be
obtained accordingly. A simple backcross design is used to
demonstrate the application of the score test to QTL mapping.
The proposed method can be readily extended to more complex
situations. This is joint work with Rongling Wu, Samuel S. Wu and
George Casella.
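Schematically, in generic notation not taken from the paper: for a putative QTL location $d$, the squared score statistic has the form
$$ T(d) = \frac{U_d(\hat\theta_0)^2}{\widehat{\operatorname{Var}}\{U_d(\hat\theta_0)\}}, $$
where $U_d$ is the score for the QTL effect evaluated at the null MLEs $\hat\theta_0$, and the genome-wide test of $H_0:$ no QTL rejects for large values of $\max_d T(d)$.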
Seminar page
Honest MCMC via Drift and Minorization
I call an MCMC algorithm "honest" if it is possible to calculate
standard errors using a valid central limit theorem along with a
consistent estimate of the (unknown) asymptotic variance. In this
talk, I will explain how drift and minorization conditions on the
underlying Markov chain make honest MCMC possible. (This is joint
work with Galin Jones, Brett Presnell and Jeffrey Rosenthal.)
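One concrete payoff of such a result, sketched here in Python under the assumption that a consistent batch-means variance estimator is justified (the estimator used in the talk may differ):

```python
import numpy as np

def batch_means_se(chain):
    """Batch-means Monte Carlo standard error for the mean of an MCMC output
    sequence; the validity of this kind of estimator is what a CLT plus a
    consistent asymptotic-variance estimate ("honest MCMC") delivers."""
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    b = int(np.floor(np.sqrt(n)))              # batch size of order sqrt(n)
    a = n // b                                 # number of batches
    batch_means = chain[:a * b].reshape(a, b).mean(axis=1)
    var_hat = b * batch_means.var(ddof=1)      # estimate of the asymptotic variance
    return np.sqrt(var_hat / n)
```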
Seminar page
Accounting for Uncertainty About Variances and Correlations
in Small Area Estimation
Small area estimation often uses a linear model applied to direct
survey small area estimates. The model includes regression effects,
small area random effects (model error), and sampling error. Often
the magnitude of the sampling error greatly exceeds that of the model
error, resulting in considerable uncertainty about properties
(variances, correlations) of the model error based on statistical
inferences from the direct estimates data. In this situation,
Bayesian techniques have important advantages over classical
approaches for making inferences about the true quantities of
interest. This is illustrated for the estimation of state age-group
poverty rates in the Census Bureau's SAIPE (Small Area Income and
Poverty Estimates) project. SAIPE applies a small area model to
direct state poverty estimates from the Current Population Survey
(CPS). Particular issues that arise for the SAIPE state model include
(i) estimating the mean squared error of the model-based estimates,
(ii) assessing whether borrowing strength from CPS estimates for the
previous year or for a different age group can improve the estimates, and
(iii) assessing whether borrowing strength from estimates from
another survey (American Community Survey) will improve the poverty
estimates.
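In schematic form (a standard area-level formulation of the kind described; the notation is assumed here rather than quoted from the talk):
$$ y_i = \theta_i + e_i, \qquad \theta_i = x_i^\prime \beta + u_i, \qquad u_i \sim N(0,\sigma_u^2), \quad e_i \sim N(0,\psi_i), $$
where $y_i$ is the direct survey estimate for area $i$, $u_i$ is the model error, and $e_i$ is the sampling error whose variance $\psi_i$ may greatly exceed $\sigma_u^2$.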
Seminar page
A Penalized Likelihood Approach to Principal Component
Rotation
Principal component analysis, widely used to explore
multidimensional data sets, is susceptible to high sampling variability.
Sampling variability contributes to the difficulty frequently encountered
in interpreting individual principal components. Rotation techniques
(borrowed from factor analysis) and some recent alternatives have
been proposed for producing more interpretable principal components,
but none has a definite relationship to sampling variability.
Principal components can be made simultaneously more stable and interpretable
if estimated via penalized likelihood, with a penalty
term that encourages the components to assume a simpler form.
In contrast to other methods for principal component rotation,
penalized likelihood can be more precisely controlled,
naturally targets components that are ill-defined, and allows assessment of
the statistical plausibility of the modified components. Simple variants
of the penalty term can be used to preferentially rotate the major components
or to respect special structure.
The estimation requires an optimization subject to orthogonality constraints,
for which efficient specialized algorithms have recently been developed. A
Stiefel manifold variant of the Newton algorithm appears to work well for this
problem and has potential applications to related statistical
problems.
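Schematically, and only as a hedged sketch in generic notation, the estimate solves a problem of the form
$$ \max_{V:\,V^\prime V = I}\;\; \ell(V,\Lambda;\ \text{data}) \;-\; \kappa\, Q(V), $$
where $\ell$ is the log-likelihood of the component directions $V$ (with associated variances $\Lambda$), $Q$ is a penalty that rewards simpler loadings, and $\kappa$ controls how strongly the components are pulled toward the simpler form; the orthogonality constraint on $V$ is what calls for the Stiefel manifold algorithms mentioned above.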
Seminar page
Characterizing and Eradicating Autocorrelation in
MCMC Algorithms
In spite of increasing computing power, many statistical problems
give rise to Markov chain Monte Carlo algorithms that mix too slowly
to be useful. Most often, this poor mixing is due to high posterior
correlations between parameters. In the illustrative special case of
Gaussian linear models with known variances, we characterize which
functions of parameters mix poorly and demonstrate a practical way of
removing all autocorrelations from the algorithms by adding Gibbs
steps in suitable directions. These additional steps are also very
effective in practical problems with nonnormal posteriors, and are
particularly easy to implement.
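A toy illustration of the idea, not the paper's general construction: for a bivariate Gaussian target with correlation $\rho$, the ordinary Gibbs sampler mixes slowly along the direction $(1,1)$, and one extra Gibbs step along that direction removes the autocorrelation in that function of the parameters. The target, the direction, and the conditional used below are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.99                          # high correlation -> slow ordinary Gibbs
n_iter = 20000
x = np.zeros(2)
draws = np.empty((n_iter, 2))

for it in range(n_iter):
    # ordinary Gibbs steps for a N(0, [[1, rho], [rho, 1]]) target
    x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho ** 2))
    x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho ** 2))
    # extra Gibbs step along the slow direction u = (1, 1): the full conditional
    # of the step size t in the move x -> x + t*u is N(-(x1 + x2)/2, (1 + rho)/2)
    t = rng.normal(-(x[0] + x[1]) / 2, np.sqrt((1 + rho) / 2))
    x = x + t
    draws[it] = x
```

After the extra step, $x_1 + x_2$ is an exact draw from its marginal, independent of the previous state, so its autocorrelation disappears, while $x_1 - x_2$ already mixed well under the ordinary steps.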
Seminar page
On Some Probabilistic and Statistical Aspects of Discovery of New
Species
Following Chao (Ann Stat., 1984), Chao and Lee (Jour Amer Stat
Assoc, 1992) and Sinha and Sengupta (Cal Stat Assoc Bull, 1993), we
consider the problem of discovery of ALL species in a species search
problem under an infinite population set-up. Assuming that there are $N$
species with abundance rates $p_1 \ge p_2 \ge \cdots \ge p_N$, we work out the
probability $\Phi(t, N, \mathbf{p})$ that more than $t$ successive attempts are needed
to discover all the species. Denoting by $\phi(p,t,N)$ the probability
of discovery of ALL species in $t$ successive attempts with the species
having abundance rate $p$ being observed last, we establish the intuitive
claim: $\phi(p,t,N) \ge \phi(q,t,N)$ whenever $p \le q$, uniformly in $t > 0$.
Next we establish that $\Phi(t,N,\mathbf{p})$ is the least [uniformly in $t > 0$]
when the species are equally abundant.
This demonstrates further that the average number of successive attempts
needed to observe all species is the least when they are equally abundant.
Results for a finite closed population will also be stated.
Finally, data analysis aspects will be addressed.
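A small Monte Carlo sketch of the phenomenon, using an assumed five-species example (not data from the talk): the mean number of successive attempts needed to see every species is smallest when the abundance rates are equal.

```python
import numpy as np

rng = np.random.default_rng(1)

def attempts_to_see_all(p, rng):
    """Number of successive draws needed to observe all N species when each
    draw falls on species i with probability p[i]."""
    N = len(p)
    seen, t = set(), 0
    while len(seen) < N:
        seen.add(int(rng.choice(N, p=p)))
        t += 1
    return t

N, reps = 5, 5000
equal   = np.full(N, 1.0 / N)
unequal = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
mean_equal   = np.mean([attempts_to_see_all(equal, rng)   for _ in range(reps)])
mean_unequal = np.mean([attempts_to_see_all(unequal, rng) for _ in range(reps)])
print(mean_equal, mean_unequal)   # mean_equal is markedly smaller
```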
Seminar page
Practical Small-Sample Inference for Order One Subset Autoregressive
Models via Saddlepoint Approximations
We propose some approaches for saddlepoint approximation of the distributions
of estimators in single-lag subset autoregressive models of order one.
The roots-of-estimating-equations method is the most promising, as it does not
require that an explicit expression for the estimator be available.
We find the distributions of the
Burg estimators to be almost coincident with those of maximum
likelihood. By inverting a hypothesis test, we show how confidence
intervals for the autoregressive
coefficient can be constructed from saddlepoint approximations. A
simulation study reveals the resulting coverage probabilities to be
very close to nominal levels,
and to generally outperform asymptotics-based confidence intervals. The
reason is shown to be linked to near parameter
orthogonality between the autoregressive coefficient and the white
noise variance. Our findings are
substantiated by percent relative error calculations
that show the saddlepoint approximations to be very accurate.
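For reference, the generic saddlepoint density approximation underlying such calculations (stated here in its standard form, not as the paper's specific derivation) is
$$ \hat f(x) = \left\{2\pi K^{\prime\prime}(\hat s)\right\}^{-1/2} \exp\{K(\hat s) - \hat s x\}, \qquad \text{where } K^{\prime}(\hat s) = x, $$
with $K$ the cumulant generating function of the quantity being approximated and $\hat s$ the saddlepoint.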
Seminar page
On Bayesian Wombling: Modelling Spatial Gradients
Spatial process models are now widely used for inference in many
areas of application. In such contexts interest often lies in estimating
the rate of change of a spatial surface at a given location in a given
direction. This problem, known as ``wombling'' after a foundational paper
by William Womble (Science, 1951), is encountered in several scientific
disciplines. Examples include temperature or rainfall gradients in
meteorology, pollution gradients for environmental data and surface
roughness assessment for digital elevation models. Since the spatial
surface is viewed as a random realization, all such rates of change are
random as well. We formalize the notions of directional finite difference
processes and directional derivative processes, building upon the concept
of mean square differentiability, and obtain complete distribution theory
results for stationary Gaussian process models. We present inference under
a Bayesian framework which, in this setting, presents several modelling
advantages. Illustrations are provided with simple and complex spatial
models.
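In schematic notation (assumed here, not quoted from the talk), the directional finite difference process in direction $u$ at scale $h$ and its limiting directional derivative process are
$$ D_{u,h}Y(s) = \frac{Y(s+hu) - Y(s)}{h}, \qquad D_u Y(s) = \lim_{h \to 0} D_{u,h}Y(s), $$
the limit being taken in mean square, which is where mean square differentiability of the process enters.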
Seminar page
Monte Carlo Methods for Posterior Distributions Associated with
Multivariate Student's $t$ Data
Let $\pi$ denote the posterior distribution that results when a random
sample of size $n$ from a $d$-dimensional location-scale Student's $t$
distribution (with $\nu$ degrees of freedom) is combined with the
standard non-informative prior. We consider several Monte Carlo
methods for sampling from the intractable posterior $\pi$, including
rejection samplers and Gibbs samplers. Special attention is paid to
the MCMC algorithm developed by van Dyk and Meng (JCGS, 2001) who
provided considerable empirical evidence suggesting that their
algorithm converges to stationarity much faster than the standard data
augmentation Gibbs sampler. We formally analyze the relevant Markov
chain underlying van Dyk and Meng's algorithm and show that for many
$(d,\nu,n)$ triples it is geometrically ergodic.
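For orientation, a minimal sketch of the standard data augmentation Gibbs sampler that serves as the benchmark above, written for the univariate case ($d=1$) under the usual prior $\pi(\mu,\sigma^2) \propto 1/\sigma^2$; the parametrization and defaults are assumptions of this sketch, and van Dyk and Meng's faster algorithm modifies this scheme.

```python
import numpy as np

def da_gibbs_t(x, nu, n_iter=5000, rng=None):
    """Standard data-augmentation Gibbs sampler for a univariate location-scale
    Student's t model: latent weights w_i make x_i | w_i ~ N(mu, sig2 / w_i),
    with prior pi(mu, sig2) proportional to 1/sig2."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, sig2 = x.mean(), x.var()
    draws = np.empty((n_iter, 2))
    for it in range(n_iter):
        # w_i | mu, sig2, x ~ Gamma((nu+1)/2, rate = (nu + (x_i - mu)^2 / sig2) / 2)
        w = rng.gamma((nu + 1) / 2, 2.0 / (nu + (x - mu) ** 2 / sig2))
        # mu | sig2, w, x ~ N(weighted mean, sig2 / sum(w))
        mu = rng.normal((w * x).sum() / w.sum(), np.sqrt(sig2 / w.sum()))
        # sig2 | mu, w, x ~ Inverse-Gamma(n/2, sum(w_i (x_i - mu)^2) / 2)
        sig2 = 1.0 / rng.gamma(n / 2, 2.0 / (w * (x - mu) ** 2).sum())
        draws[it] = mu, sig2
    return draws
```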
Seminar page
Bayesian Nonparametric Binary Regression using a Gaussian process prior
Binary response ($Y$) data corresponding to a continuous covariate is often
encountered in practice. In the classical parametric approach, the response
probability $p(z)=P(Y=1|z)$, as a function of the covariate $z$, is modeled as
a known function $H(z,\theta)$, except for a finite dimensional parameter
$\theta$. The link function $H$ is usually taken to be a cumulative
probability distribution. The parametric approach may, however, fail to
accommodate a variety of data. The nonparametric approach consists of
estimating the whole link function. Assuming that the link function $H$ is a
cumulative probability distribution, a Bayesian approach can be followed by
putting a Dirichlet process prior (or a smoother version of it) on $H$. If the
true response probability is not a monotone function of the covariate (which
happens in toxicity studies and some other examples), the above
approach may lead to serious bias. In this talk, we present a method of
estimating the response probability based on local smoothness assumptions
only, without any monotonicity restriction. We follow a Bayesian approach by
putting a (transformed) Gaussian process prior on the response probabilities.
Such a prior has large support in that no shape is ruled out by the prior. We
describe a Markov chain Monte Carlo method to compute the posterior mean of
the response probability function $p(z)$ and posterior credible bands. If the
design points satisfy certain requirements (which are satisfied by the
regularly spaced design, or with large probability, by random designs) and the
true response probability $p_0(z)$ is twice continuously differentiable and
never 0 or 1, we show that the posterior distribution of the response
probability function $p$ is consistent for the $L_1$-distance $d(p_1,p_2)=\int
|p_1(z)-p_2(z)| dz$. In order to establish this result, we also extend the
classical theorem of Schwartz (1965, Z. Wahr. Verw. Gebiete 4,
10--26) to independent non-identically distributed observations. It is shown
that consistency holds if the prior assigns positive probabilities to some
suitable neighborhoods, and certain uniformly exponentially powerful tests
exist on some appropriate sieves. Some deep properties of Gaussian processes
are exploited to show that the required conditions hold. A simulation study
shows that our method performs well in comparison with other methods of
estimation available in the literature. We also apply our method to some real
data.
The talk is based on joint work with Nidhan Choudhuri of Case Western
Reserve University, and Anindya Roy of University of Maryland, Baltimore
County.
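To convey what the prior looks like, a small sketch that draws one random response-probability curve $p(z)=\Phi(W(z))$ from a transformed Gaussian process on a grid; the squared-exponential kernel and its parameters are illustrative choices, not those of the talk.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
z = np.linspace(0, 1, 200)                                    # covariate grid
K = np.exp(-0.5 * (z[:, None] - z[None, :]) ** 2 / 0.1 ** 2)  # squared-exponential kernel
W = rng.multivariate_normal(np.zeros(len(z)), K + 1e-8 * np.eye(len(z)))
p = norm.cdf(W)   # one prior draw of the response-probability function p(z)
```

Because $W$ is unrestricted, the transformed curve $p$ can be non-monotone, which is exactly the flexibility the method requires.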
Seminar page
Optimal Design and Efficient Analysis of Two-Phase
Case-Control and Case-Cohort Studies
Two phase or double sampling was introduced by Neyman as a technique for drawing efficient stratified samples. We consider this design initially
for the case of a binary outcome variable. When strata and phase two sampling fractions depend on both outcome and covariates, care must be
taken with the analysis of data from the resulting biased sample. The standard survey sampling (Horvitz-Thompson) approach involves
weighting the log-likelihood contributions by the inverse sampling fractions. "Pseudolikelihood" estimates maximize a product of biased sampling
probabilities and are typically more efficient. Only recently have semiparametric efficient and maximum (profile) likelihood procedures become
available.
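As a hedged sketch of the Horvitz-Thompson idea described above (illustrative names and a plain logistic model, not the efficient procedures developed in the talk), each subject's log-likelihood contribution is weighted by the inverse of its phase-two sampling fraction:

```python
import numpy as np
from scipy.optimize import minimize

def ht_weighted_logistic(y, X, sampling_fraction):
    """Inverse-probability (Horvitz-Thompson) weighted logistic regression:
    weight each log-likelihood contribution by 1 / (phase-two sampling fraction)."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(sampling_fraction, dtype=float)

    def neg_loglik(beta):
        eta = X @ beta
        # weighted Bernoulli log-likelihood: sum_i w_i [y_i * eta_i - log(1 + exp(eta_i))]
        return -np.sum(w * (y * eta - np.logaddexp(0.0, eta)))

    return minimize(neg_loglik, np.zeros(X.shape[1]), method="BFGS").x
```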
We first develop asymptotic information bounds and the form of the efficient score and influence functions for coefficients in semiparametric
regression models fitted to two phase stratified samples, and point out the relationship of this work to the information bound calculations of
Robins and colleagues. By verifying conditions of Murphy and van der Vaart for a least favorable parametric submodel, we provide asymptotic
justification for statistical inference based on profile likelihood.
Using data from the National Wilms Tumor Study, and simulations based on these data, we then demonstrate the advantages of careful selection
of the phase two sample and use of an efficient analysis method. One can limit collection of data on an "expensive" covariate to a fraction of the
phase one sample, yet almost exactly reproduce the results that would have been obtained with complete data for everyone. The basic principles
include: (i) fine stratification of the phase one sample using outcome and available covariates; (ii) use of a near optimal "balanced" rather than a
simple case-control sample at phase two; and (iii) fully efficient estimation. These same basic principles extend to the design and analysis of two
phase exposure stratified case-cohort studies, which involve a censored "survival" outcome and on which much work is in progress.
Portions of this work are joint with Nilanjan Chatterjee, Brad McNeney and Jon Wellner.
Seminar page
The Value of Long Term Follow-up: Lessons from the
National Wilms Tumor Study
Wilms tumor (WT) is an embryonal tumor of the kidney that affects approximately one child in every 10,000. During the 20th century, cure rates
increased from 10% to 90% as first radiation and then chemotherapy joined surgical removal of the diseased kidney as standard treatment. The
National Wilms Tumor Study Group (NWTSG) conducted five protocol studies and registered nearly 10,000 patients during 1969-2002. For the
past 15-20 years the study enrolled 70-80% of the 550 cases estimated to occur annually in North America. Its focus has been the
identification of patient subgroups at high and low risk of relapse, and the substitution of combination chemotherapy for radiation therapy, with
a primary goal to reduce long term complications while producing the maximum number of cures.
The NWTSG Data and Statistical Center, located in Seattle for the 33 year duration of the study, played a major role in this effort. Systematic
follow-up of surviving patients documented the long term "costs of cure" and the wisdom of reserving the most toxic treatments for those
who actually needed them. Secondary cancers, for example, which once affected 1.6% of Wilms tumor survivors by 15 years from diagnosis,
have been much reduced since 60% of patients no longer receive radiation therapy.
Statistical study of the NWTSG database has challenged prevailing theories for the genetic origins of WT and led to new hypotheses for
investigation by molecular biologists.
This talk considers three issues: (1) whether all bilateral and multicentric WT are hereditary; (2) whether Asians lack WT caused by loss of
imprinting of the insulin-like growth factor gene IGF2; and (3) whether constitutional deletion of the WT gene WT1 in patients with the WT-aniridia
(WAGR) syndrome has a less severe effect on renal function than a point mutation in WT1 in patients with the Denys-Drash syndrome. Key
factors that facilitated these statistical contributions include a compulsive effort to maintain continuity in data collection and follow-up and a
constant search for ways to use the clinical and
epidemiologic data to answer questions of basic
biological significance.
Seminar page