| Title: An Introduction to the Probabilistic Modeling of Text |
|
Shibasish Dasgupta (Sep. 21)
The management of large and growing collections of information is a central goal of modern statistical science. Data repositories of texts have become widely accessible, thus necessitating good methods of retrieval, organization, and exploration. Probabilistic models have been paramount to these tasks, used in settings such as text classification, information retrieval, text segmentation, and information extraction. These methods entail two stages:
(1) Estimate or compute the posterior distribution of the parameters of a probabilistic model from a collection of text; and
(2) For new documents, answer the question at hand (e.g., classification, retrieval) via probabilistic inference.
The goal of such modeling is document generalization. Given a new document, how is it similar to the previously seen documents? Where does it fit within them? What can one predict about it? Efficiently answering such questions is the focus of the statistical analysis of document collections.
In this talk, I’ll consider the problem of modeling text corpora. The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments. I’ll discuss the basic methodology for text corpora that has been successfully deployed in modern Internet search engines.
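The two stages above can be made concrete with a minimal sketch, not drawn from the talk itself: a toy multinomial (bag-of-words) model is estimated per class from a small labeled collection, and a new document is then classified by probabilistic inference. The corpus, uniform class prior, and Laplace smoothing are all illustrative assumptions.

```python
# Stage (1): estimate smoothed per-class word probabilities from a tiny
# labeled corpus.  Stage (2): classify a new document by comparing
# class log-posteriors.  All data here is made up for illustration.
import math
from collections import Counter

corpus = {
    "sports": ["the team won the game", "a great game for the players"],
    "finance": ["the market fell on bad earnings", "stocks rose as earnings beat"],
}

# Stage 1: Laplace-smoothed multinomial word probabilities per class.
vocab = {w for docs in corpus.values() for d in docs for w in d.split()}
counts = {c: Counter(w for d in docs for w in d.split())
          for c, docs in corpus.items()}

def log_prob(word, c):
    return math.log((counts[c][word] + 1) / (sum(counts[c].values()) + len(vocab)))

# Stage 2: posterior inference for a new document (uniform class prior).
def classify(doc):
    scores = {c: sum(log_prob(w, c) for w in doc.split() if w in vocab)
              for c in corpus}
    return max(scores, key=scores.get)

print(classify("the players won"))  # -> "sports"
```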
|
| Title: Posterior Consistency in Bayesian Regression Models |
|
Doug Sparks (Sep. 28)
Consistency is among the most fundamental properties that we expect to be
satisfied by any reasonable statistical estimator. While Bayesian methods
have expanded to cover increasingly diverse types of data, it is often
taken for granted that these Bayesian procedures are consistent in the
frequentist sense. We will examine the circumstances under which the
Bayes estimator is consistent for a wide variety of regression models,
including interesting cases where the Bayes estimator is not consistent.
Results and concepts from inference and probability will be explained as
needed in order to make the topic accessible to all students.
|
| Title: The Receiver Operating Characteristic Curve (A Brief Introduction) |
|
Claudio Fuentes (Oct. 5)
Consider medical tests with results that are not simply positive
or negative, but that are measured on a continuous or ordinal scale.
Assume that larger values of the test results, say Y, are more indicative
of a disease. Then, the values of Y are needed to make a dichotomous
decision, namely, whether the disease is present or not. This problem is
fundamental to the evaluation of medical tests and the choice of a
suitable threshold for the values of Y is crucial. Implicitly, the choice
of a threshold depends on the trade-off that is acceptable between failing
to detect a disease and falsely identifying the disease with the test.
In the context of the problem, the Receiver Operating Characteristic
(ROC) curve is among the best developed statistical tools to describe the
range of trade-offs that can be achieved by the test and evaluate its
performance.
Although ROC curves are nowadays widely used in medicine and related
fields, their origin goes back to the 1950s and, since then, they have
been extensively used in signal detection theory, among other disciplines.
In this talk, I will present a brief introduction to the topic, with some
emphasis on a few properties and practical difficulties associated with them.
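The trade-off described in the abstract can be traced numerically: sweeping the threshold on Y yields one (false-positive rate, true-positive rate) point per cutoff, and the resulting curve can be summarized by its area. The synthetic Y values below are illustrative, not data from the talk.

```python
# Sketch: the empirical ROC curve for a continuous test Y, where larger
# Y is more indicative of disease.  Values are invented for illustration.
diseased = [2.5, 3.1, 1.8, 2.9, 3.5]   # Y for diseased subjects
healthy  = [1.2, 0.8, 2.0, 1.5, 1.1]   # Y for healthy subjects

def roc_point(threshold):
    """False-positive and true-positive rates at a given cutoff."""
    tpr = sum(y >= threshold for y in diseased) / len(diseased)
    fpr = sum(y >= threshold for y in healthy) / len(healthy)
    return fpr, tpr

# Sweep cutoffs from conservative (high) to liberal (low).
thresholds = sorted(set(diseased + healthy), reverse=True)
curve = [(0.0, 0.0)] + [roc_point(t) for t in thresholds]

# Area under the curve by the trapezoidal rule.
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(curve, curve[1:]))
print(auc)  # -> 0.96
```

Each threshold choice corresponds to one point on the curve, which is exactly the acceptable trade-off between missed and falsely identified disease that the abstract describes.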
|
| Title: Small Area Estimation when Auxiliary Information is Measured with Error |
|
Meixi Guo (Oct. 12)
The problem of small area estimation occurs when the sample is not large
enough to support direct estimates of adequate precision; one therefore
turns to indirect, model-based estimates for producing small area
estimates. When the auxiliary information used in the model is measured
with error, which is quite common in practice, the usual small area
estimator that ignores the measurement error can be worse than the
direct estimator. An interesting paper by Ybarra and Lohr (2008)
considered such circumstances but did not provide a second-order
unbiased estimator for the MSE of the EBLUP. We propose alternative
approaches, such as profile likelihood and integrated likelihood, to
develop a second-order unbiased MSE estimator for the EBLUP using
Taylor expansion.
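The contrast between the direct and model-based estimators can be sketched with the standard composite (EBLUP-style) small area estimate, which shrinks the direct estimator toward a synthetic model prediction. The numbers are illustrative, and this simple form ignores measurement error in the auxiliary variable, which is precisely the complication the talk addresses.

```python
# Sketch: composite small area estimate, shrinking a noisy direct
# estimate toward a synthetic (model-based) one.  All values invented.
def eblup(direct, synthetic, sampling_var, model_var):
    """gamma * direct + (1 - gamma) * synthetic; gamma is the shrinkage weight."""
    gamma = model_var / (model_var + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic

# An area with a high-variance direct estimate leans on the model:
print(eblup(direct=12.0, synthetic=10.0, sampling_var=4.0, model_var=1.0))
# -> 10.4
```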
|
| Title: The Dirichlet Process and a Multivariate Extension |
|
Jeremy Gaskins (Oct. 19)
Bayesian nonparametrics is an important and growing field of statistics,
which is concerned with making Bayesian-style inference without making
strong distributional assumptions about model parameters. The Dirichlet
Process has been the foundation of Bayesian nonparametrics, in part,
because it encourages clustering of the parameter under consideration.
The first portion of the seminar will introduce (or recall) the
Dirichlet Process and a few of its key features. Unfortunately, the
Dirichlet Process can have undesirable behavior in a multivariate
setting. We will introduce the Matrix Stick-Breaking Process (MSBP) of
Dunson, Xue, and Carin (2008) as a remedy. The talk is
intended to be accessible to students with no previous experience with
the Dirichlet Process.
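For readers new to the topic, the stick-breaking construction of a Dirichlet Process draw can be sketched in a few lines: G = Σ_k w_k δ(θ_k), where the weights come from breaking a unit stick and the atoms come from the base distribution. The truncation level and standard normal base measure below are illustrative assumptions, not part of the talk.

```python
# Sketch: truncated stick-breaking draw G ~ DP(alpha, G0) with G0 = N(0,1).
import random

def dp_stick_breaking(alpha, n_atoms, seed=0):
    """Return weights w_k and atoms theta_k of a truncated DP draw."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)   # V_k ~ Beta(1, alpha)
        weights.append(remaining * v)     # w_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - v              # stick left after the k-th break
    atoms = [rng.gauss(0.0, 1.0) for _ in range(n_atoms)]  # theta_k ~ G0
    return weights, atoms

w, theta = dp_stick_breaking(alpha=1.0, n_atoms=50)
print(sum(w))  # close to 1 when the truncation is long enough
```

Because G is discrete with probability one, repeated draws from G hit the same atoms, which is the clustering behavior the abstract mentions.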
|
| Title: Bayesian Model Selection For Incomplete Data Using The Posterior Predictive Distribution |
|
Arkendu Chatterjee (Oct. 26)
Model choice is a fundamental and much discussed activity in the
analysis of data sets. When several parametric models are under
consideration, we need to determine how well they fit the observed data.
The model selection problem involves the distributions of various
quantities obtained by considering a probability model for the observables
Y conditioned on each model "m" and its parameter vector. We choose the
model with the best value of the model selection criterion.
We have explored the use of a posterior predictive loss criterion for
model selection with incomplete longitudinal data. We show that a
straightforward extension of the Gelfand and Ghosh (1998) criterion to
incomplete data introduces an extra term, in addition to the
goodness-of-fit and penalty terms, that compromises the criterion. We
have proposed an alternative and explored it via simulations and on a
real data set.
Key Words: Posterior Predictive Distribution, DIC, Pattern Mixture Model,
Selection Model.
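For complete data, the goodness-of-fit and penalty terms of the Gelfand and Ghosh (1998) criterion can be sketched directly from posterior predictive draws; the squared-error version below, with invented inputs, is a rough illustration rather than the talk's method.

```python
# Sketch: squared-error Gelfand-Ghosh criterion D = G + P for complete
# data: G sums squared distances of observations from posterior
# predictive means, P sums posterior predictive variances.
def gelfand_ghosh(y_obs, y_rep):
    """y_rep[s][i] is the i-th observation in the s-th predictive draw."""
    n, s = len(y_obs), len(y_rep)
    means = [sum(rep[i] for rep in y_rep) / s for i in range(n)]
    variances = [sum((rep[i] - means[i]) ** 2 for rep in y_rep) / (s - 1)
                 for i in range(n)]
    goodness = sum((y - m) ** 2 for y, m in zip(y_obs, means))  # G term
    penalty = sum(variances)                                    # P term
    return goodness + penalty

# Toy example: two observations, two posterior predictive draws.
print(gelfand_ghosh([1.0, 2.0], [[1.0, 2.0], [3.0, 4.0]]))  # -> 6.0
```

Smaller D is better; overfit models shrink G but pay for it through P, and the talk's point is that a naive extension to incomplete data adds a third term that upsets this balance.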
|
| Title: Asymptotic Variance Evaluations in Discrete Markov Chains |
|
Nabanita Mukherjee (Nov. 2)
Markov chain Monte Carlo (MCMC) methods have become widely used in
various statistical applications as well as in theoretical approaches
to statistical computing. The motivation for this computer-based
simulation method is the possibly intractable nature of the distribution
of the quantity of interest. Suppose we are interested in the expected
value of a function f of a random variable X with probability
distribution $\pi$; in many cases $\pi$ is known only up to a
normalizing constant.
For a given $\pi$, there are many Markov chains that preserve the same
stationary distribution. Orderings defined on Markov chains sharing a
specified stationary distribution therefore guide us in choosing one
chain over another, in terms of lower asymptotic variance.
We propose different methods of constructing a better Markov chain
from a given chain in terms of the Peskun ordering (Peskun, 1973).
Since preserving stationarity while constructing a better chain is
delicate, the Metropolis-Hastings algorithm comes to the rescue. We
also propose an algorithm for obtaining the optimal transition matrix
that does not require knowledge of the normalizing constant of $\pi$.
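The point about the normalizing constant can be illustrated with a minimal Metropolis-Hastings sketch: the acceptance ratio uses only the ratio of unnormalized densities, so the constant of $\pi$ cancels. The target (a standard normal kernel), proposal scale, and function f below are illustrative choices, not the talk's setting.

```python
# Sketch: Metropolis-Hastings estimate of E[f(X)] when pi is known only
# up to a normalizing constant.
import math
import random

def unnormalized_pi(x):
    return math.exp(-0.5 * x * x)   # N(0,1) kernel, constant dropped

def metropolis_hastings(n_samples, step=1.0, seed=0):
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)   # symmetric random walk
        # Only the ratio of unnormalized densities appears, so the
        # normalizing constant of pi cancels.
        if rng.random() < unnormalized_pi(proposal) / unnormalized_pi(x):
            x = proposal
        samples.append(x)
    return samples

draws = metropolis_hastings(20000)
f = lambda x: x * x                # under N(0,1), E[f(X)] = 1
estimate = sum(f(x) for x in draws) / len(draws)
print(estimate)
```

The asymptotic variance of such an estimate depends on which stationarity-preserving chain is used, which is exactly what the Peskun ordering compares.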
|