___________________________________________________________________________________________________________________________________________

 

ST 732 - Spring 2002

Applied Longitudinal Data Analysis

Marie Davidian, 209G Patterson Hall

 

Course objective:

To introduce students to statistical models and methods for the analysis of longitudinal data, i.e. data collected repeatedly on experimental units over time (or other conditions).

 

Course prerequisites:

ST 512 (formerly ST 701), Experimental Statistics for Biological Sciences II, or equivalent. Thus, students should be familiar with basic notions of probability, random variables, and statistical inference, analysis of variance, and (multiple) linear regression. Familiarity with matrix algebra is also useful; we will review this at the beginning of the course and make use of matrix notation and operations throughout. ST 512 makes heavy use of SAS; thus, students are expected to have had some exposure to the use of SAS. The course is meant to be accessible both to non-majors and majors. Thus, the underlying mathematical theory will not be stressed, and the main focus will be on applications. Please see the instructor if you have questions about the suitability of your background.

 

Course topics:

Preliminaries: Introduction, Review of matrix algebra, random vectors, multivariate normal distribution, review of linear regression

Classical methods for normally distributed, balanced repeated measurements: Univariate repeated measures analysis of variance, Multivariate repeated measures analysis of variance, Drawbacks and limitations of classical methods

Methods for normally distributed, unbalanced repeated measurements: General linear models and models for correlation, Random coefficient models, Linear mixed effects models, Population-averaged vs. Subject-specific modeling

Methods for non-normally distributed, unbalanced data: Probability models for discrete and continuous nonnormal data and generalized linear models, Generalized estimating equations for population-averaged models

Advanced topics (quick overview): Generalized linear mixed effects models, Nonlinear mixed effects models, Missing data mechanisms

____________________________________________________________________________________________________________________________________________

Francesca Dominici

Associate Professor
Department of Biostatistics
Johns Hopkins Bloomberg School of Public Health

 

LONGITUDINAL DATA ANALYSIS 2004 course website

___________________________________________________________________________________________________________________________________________

 

http://www.cacr.ca/news/2002/0204pahwa.htm

 

Statistical Models for the Analysis of Longitudinal Data

Dr. Punam Pahwa, Ph.D., Assistant Professor, Dept. of Community Health and Epidemiology, Royal University Hospital, Saskatoon, Saskatchewan
T. Blair, M.Sc., Ph.D. Candidate (yr2), College of Medicine, University of Saskatchewan

  

Longitudinal Studies
Definition

A longitudinal study is defined as a study in which the response for each experimental unit in the study is observed on two or more occasions. The defining feature of a longitudinal data set is repeated observations on experimental units. Longitudinal data require special statistical methods because the set of observations on one subject tends to be intercorrelated. These correlations must be taken into account to draw valid scientific inferences.

Advantage of longitudinal studies over cross-sectional studies

The major advantage of a longitudinal study over a cross-sectional study is that it can separate cohort and age effects in population studies. The age effect is the change over time within individuals. The cohort effect is the difference among people in their baseline values. Longitudinal studies can distinguish these age and cohort effects while cross-sectional studies cannot, because in cross-sectional data only a single response is available for each of the experimental units (e.g. human subjects, animals, or plants).

The goals of longitudinal research

  1. To characterize patterns of subject responses (e.g. growth, decline in lung function, increase in blood pressure) over "time".
  2. To investigate the effects of important covariates on these patterns.

There are two types of covariates in longitudinal studies

  1. Non-time-varying covariate (e.g. gender, race) - Between subjects 
  2. Time-varying covariates (e.g. age, weight, income, smoking status, exposure) - Within-subjects

Note - It is possible for a covariate to change both within subjects and between subjects. For example, consider a study in which children aged 6 to 12 are each followed for 5 years. Here, information about how the response changes with age can be obtained both from comparisons between subjects and from comparisons between measurements on a single subject.
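The two sources of information in this example can be separated by splitting each child's age into a between-subject part (age at entry) and a within-subject part (years since entry). The following is a minimal sketch only: the record layout of (subject id, age at visit) pairs and the function name are assumptions made for illustration and do not come from the original study.

```python
def split_age(records):
    """Split age into a between-subject and a within-subject component.

    records: list of (subject_id, age) pairs (hypothetical layout).
    Returns a list of (subject_id, baseline_age, years_since_baseline).
    """
    baseline = {}
    for subject_id, age in records:
        # The youngest recorded age of a subject is treated as the age at entry.
        baseline[subject_id] = min(age, baseline.get(subject_id, age))
    return [(sid, baseline[sid], age - baseline[sid]) for sid, age in records]


# Illustrative records for two children, not real data.
visits = [("A", 6), ("A", 7), ("B", 12), ("B", 14)]
components = split_age(visits)
```

The baseline age varies only between subjects, while the years since baseline vary within a subject, so the two columns can enter a regression as separate covariates.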

When each subject is scheduled to be measured at the same set of times (say, t1, t2, …, tn), the resulting data are referred to as equally spaced or balanced data. When subjects are observed at different sets of times and/or there are missing data, the resulting data are referred to as an unequally spaced or unbalanced data set.
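Whether a data set is balanced in this sense can be checked mechanically. The sketch below assumes a hypothetical layout in which each subject id maps to the list of times at which that subject was actually measured; the names are illustrative.

```python
def is_balanced(times_by_subject):
    """Return True when every subject was measured at the same set of times."""
    schedules = {tuple(sorted(times)) for times in times_by_subject.values()}
    return len(schedules) <= 1


# Illustrative measurement times in months, not real data.
balanced = {"s1": [0, 6, 12], "s2": [0, 6, 12]}
unbalanced = {"s1": [0, 6, 12], "s2": [0, 12]}  # s2 missed the month-6 visit
```

A missed visit and a different visit schedule both make the data unbalanced under this check.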

Characteristics of Longitudinal Studies

Common characteristics of longitudinal studies are: (1) correlated responses; (2) observations taken at unequal time points and (3) missing observations. 

The analysis of longitudinal data should therefore take into account firstly, the within subject correlation, secondly the measurements taken at unequal time intervals and finally the missing observations. Repeated measures analysis of variance can be used to analyse longitudinal or repeated measures data for balanced study design, i.e. when all subjects are measured at equal time points and there are no missing data. It is very rare to find balanced data sets in longitudinal studies so it is necessary to use some alternative techniques which can handle unbalanced data. 

The generalized estimating equations (GEE) approach of Liang and Zeger (1986) can handle these complexities of longitudinal studies. GEEs are based on multivariate quasi-likelihood theory, which is an extension of maximum likelihood estimation, and they extend the theory of Generalized Linear Models (GLMs). In longitudinal studies four types of response are encountered: continuous, discrete, count and survival. Continuous and discrete longitudinal data can be analyzed using the GEE approach.

The GEE Approach

The GEE approach is a general method for fitting mathematical models to data involving repeated measurements on the same subject or cluster. The responses may be either discrete or continuous. This method allows the user to account for intra-subject correlations, often treated as nuisance parameters, among repeated measurements on the same subject. Different subjects can have different numbers of repeated measurements.

The correlations are specified in the form of a working correlation matrix, which can have a variety of possible structures. The method estimates model parameters by iteratively solving a system of equations based on quasi-likelihood distributional assumptions. The user can choose from a variety of model forms by specifying a link function; the model form can thus be logistic, log-linear, or linear.
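The role of the link function can be illustrated with the three model forms named above. The sketch below pairs each form with its usual link (identity, log and logit) and the corresponding inverse; the dictionary and function names are illustrative assumptions, not part of any GEE software.

```python
import math

# Each model form maps to a pair (link, inverse link).
LINKS = {
    "linear": (lambda mu: mu, lambda eta: eta),            # identity link
    "log-linear": (math.log, math.exp),                    # log link
    "logistic": (lambda mu: math.log(mu / (1.0 - mu)),     # logit link
                 lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}


def mean_response(model_form, eta):
    """Map a linear predictor eta back to the mean response E(Y)."""
    _, inverse_link = LINKS[model_form]
    return inverse_link(eta)
```

For example, a linear predictor of 0 corresponds to a mean of 0 in the linear form, 1 in the log-linear form and 0.5 in the logistic form.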

When modelling longitudinal data, the primary objective of regression analysis is to identify the relationship between the expected value E(Y) of the response variable Y and the covariates X1, X2, …, Xp. Modelling the correlation structure is of secondary importance; however, it is necessary to take into account any intra-subject response correlation when making statistical inferences about the regression coefficients β1, β2, …, βp. If the intra-subject correlation is ignored, such statistical inferences can be seriously in error. Some of the most commonly used within-subject correlation structures are as follows:

  1. independence, i.e. repeated observations are uncorrelated.
  2. unspecified (unstructured), i.e. the correlations between any two responses are unknown and need to be estimated.
  3. exchangeable, i.e. the correlation between any two responses of the ith individual is the same.
  4. autoregressive of first order [AR(1)], assuming the sampling interval length is the same between any two consecutive observations.
  5. autoregressive of first order, assuming continuous, unequally spaced sampling times {t1, t2, ... , tn}.
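The structured cases in this list can be written down directly as matrices. The sketch below is illustrative only: the function names are hypothetical, rho denotes an assumed correlation parameter, and the unequally spaced case uses the common parametrisation rho ** |tj - tk|, which is an assumption here rather than a form stated in the text. The unstructured case has no formula, since every entry is estimated separately.

```python
def independence(n):
    """Repeated observations are uncorrelated."""
    return [[1.0 if j == k else 0.0 for k in range(n)] for j in range(n)]


def exchangeable(n, rho):
    """Every pair of responses shares the same correlation rho."""
    return [[1.0 if j == k else rho for k in range(n)] for j in range(n)]


def ar1(n, rho):
    """Equally spaced AR(1): correlation rho ** |j - k|."""
    return [[rho ** abs(j - k) for k in range(n)] for j in range(n)]


def ar1_continuous(times, rho):
    """Unequally spaced AR(1): correlation rho ** |tj - tk| (assumed form)."""
    return [[rho ** abs(tj - tk) for tk in times] for tj in times]
```

With equally spaced times 0, 1, 2 the continuous version reduces to the ordinary AR(1) matrix.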

Linear Models For Continuous Response Longitudinal Data

Three types of models can be fitted for a continuous response: Marginal Models; Transitional (or Conditional) Models; and Random-Effects Models.

Marginal Models

When a population is of primary interest, fitting marginal models is the most appropriate. In these models, the population-averaged response is modelled as a function of the covariates. The regression coefficients are interpreted for the population rather than for individuals, so these are known as "population-averaged" (PA) models.

Transitional Linear Models

When the time dependence is central, models for the conditional distribution of Yij given Yi(j-1), Yi(j-2), …, Yi1 may be more appropriate. These are also known as conditional models. The simplest case is a linear model for the conditional mean of Yij given (or conditional on) the observed value Yi(j-1) of the response immediately preceding Yij. Such a model describes the "transition" from Yi(j-1) to Yij and is commonly called a first-order autoregressive or AR(1) model. Second-order, third-order or higher-order autoregressive models are also possible.
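One common way to write a first-order transitional model is E(Yij | Yi(j-1)) = a + g Yi(j-1), where the symbols a and g are introduced here for illustration and are not notation from the original. The sketch below fits this form by ordinary least squares on the pooled pairs (Yi(j-1), Yij). It omits covariates to stay short and assumes a hypothetical layout in which each subject id maps to that subject's responses in time order.

```python
def fit_transition(series_by_subject):
    """Least-squares fit of Yij = a + g * Yi(j-1), pooled over subjects.

    Returns the pair (a, g).
    """
    # Pair each response with the response immediately preceding it.
    pairs = [(ys[j - 1], ys[j])
             for ys in series_by_subject.values()
             for j in range(1, len(ys))]
    n = len(pairs)
    mean_prev = sum(prev for prev, _ in pairs) / n
    mean_curr = sum(curr for _, curr in pairs) / n
    sxy = sum((prev - mean_prev) * (curr - mean_curr) for prev, curr in pairs)
    sxx = sum((prev - mean_prev) ** 2 for prev, _ in pairs)
    g = sxy / sxx
    a = mean_curr - g * mean_prev
    return a, g


# Illustrative noise-free series generated from Yij = 1 + 0.5 * Yi(j-1).
series = {"s1": [0.0, 1.0, 1.5, 1.75], "s2": [4.0, 3.0, 2.5, 2.25]}
a_hat, g_hat = fit_transition(series)
```

Because the illustrative series contain no noise, the fit recovers the generating values of a and g up to rounding error.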

Random Effects Linear Models

Random-effects/mixed-effects models are more appropriate for the study of an individual's growth. These models are known as "subject-specific" (SS) models. Example: a model that has one random effect (b0i) in addition to the usual residual error term (eij) is a Random-Effects Linear model. Such a model is also known as a Mixed-Effects Model because it involves a "MIX" of both fixed (or non-random) effects (b0 and b1) and an extra random effect (b0i). These models are also known as variance components models.
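A model with these terms is commonly written Yij = b0 + b1 xij + b0i + eij, where the covariate xij is an assumption made here for illustration. Under the further assumption that b0i and eij are independent with variances var_b and var_e, any two responses from the same subject have covariance var_b and variance var_b + var_e, so their correlation is var_b / (var_b + var_e). The sketch below computes this quantity and the exchangeable correlation matrix it implies; the function names are hypothetical.

```python
def intraclass_correlation(var_b, var_e):
    """Correlation between two responses of one subject under a random intercept."""
    return var_b / (var_b + var_e)


def implied_correlation_matrix(n, var_b, var_e):
    """Within-subject correlation matrix for n responses of one subject."""
    rho = intraclass_correlation(var_b, var_e)
    return [[1.0 if j == k else rho for k in range(n)] for j in range(n)]


# Illustrative variance components, not estimates from any data.
rho = intraclass_correlation(var_b=2.0, var_e=6.0)
```

This is why a random-intercept model corresponds to the exchangeable structure in the list of working correlation matrices.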

Non-Linear Models For Discrete Response Longitudinal Data

We can fit three types of non-linear models for a discrete response: Marginal models; Transitional (or conditional) models; and Random-effects models.

Marginal models for discrete outcomes

As previously described for marginal linear models, the goal here is to model the marginal mean or expectation E(Yij) as a function of important covariates. The most commonly used models for discrete outcomes are logistic models for dichotomous and polytomous outcomes, and Poisson regression models for counts. For the marginal model, regression coefficients have a population-averaged interpretation. The parameter β1 describes the effect of x1ij on the marginal expectation of the Y's. From it one can calculate the prevalence odds ratio e^β1, which is the ratio of the odds of disease among subjects who have a particular characteristic to the odds of disease among subjects who do not have that characteristic.
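The link between the coefficient and the odds ratio can be checked numerically for a marginal logistic model with a single binary covariate. In the sketch below, beta0 and beta1 are made-up illustrative values rather than estimates from any data, and the helper names are hypothetical.

```python
import math


def inverse_logit(eta):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


beta0, beta1 = -1.0, 0.7  # illustrative values only

p_without = inverse_logit(beta0)        # subjects without the characteristic (x1 = 0)
p_with = inverse_logit(beta0 + beta1)   # subjects with the characteristic (x1 = 1)
odds_ratio = odds(p_with) / odds(p_without)
```

The ratio of the two odds equals exp(beta1) regardless of the value chosen for beta0.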

The four steps involved in longitudinal data analysis are:

  1. the choice of the model (specifying the link function, which describes the model form that you wish to use);
  2. the choice of the variance-covariance structure (specifying the working correlation structure for each subject, e.g. independence, exchangeable, stationary, autoregressive);
  3. assessing the goodness-of-fit of the model; and
  4. assessing the goodness-of-fit of the variance-covariance structure.

Computer Programs: SAS macros, SAS PROC GENMOD, or SAS PROC MIXED

References

  1. Liang, K.Y., and Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.
  2. Harville, D.A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association, 72, 320-340.

Additional Reading
Zeger, S.L., and Liang, K.Y. (1992). An overview of methods for the analysis of longitudinal data. Statistics in Medicine, 11, 1825-1839.

 

 

 

____________________________________________________________________________________________________________________________________________

 

http://www.soc.surrey.ac.uk/sru/SRU28.html

 

Longitudinal Research in the Social Sciences

by Elisabetta Ruspini

Dr Elisabetta Ruspini has a PhD in Sociology and Social Research. She is a post-doctoral research fellow at the Department of Sociology, University of Padova. In 1999 she was an International Visiting Fellow in Social Research Methods at the University of Surrey. Her current research focuses on the feminisation of poverty and women’s poverty dynamics in Italy in a comparative perspective.

Her research interests include gender issues, comparative welfare research, social and family policies, poverty, and the study of living conditions. Within the methodological field, her main interests are longitudinal data analysis and the design and collection of complex data sets such as household panel surveys.

She has published a number of articles and contributed papers to national and international conferences in the fields of longitudinal research and research on poverty.

  • Longitudinal research concerns the collection and analysis of data over time. Longitudinal data are essential if the research purpose is to measure social change: they allow a diachronic analysis of the incidence of conditions and events.
  • Several types of data may be regarded as longitudinal: repeated cross-sectional studies; prospective studies, retrospective studies. Because longitudinal research is a broad term, methods for the analysis of social change may also vary substantially.
  • Longitudinal research can potentially provide fuller information about individual behaviour; however, the use of such data poses crucial theoretical and methodological problems.

 ‘Longitudinal’ is a broad term. It can be defined as research in which:

  • data are collected for each item or variable for two or more distinct periods;
  • the subjects or cases analysed are the same, or at least comparable, from one period to the next; and
  • the analysis involves some comparison of data between or among periods (Menard 1991:4).

There are a number of different designs for the construction of longitudinal evidence: repeated cross-sectional studies; prospective studies, such as household panel surveys or cohort panels; and retrospective studies, such as oral histories and life and work histories.

Repeated cross-sectional studies

In the social sciences, cross-sectional observations are the form of data most commonly used for assessing the determinants of behaviour (Coleman 1981; Davies 1994; Blossfeld and Rohwer 1995). However, the cross-sectional survey, because it is conducted at just one point in time, is not suited for the study of social change. It is therefore common for cross-sectional data to be recorded in a succession of surveys at two or more points in time, with a new sample on each occasion. These samples either contain entirely different sets of cases for each period, or the overlap is so small as to be considered negligible. Where cross-sectional data are repeated over time with a high level of consistency between questions, it is possible to incorporate a time trend into the analysis. Examples of repeated cross-sectional social surveys are: the UK’s General Household Survey and Family Expenditure Survey, and the EU’s Eurobarometer Surveys.

Prospective designs

The temporal data most often available to social researchers are panel data, in which the same individuals are interviewed repeatedly across time. Variations of this design (Buck et al. 1994: 21-22) include:

Representative Panels with a random sample of respondents and repeated data collections at fixed intervals (typically from 2-3 months to a year). Thus panel surveys trace individuals at regular discrete points in time. The fundamental feature they offer is that they make it possible to detect and establish the nature of individual change. For this reason, they are well-suited to the statistical analysis of both social change and dynamic behaviour. Among the best known prospective panel studies are the US Panel Study of Income Dynamics (PSID), the British Household Panel Study (BHPS) and the German Socio-Economic Panel (SOEP).

Cohort Panels can be considered as a specific form of panel study that takes the process of generation replacement explicitly into account. A cohort is defined as those people within a geographically or otherwise delineated population who experienced the same significant life event within a given period of time. Researchers select an age group, or some subset of an age group, and then administer a questionnaire to a sample or to the whole group. Thus, one or more generations are followed over their life course. The interest is usually in the study of long term change and in individual development processes: such studies typically re-interview every five years. If, in each particular generation the same people are investigated, a cohort study amounts to a series of panel studies; if, in each generation, at each period of observation, a new sample is drawn, a cohort study consists of a series of trend studies (Hagenaars 1990). Examples are the UK National Child Development Study and the German Life History Study.

Linked Panels: in these cases, data items which are not collected primarily for panel purposes (Census or administrative data) are linked together using unique personal identifiers. This is the least intrusive method of collecting longitudinal data (Buck et al. 1994).

Retrospective studies (event oriented observation design)

All the data types discussed so far have been recorded with reference to fixed and predetermined time points. But, for many processes within the social sciences, continuous measurement of qualitative variables seems to be the most suitable method of empirically assessing social change. When data are recorded in continuous time, the number and sequence of events and the duration between them can all be calculated. Data recorded in continuous time are often collected retrospectively via life history studies that cover the whole life course of individuals. The main advantage of this approach lies in the greater detail and precision of information (Blossfeld and Rohwer 1995). A good example is the UK 1980 Women and Employment Survey, which obtained very detailed work histories from a nationally representative sample of women of working age in Britain.
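The quantities available from continuous-time records (number of events, their sequence and the durations between them) follow directly from a dated event list. The sketch below assumes a hypothetical layout of (time, label) pairs with time measured in months since the start of observation; the work history shown is invented for illustration and is not taken from any survey.

```python
def summarise_events(events):
    """Summarise a list of (time, label) pairs recorded in continuous time."""
    ordered = sorted(events)  # sort by time, since events may be recalled out of order
    sequence = [label for _, label in ordered]
    durations = [later[0] - earlier[0]
                 for earlier, later in zip(ordered, ordered[1:])]
    return {"count": len(ordered), "sequence": sequence, "durations": durations}


# Invented work history, times in months since leaving school.
work_history = [(30, "left work"), (0, "first job"), (72, "returned part-time")]
summary = summarise_events(work_history)
```

Panel data observed only at fixed survey dates would not support the duration calculation, because the timing of changes between interviews is not recorded.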

Strictly speaking, longitudinal studies are limited to prospective studies, while retrospective studies have been defined as a quasi-longitudinal design, since they do not offer the same strengths for research on causal processes (Hakim 1987:97).

Because several types of data may be regarded as longitudinal, methods for the analysis of social change may also vary substantially: from time-series techniques for repeated cross-section data to logistic and log-linear models; from structural equation models to longitudinal multilevel methods; from regression analysis to event history analysis (Davies and Dale 1994).

Advantages and limitations of longitudinal data

Longitudinal data allow the analysis of duration; permit the measurement of differences or change in a variable from one period to another, that is, the description of patterns of change over time; and can be used to locate the causes of social phenomena (Menard 1991:5) and sleeper effects, that is, connections between events that are widely separated in time (Hakim 1987).

Insights into processes of social change can thus be greatly enhanced by making more extensive use of longitudinal data. Dynamic data are the necessary empirical basis for a new type of dynamic thinking about the processes of social change (Gershuny 1998). The possibility of developing research based on longitudinal data also builds a bridge between ‘quantitative’ and ‘qualitative’ research traditions and enables re-shaping of the concepts of qualitative and quantitative (Ruspini 1999). Longitudinal surveys usually combine both extensive and intensive approaches (Davies and Dale 1994). Life history surveys facilitate the construction of individual trajectories since they collect continuous information throughout the life course. Panel data trace individuals and households through historical time: information is gathered about them at regular intervals. Moreover, they often include relevant retrospective information, so that the respondents have continuous records in key fields from the beginning of their lives. As an example, the British Household Panel Study has taken the opportunity (over the first three waves) to get a very good picture of respondents’ lives by asking for life-time retrospective work-histories, and marital and fertility histories, hence investigating both illuminating and vital areas of the lives of those who make up a representative sample of the households of Britain.

Taking the German Socio-Economic Panel as another example, two calendars are included in the core questionnaires: an activity calendar that, on a monthly basis, records participation in schooling, vocational education, military service, full-time and part-time employment, unemployment, homemaking and retirement for the previous year; and an income calendar where respondents indicate, also on a monthly basis, whether they have received income from various sources in the past year and the average monthly amount received from each source (Burkhauser 1991).

Thus, longitudinal analysis presupposes the development of a methodological mix where neither of the two aspects alone is sufficient to produce an accurate picture of social dynamics (Mingione 1999).

However, although dynamic data have the potential to provide richer information about individual behaviour, their use poses theoretical and methodological problems. In addition, longitudinal research typically costs more and can be very time-consuming. The principal limitations of the repeated cross-sectional design are its inappropriateness for studying developmental patterns within cohorts and its inability to resolve issues of causal order. Both of these limitations result directly from the fact that in a repeated cross-sectional design, the same cases are neither measured repeatedly nor for multiple periods (Menard 1991). Thus, more data are required to characterise empirically the dynamic process that lies behind the cross-sectional snapshot (Davies 1994).

Concerning panel data, the main operational problems with prospective studies (other than linked panels) (Magnusson and Bergmann 1990; Menard 1991; Duncan 1992; Rose 1993; Blossfeld and Rohwer 1995) are:

Panel attrition

If the same set of cases is used in each period, there may be some variation from one period to another as a result of missing data (due to refusals, changes of residence or death of the respondent). Such systematic differences between waves cause biased estimates. For example, a major problem in most surveys on poverty is the under-sampling of poor people: they are hard to contact (and therefore usually undersampled in the first wave of data) and hard to retain for successive annual interviews. Even though weight variables could be used to mitigate under-representation, it is difficult to assess the real efficiency of such weights.

Course of events

Since there is only information on the states of the units at predetermined survey points (discrete time points), the course of the events between the discrete points in time remains unknown.

Panel conditioning

Precisely because in a panel survey the same subjects are repeatedly interviewed, it is possible that responses given in one wave will be influenced by those given in the previous waves (Trivellato 1999). Unwillingness to participate in the study may also result from continued study and may result in attrition. Yet another possibility is that respondents will change as a result of participation in the survey (Menard 1991).

Consequently, the potential of panel data can only be fully realised if such data meet high quality standards (Duncan 1992; Ghellini and Trivellato 1996). In particular, Trivellato (1999) stated that for a panel survey to be successful, the key ingredients are a good initial sample and appropriate following rules, that is, a set of rules that permit mimicking the population, which almost always changes in composition over time. Taking the BHPS as an example, because the BHPS is a household panel study which tracks household formation and dissolution, individuals may join and leave the sample. Thus, the study has a number of following rules determining who is eligible to be interviewed at each wave. New eligibility for sample inclusion could occur between waves in the following ways: 1) a baby is born to an Original Sample Member (OSM); 2) an OSM moves into a household with one or more new people; 3) one or more new people move in with an OSM (Freed Taylor et al. 1995).

The drawback of linked panels is that they can only provide a very limited range of information and often on a highly discontinuous temporal basis (as in the case of a Census). Moreover, such panels suffer from problems of confidentiality and of data protection legislation, so there is often only very limited access (Buck et al. 1994). Even if retrospective studies have the advantage of usually being cheaper to collect than panel data, they suffer from several limitations that are increasingly being acknowledged (Davies and Dale 1994; Blossfeld and Rohwer 1995):

  1. recall bias: retrospective questions concerning motivational, attitudinal, cognitive or affective states are particularly problematic because respondents find it hard to accurately recall the timing of changes in these states;
  2. there is a limit to respondents’ tolerance for the amount of data that can be collected on one occasion;
  3. retrospective studies must be based on survivors. Those subjects who have died or migrated will, necessarily, be omitted and biases may arise; retrospective studies can also misrepresent specific populations.

Conclusion

The use of longitudinal data (both prospective and retrospective) can ensure a more complete approach to empirical research. Longitudinal data are collected in a time sequence that clarifies the direction as well as the magnitude of change among variables. However, the world of longitudinal research is quite heterogeneous. Some important general suggestions are (Menard 1991):

  1. If the measurement of change is not a concern, if causal and temporal order are known, or if there is no concern with causal relationships, then cross-sectional data and analysis may be sufficient. Repeated cross-sectional designs may be appropriate if a problem of panel conditioning as a result of repeated interviewing or observation in a prospective panel is anticipated.
  2. If change is to be measured over a long span of time, then a prospective panel design is the most appropriate, because independent samples may differ from one another unless both formal and informal procedures for sampling and data collection are rigidly replicated for each wave of data. Within this context, it is important to remember that a period of time needs to occur before it is feasible to do an analysis of social change: a consistent number of waves is necessary to permit in-depth long term analyses to be carried out.
  3. If change is to be measured over a relatively short time (weeks or months), then a retrospective design may be appropriate for data on events or behaviour, but probably not for attitudes or beliefs.
  4. In order to combine the strengths of panel designs and the virtues of retrospective studies, a mixed design employing a follow-up and a follow-back strategy seems appropriate (Blossfeld and Rohwer 1995).

Finally, due to the complexity of longitudinal data sets, the user documentation is crucial for the researcher. It should contain essential information required for the analysis of the data (including details of fieldwork, sampling, weighting and imputation procedures) and information to assist users in linking and aggregating data across waves. The documentation should both make the analysis easier and more straightforward and help evaluate data quality.

____________________________________________________________________________________________________________________________________________

http://www.stat.missouri.edu/longitudinal.html

Longitudinal Data Analysis
June 10-14, 1997

Department of Statistics, University of Missouri-Columbia

After the close of the conference, Nan Laird and David Giltinan were kind enough to make a large number of data sets and related material available to participants and to others interested in longitudinal data analysis. Thanks to both of them for the material below. Please send comments or questions to speckman@stat.missouri.edu.

  • Datasets and SAS examples from Nan Laird (See the README file for the contents.)
  • Material from David Giltinan
    • Slides from lectures on Bayesian modeling and analysis. Postscript file or compressed postscript file.
    • "Extending the Linear Mixed Effects Model." Slides from a short course by Marie Davidian and David Giltinan. Postscript file or compressed postscript file.
    • Worked examples from "Extending the Linear Mixed Effects Model" by Davidian and Giltinan. Postscript file or compressed postscript file describing the examples. The data set file.
    • "Sensitivity analysis in fitting mixed effects models." Slides from a talk by David Giltinan. Postscript file or compressed postscript file.

 

____________________________________________________________________________________________________________________________________________