Interdisciplinary Seminar: Joseph Schafer, Census Bureau

Modeling Coarsened Categorical Variables: Techniques and Software

Presented by Joseph Schafer, United States Census Bureau

Friday, March 24
11:00 a.m. ET
Gentry 144

Coarsened data can express intermediate states of knowledge between fully observed and fully missing. For example, when classifying survey respondents by cigarette smoking behavior as 1=never smoked, 2=former smoker, or 3=current smoker, we may encounter some who reported having smoked in the past but whose current activity is unknown (either 2 or 3, but not 1). Software for categorical data modeling typically provides codes for missing values but lacks convenient ways to convey states of partial knowledge. A new R package, cvam (Coarsened Variable Modeling), extends R's implementation of categorical variables (factors) and fits log-linear and latent-class models to incomplete datasets containing coarsened and missing values. Methods include maximum likelihood estimation using an expectation-maximization algorithm, approximate Bayesian inference, and Bayesian inference via Markov chain Monte Carlo. Functions are also provided for comparing models, predicting missing values, creating multiple imputations, and generating partially or fully synthetic data. In the first major application of this software, data from the U.S. Decennial Census and administrative records were combined to predict citizenship status for 309 million residents of the United States.
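
To make the smoking example concrete, the following is a minimal sketch in Python of how an expectation-maximization algorithm can use coarsened responses when estimating category probabilities. It is not the cvam package or its interface (cvam is an R package and fits richer log-linear and latent-class models). The counts are invented for illustration, the function name em_multinomial is hypothetical, and the sketch assumes the coarsening mechanism can be ignored, so that a response known only to be "former or current" is allocated in proportion to the current estimates for those two categories.

```python
# Illustrative sketch only: the counts are made up, and this is not cvam's API.
CATEGORIES = ("never", "former", "current")

# Each key is the set of categories a response could belong to; each value is a count.
observed = {
    frozenset({"never"}): 50,              # fully observed
    frozenset({"former"}): 30,             # fully observed
    frozenset({"current"}): 20,            # fully observed
    frozenset({"former", "current"}): 20,  # coarsened: smoked before, current status unknown
    frozenset(CATEGORIES): 10,             # fully missing: could be any category
}


def em_multinomial(observed, categories, n_iter=200):
    """Estimate category probabilities from fully observed, coarsened,
    and missing responses, assuming ignorable coarsening (hypothetical helper)."""
    probs = {c: 1.0 / len(categories) for c in categories}  # uniform starting values
    total = sum(observed.values())
    for _ in range(n_iter):
        # E-step: split each count across the categories it could belong to,
        # in proportion to the current probability estimates.
        expected = {c: 0.0 for c in categories}
        for possible, count in observed.items():
            mass = sum(probs[c] for c in possible)
            for c in possible:
                expected[c] += count * probs[c] / mass
        # M-step: re-estimate probabilities from the expected counts.
        probs = {c: expected[c] / total for c in categories}
    return probs


estimates = em_multinomial(observed, CATEGORIES)
for category in CATEGORIES:
    print(category, round(estimates[category], 4))
```

In this sketch the fully missing responses carry no information about the probabilities, while the coarsened responses shift weight away from "never" and toward the two smoking categories, in the ratio already seen among the fully observed former and current smokers.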

Speaker Bio:

Joseph L. Schafer earned a Ph.D. in Statistics from Harvard University in 1992. For two decades, he served on the faculty of the Department of Statistics at The Pennsylvania State University, and he is now Senior Mathematical Statistician for Analytic Modeling in the Research and Methodology Directorate at the United States Census Bureau. He has authored three books and dozens of articles in statistics, biostatistics, survey methodology, quantitative psychology, prevention research, and other fields. His research interests include missing data, computational statistics, software development, Bayesian analysis, longitudinal data, latent-variable modeling, and causal inference. He currently serves on the Federal Committee for Statistical Methodology.