Imputing scales using parcels of items as auxiliary variables
Posted on October 14, 2015
Multiple imputation can be challenging when the variables are scales built from several items. Fortunately, imputing every single item is not the only way to solve this problem. There are some practical and theoretically attractive alternatives! In this post, I show a simple implementation of what Enders (2010) calls duplicated-scale imputation. The specific method I show here was proposed by Eekhout et al. (2015). Thanks to Iris Eekhout for replying to my e-mails!
For each scale, I define a number (or proportion) of items (let’s say p) to create parcels (i.e., averages of items). These parcels are used as auxiliary variables to impute the original scales. There are different ways to define parcels. I implemented one solution: see the function rowscore, available in my R package sdazar.
The function rowscore selects the p items with the least missing data. For each case (row), it computes the parcel using the available information in the selected items. If only one item has information, only that one is used. If more than one item has valid data, it averages the available items. If none of the selected items is available, it picks p items from the remaining items to impute the original scale. In this particular example, I created parcels using half of the items.
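As a rough illustration of the selection logic described above, here is a minimal sketch in plain Python. It is not the rowscore code from sdazar: the function name `parcel_scores`, the toy data, and the tie-breaking rule are all hypothetical. The fallback step is one possible reading of "pick p items from the rest" (it assumes the next p items with the least missing data are tried).

```python
from statistics import mean


def parcel_scores(rows, p):
    """Return one parcel score per row (None where no item is usable).

    rows: list of equal-length lists; None marks a missing item response.
    p: number of items used to build the parcel.
    """
    n_items = len(rows[0])
    # Count missing responses per item (column).
    n_missing = [sum(row[j] is None for row in rows) for j in range(n_items)]
    # Order items from least to most missing; ties keep the original item order.
    order = sorted(range(n_items), key=lambda j: n_missing[j])
    selected, rest = order[:p], order[p:]

    scores = []
    for row in rows:
        # Average whatever is observed among the selected items.
        values = [row[j] for j in selected if row[j] is not None]
        if not values:
            # Assumed fallback: try up to p of the remaining items.
            values = [row[j] for j in rest[:p] if row[j] is not None]
        scores.append(mean(values) if values else None)
    return scores


# Toy data: 4 cases, 4 items, parcels built from half of the items (p = 2).
items = [
    [1, 2, None, 4],
    [None, 3, None, None],
    [None, None, 2, None],
    [None, None, None, None],
]
parcels = parcel_scores(items, p=2)
```

In this toy data the third case has no valid response on the two selected items, so its parcel comes from the fallback items, and the fourth case stays missing because every item is missing.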
The idea of using a proportion of the original items is to include as much information as possible while preventing strong linear dependencies between the variables. Ideally, after this procedure, parcels should be complete. However, because in some cases all the items are missing, parcels can still have missing records (although fewer than the original scales).
Why not just use the average of the available items? That solution would implicitly assume that the items correlate perfectly with the scale. We know that is not a good assumption. That is why, after all, we worry about creating scales. Using parcels takes advantage of the available information (items with complete information) and the relationship between a portion of the items and the scale.
Here I show a simple example using the National Longitudinal Study of Adolescent to Adult Health (Add Health). First, let’s look at some descriptives of the variables included in the imputation. I am using information from Waves 1 and 2. The key scales/scores are depression (19 items) and GPA (4 items). Variables ending with .p are parcels built from half of the items of the original scale.
As expected, the correlation between the scales and their parcels is high. The GPA variables have most of the problems. Note that the .p parcels still have missing records, although far fewer than the original scales.
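The two checks mentioned above (scale-parcel correlation and share of missing records) can be sketched as follows in plain Python. The helper names and the values are made up for illustration and are not Add Health data. The correlation uses only cases where both the scale and the parcel are observed.

```python
from math import sqrt


def share_missing(x):
    """Proportion of missing (None) records in a variable."""
    return sum(v is None for v in x) / len(x)


def pairwise_correlation(x, y):
    """Pearson correlation over cases observed on both variables."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy)


# Made-up scale and parcel scores for five cases.
scale = [1.0, None, 3.0, None, 5.0]
parcel = [1.5, 2.0, 2.5, None, 5.5]

r = pairwise_correlation(scale, parcel)
miss_scale = share_missing(scale)
miss_parcel = share_missing(parcel)
```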
I use the R package mice to impute the data.
I adjusted the predictor matrix to avoid feedback during the imputation (circularity between variables). The main adjustment is to use only complete variables when imputing parcels.
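To make the main adjustment concrete, here is a toy sketch in plain Python of a 0/1 predictor matrix in which parcels are predicted only by complete variables. It is not the matrix used in this analysis and does not call mice. The function name, the variable names, and the choice of which variables are complete are hypothetical. Rows are the variables to be imputed and columns are the candidate predictors.

```python
def adjusted_predictor_matrix(variables, parcels, complete):
    """Build a 0/1 predictor matrix as a nested dict: matrix[target][predictor].

    variables: all variable names in the imputation model.
    parcels: names of the parcel variables.
    complete: names of variables with no missing values.
    """
    matrix = {}
    for target in variables:
        row = {}
        for predictor in variables:
            if predictor == target:
                use = 0  # a variable never predicts itself
            elif target in parcels:
                # Parcels are imputed from complete variables only.
                use = int(predictor in complete)
            else:
                use = 1  # other targets may use every other variable
            row[predictor] = use
        matrix[target] = row
    return matrix


# Hypothetical variable names; "age" and "female" are assumed to be complete.
variables = ["age", "female", "depression", "depression.p", "gpa", "gpa.p"]
pred = adjusted_predictor_matrix(
    variables,
    parcels={"depression.p", "gpa.p"},
    complete={"age", "female"},
)
```

With this layout, a scale can still borrow information from its parcel, but the parcel is never imputed from the scale, which removes the loop between the two.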
Here is the adjusted predictor matrix:
Let’s impute the data!
Some plots to explore how the imputation went.
I don’t see any problematic patterns. It looks as if I got a proper solution. The distributions of the variables also look right.
Last Update: 06/02/2017
Enders, Craig K. 2010. Applied Missing Data Analysis. The Guilford Press.
Eekhout, Iris, Craig K. Enders, Jos W. R. Twisk, Michiel R. de Boer, Henrica C. W. de Vet, and Martijn W. Heymans. 2015. “Analyzing Incomplete Item Scores in Longitudinal Data by Including Item Score Information as Auxiliary Variables.” Structural Equation Modeling: A Multidisciplinary Journal 22