Multiple imputation of scales built from several items can be challenging. Fortunately, imputing every single item is not the only solution to the missing data problem. Some practical and theoretically attractive alternatives have already been proposed. In this post, I show a simple implementation of what Enders (2010) calls duplicated-scale imputation, a method originally suggested by Eekhout et al. (2015). By the way, thanks to Iris Eekhout for replying to my e-mails!
For each scale, I define a number or proportion of items (let’s say p) to create parcels (i.e., the average of a subset of items rather than of the whole scale). These parcels are then used as auxiliary variables to impute the original scales. There are different ways to define parcels. I implemented one solution in my R package sdazar; see the function rowscore for more details.
The function rowscore selects the p items with the least missing data. For each case (row), it computes the parcel using the available information in the selected items. If only one item has information, only that one is used. If more than one item has valid data, it averages all the selected items with valid data. If none of the initially selected items is available, it picks p items from the remaining unselected items to build the parcel used to impute the original scale. In this particular example, I create parcels using half of the items.
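The selection logic described above can be sketched in base R. This is only an illustration under stated assumptions: `make_parcel` is a hypothetical helper, not the actual code of `rowscore` in sdazar, and the fallback rule (take the unselected items with the least missing data) is my reading of the description.

```r
# Hypothetical helper that mimics the logic described above;
# it is NOT the actual implementation of sdazar's rowscore().
make_parcel <- function(items, p = 0.5) {
  items <- as.matrix(items)
  k <- max(1, round(ncol(items) * p))      # number of items per parcel

  # Rank items by number of missing values (fewest first)
  ord <- order(colSums(is.na(items)))
  selected <- ord[seq_len(k)]
  remaining <- ord[-seq_len(k)]

  # Average the selected items, using whatever is available in each row
  parcel <- rowMeans(items[, selected, drop = FALSE], na.rm = TRUE)

  # Rows where every selected item is missing: fall back to unselected items
  # (assumption: again prefer the items with the least missing data)
  empty <- is.nan(parcel)
  if (any(empty) && length(remaining) > 0) {
    fallback <- remaining[seq_len(min(k, length(remaining)))]
    parcel[empty] <- rowMeans(items[empty, fallback, drop = FALSE], na.rm = TRUE)
  }

  parcel[is.nan(parcel)] <- NA             # every item missing: parcel stays missing
  parcel
}

# Toy items (made-up values, not Add Health data)
items <- data.frame(
  i1 = c(1, 2, NA, NA, 3),
  i2 = c(3, NA, NA, NA, 5),
  i3 = c(2, 4, NA, NA, 1),
  i4 = c(NA, NA, 4, NA, NA)
)
parcel <- make_parcel(items, p = 0.5)
parcel
```

In this toy example, the third row has no valid value in the two selected items, so its parcel comes from the unselected ones; the fourth row has no valid items at all, so its parcel remains missing.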
The reason for using a proportion of the original items is to include as much information as possible while preventing strong linear dependencies between variables. Ideally, parcels are complete (no missing values). However, in some cases all the items are missing, so parcels can still have missing records (although fewer than the original scales).
Why not just use the average of the available items? That solution would implicitly assume that the items correlate perfectly with the scale. We know that’s not a good assumption. That is why we worry about creating scales in the first place, right? Using parcels takes advantage of the available information (items with complete information) and the relationship between a portion of the items and the scale.
Here I show a simple example using the National Longitudinal Study of Adolescent to Adult Health (Add Health). First, let’s look at some descriptives of the variables included in the imputation. I am using information from Waves 1 and 2. The scales/scores I am imputing are depression (19 items) and GPA (4 items). Variables ending with .p are parcels built from half of the items of the original scale.
As expected, the correlation between the scales and their parcels is high. GPA is the most problematic variable. Note that the .p parcels still have missing records, although far fewer than the original scales.
Let’s now impute the scales/scores using the R package mice.
I adjust the predictor matrix to avoid feedback loops during the imputation (circularity between variables). The trick is to use only complete variables when imputing parcels.
Here is the adjusted predictor matrix:
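A minimal sketch of this adjustment, using simulated data rather than Add Health. The variable names (`age`, `female`, `dep`, `dep.p`, `gpa`, `gpa.p`) and all values are assumptions made for illustration. In mice, rows of the predictor matrix are the variables being imputed and columns are their predictors, so the parcel rows are set to use only complete variables.

```r
library(mice)

# Simulated stand-in data (hypothetical names and values)
set.seed(123)
n <- 200
toy <- data.frame(age = rnorm(n, 16, 1.5), female = rbinom(n, 1, 0.5))
toy$dep.p <- 0.3 * toy$female + rnorm(n)               # depression parcel
toy$dep   <- toy$dep.p + rnorm(n, sd = 0.4)            # depression scale
toy$gpa.p <- 3 - 0.1 * toy$dep.p + rnorm(n, sd = 0.5)  # GPA parcel
toy$gpa   <- toy$gpa.p + rnorm(n, sd = 0.3)            # GPA score

# Scales have many missing values; parcels only a few
toy$dep[sample(n, 40)] <- NA
toy$gpa[sample(n, 60)] <- NA
toy$dep.p[sample(which(is.na(toy$dep)), 5)] <- NA
toy$gpa.p[sample(which(is.na(toy$gpa)), 8)] <- NA

# Dry run (no iterations) to obtain the default predictor matrix
ini  <- mice(toy, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix

# Parcels are predicted only by complete variables
complete_vars <- names(toy)[colSums(is.na(toy)) == 0]
parcels <- c("dep.p", "gpa.p")
pred[parcels, ] <- 0
pred[parcels, complete_vars] <- 1

pred
```

The rows for `dep` and `gpa` keep their default predictors, so the scales still borrow information from the parcels, while the parcels never depend on the scales.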
Let’s impute the data!
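As a rough sketch of the imputation call, again on simulated data with hypothetical variable names. The number of imputations (`m = 5`), the number of iterations (`maxit = 10`), and the seed are assumptions, not the settings used in the original analysis. The method is left at the mice default, which is predictive mean matching for numeric variables.

```r
library(mice)

# Simulated stand-in data (hypothetical names and values)
set.seed(123)
n <- 200
toy <- data.frame(age = rnorm(n, 16, 1.5), female = rbinom(n, 1, 0.5))
toy$dep.p <- 0.3 * toy$female + rnorm(n)
toy$dep   <- toy$dep.p + rnorm(n, sd = 0.4)
toy$gpa.p <- 3 - 0.1 * toy$dep.p + rnorm(n, sd = 0.5)
toy$gpa   <- toy$gpa.p + rnorm(n, sd = 0.3)
toy$dep[sample(n, 40)] <- NA
toy$gpa[sample(n, 60)] <- NA
toy$dep.p[sample(which(is.na(toy$dep)), 5)] <- NA
toy$gpa.p[sample(which(is.na(toy$gpa)), 8)] <- NA

# Predictor matrix: parcels use only complete variables as predictors
pred <- mice(toy, maxit = 0, printFlag = FALSE)$predictorMatrix
parcels <- c("dep.p", "gpa.p")
pred[parcels, ] <- 0
pred[parcels, names(toy)[colSums(is.na(toy)) == 0]] <- 1

# Run the imputation with the adjusted predictor matrix
imp <- mice(toy, m = 5, maxit = 10, predictorMatrix = pred,
            seed = 123, printFlag = FALSE)

# Extract the first completed data set
completed <- mice::complete(imp, 1)
summary(completed[, c("dep", "gpa")])
```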
Below are some plots to explore how the imputation went.
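Plots of this kind can be produced with the diagnostic functions that ship with mice. The sketch below uses the `nhanes` example data bundled with the package, not Add Health, so the variable names differ from the ones in this post. The settings `m = 5` and `maxit = 10` are assumptions.

```r
library(mice)

# nhanes is a small example data set included in mice (age, bmi, hyp, chl)
imp <- mice(nhanes, m = 5, maxit = 10, seed = 1, printFlag = FALSE)

# Trace plots: mean and SD of the imputed values across iterations.
# Streams that mix freely without trends suggest no convergence problem.
plot(imp)

# Density of observed versus imputed values for one variable
densityplot(imp, ~ bmi)
```

Trace lines that intermingle, and imputed densities that resemble the observed one, are the patterns to look for.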
I don’t see any problematic pattern. It seems I get a proper solution. The distribution of the variables also looks right.
Last Update: 06/02/2017
Enders, Craig K. 2010. Applied Missing Data Analysis. The Guilford Press.
Eekhout, Iris, Craig K. Enders, Jos W. R. Twisk, Michiel R. de Boer, Henrica C. W. de Vet, and Martijn W. Heymans. 2015. “Analyzing Incomplete Item Scores in Longitudinal Data by Including Item Score Information as Auxiliary Variables.” Structural Equation Modeling: A Multidisciplinary Journal 22 (4):588-602.