Raking weights with R
When I was working with public opinion surveys, I usually had to adjust the data according to population parameters such as sex, age, socioeconomic status, or region. There are several ways to do this. One of them is raking. Using the package anesrake by Josh Pasek is easy to compute raking weights in R. I show a brief and straightforward raking example in this post.
What is raking?
It is common to find differences between the population and sample distribution in some key demographic variables when analyzing surveys. These differences might be due to the sampling design, non-coverage issues or non-response. To avoid biases of point estimates, we generally adjust those differences using weights so that the marginal totals agree (at least in part) with the population. These adjustments can be done using different methods: post-stratification or cell-weighting, raking, general regression estimator GREG (Battaglia et al. 2004, Valliant et al. 2013). Here I will show an example for raking adjustment. Raking has some advantages over cell-weighting:
- We have to know only the population totals of the specific variables, not all cells of the cross-classification.
- It is more flexible feasible when we use several variables at once.
It is important to note that raking adjustments are useful when we want to estimate a proportion or mean from a population, but they are not necessarily the best when modeling (see Gelman 2007). In that case, model specification (i.e., inclusion of all the key variables and interactions) may be a better solution. It is important to note that the raking procedure may affect the dependence structure of variables, what eventually may bias model estimates.
Some rules of thumb
DeBell and Krosnick (2009) provide useful recommendations regarding variable selection when implementing raking:
- Identify a set of variable likely to be measured with little error and a low item nonresponse rate (less than 5%), and compare them with reliable data sources (e.g., Census). Some typical variables are age groups, gender, race, SES, census region.
- Implement weighting when substantive demography discrepancies are observed. For instance, differences higher than five percentage points. When differences are less than two percentage points this adjustment would not be necessary.
- When selected variables have categories with less than 5% in the sample, it would be recommended to collapse them.
Remember, however, that the ultimate goal of raking is to calibrate and improve the efficiency of our estimates (reduce standard errors). In general, there is trade-off between these two objectives. Anyway, weight adjustments should be variable-dependent: they should explore how different key variables of our survey behave after adjusting. In this post I only show how to implement raking. A more complete analysis will be surely necessary.
Once the raking procedure is implemented, it is a good idea to follow these steps in an iterative fashion:
- Identify extreme weights and truncate them. For example, those greater than five times the mean of weights.
- Truncate large values, but not small ones (i.e., those near zero). While large values increase the effect of outliers and are more likely to inflate variance, low values do not.
- Examine the design effect using the new weights. If the new design effect is greater than the old design effect by more than 0.5, it is a good idea to review the raking adjustment. In other words, calibration may cost too much in terms of efficiency (standard errors).
Three things can be done to improve the weighting procedure:
- Collapse categories of the variables
- Eliminate or replace weighting variables
- Increase the weight truncation criterion ( > 5 )
Raking using R
In this example I use data from Chile to estimate the presidential approval (Opinion Public Survey CEP, July-August 2012). Below, you can see the characteristics of the dataset:
dat <- read.csv("cep.csv")
str(dat)
## 'data.frame': 1512 obs. of 12 variables:
## $ fnu : int 1580 1579 890 871 1031 104 1103 263 924 105 ...
## $ area : int 1 1 2 1 1 1 1 1 1 1 ...
## $ region : int 13 13 9 9 10 2 13 5 9 2 ...
## $ comuna : int 13123 13123 9110 9102 10202 2201 13101 5106 9201 2201 ...
## $ sex : int 1 1 2 1 1 1 2 1 2 1 ...
## $ age : int 30 60 28 72 55 35 33 25 73 43 ...
## $ ses : int 2 2 3 3 4 3 2 3 4 2 ...
## $ phone : int 1 1 9 2 2 2 2 1 2 1 ...
## $ edu : int 4 4 4 1 2 3 4 4 2 4 ...
## $ pond : num 1.977 1.243 0.514 0.421 0.526 ...
## $ agecat : int 2 5 2 5 5 3 2 2 5 3 ...
## $ approval: int 2 1 2 1 1 2 2 2 2 2 ...
I use five variables to define raking weights: sex
, agecat
, ses
, region
, and area
. I get relative frequencies of these variables using the package weights
. Most of the variables have more than five percent in each category, with the exception of region
and ses
. For illustrative purposes, I will use those variables without collapsing their categories.
library(weights)
# sex: 1 male, 2 female
wpct(dat$sex)
## 1 2
## 0.407 0.593
# agecat: 1 18-24, 2 25-34, 3 35-44, 4 45-54, 5 55+
wpct(dat$agecat)
## 1 2 3 4 5
## 0.124 0.159 0.177 0.194 0.346
# ses: 1 abc1, 2 c2, 3 c3, 4 d, 5 e
wpct(dat$ses)
## 1 2 3 4 5
## 0.0390 0.1078 0.3651 0.4484 0.0397
# region: 1 to 15
wpct(dat$region)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 0.01323 0.04233 0.01389 0.04167 0.09921 0.05489 0.06151 0.13095 0.06349 0.04894 0.00397 0.01058 0.37632 0.02579 0.01323
# area: 1 urban, 2 rural
wpct(dat$area)
## 1 2
## 0.837 0.163
The next step is to specify the population distribution of the selected variables in a target list. I use two sources to obtain population values: the Chilean Census 2002, and the Bicentenario Survey 2009. It is hard to find reliable SES population parameters. The Bicentenario data provide, at least, an approximation.
# chilean census 2002
sex <- c(.49,.51)
agecat <- c(.163, .203, .195, .187, .253)
region <- c(0.015, 0.031, 0.016, 0.039, 0.102, 0.051, 0.059, 0.123,
0.056, 0.046, 0.006, 0.010, 0.408, 0.023 , 0.013)
area <- c(.869, .131)
# bicentenario survey 2009
ses <- c(.109, .184, .261, .364, .083)
# definitions of target list
targets <- list(sex, agecat,ses, region, area)
# important: to use the same variable names of the dataset
names(targets) <- c("sex", "agecat", "ses", "region", "area")
# id variable
dat$caseid <- 1:length(dat$sex)
The output shows that some distributions do not sum one. This is because of rounding. We can force any proportion distribution to sum one using force1 = TRUE
. The differences between the population and the sample distribution are all greater than five percentage points. This meets the variable selection criterion discussed above.
sum(region)
## [1] 0.998
library(anesrake)
anesrakefinder(targets, dat, choosemethod = "total")
## sex agecat ses region area
## 0.1652 0.2017 0.3780 0.0828 0.0634
I apply the anesrake function as follows:
- The maximum weight value is five, weights greater than five will be truncated (
cap = 5
). - The total differences between population and sample have to be greater than 0.05 so that to include a variable (
pctlim = .05
). - The maximum number of variables included in the raking procedure is five (
nlim = 5
).
outsave <- anesrake(targets, dat, caseid = dat$caseid,
verbose= FALSE, cap = 5, choosemethod = "total",
type = "pctlim", pctlim = .05 , nlim = 5,
iterate = TRUE , force1 = TRUE)
summary(outsave)
## $convergence
## [1] "Complete convergence was achieved after 48 iterations"
##
## $raking.variables
## [1] "sex" "agecat" "ses" "region" "area"
##
## $weight.summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.33 0.62 0.80 1.00 1.10 4.30
##
## $selection.method
## [1] "variable selection conducted using _pctlim_ - discrepancies selected using _total_."
##
## $general.design.effect
## [1] 1.38
##
## $sex
## Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.49 616 0.407 741 0.49 0.0826 0 0.0826
## 2 0.51 896 0.593 771 0.51 -0.0826 0 -0.0826
##
## $agecat
## Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.163 187 0.124 246 0.163 0.03916 2.78e-17 0.03916
## 2 0.203 240 0.159 307 0.203 0.04407 -2.78e-17 0.04407
## 3 0.195 268 0.177 295 0.195 0.01756 0.00e+00 0.01756
## 4 0.187 294 0.194 282 0.187 -0.00763 -2.78e-17 -0.00763
## 5 0.253 523 0.346 382 0.253 -0.09315 0.00e+00 -0.09315
##
## $ses
## Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.1089 59 0.0390 165 0.1089 0.0699 1.39e-17 0.0699
## 2 0.1838 163 0.1078 278 0.1838 0.0760 2.78e-17 0.0760
## 3 0.2607 552 0.3651 394 0.2607 -0.1043 5.55e-17 -0.1043
## 4 0.3636 678 0.4484 550 0.3636 -0.0848 5.55e-17 -0.0848
## 5 0.0829 60 0.0397 125 0.0829 0.0432 1.39e-17 0.0432
##
## $region
## Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.01503 20 0.01323 22.73 0.01503 0.001803 0.00e+00 0.001803
## 2 0.03106 64 0.04233 46.97 0.03106 -0.011266 0.00e+00 -0.011266
## 3 0.01603 21 0.01389 24.24 0.01603 0.002143 0.00e+00 0.002143
## 4 0.03908 63 0.04167 59.09 0.03908 -0.002589 0.00e+00 -0.002589
## 5 0.10220 150 0.09921 154.53 0.10220 0.002998 -2.78e-17 0.002998
## 6 0.05110 83 0.05489 77.27 0.05110 -0.003792 -6.94e-18 -0.003792
## 7 0.05912 93 0.06151 89.39 0.05912 -0.002390 0.00e+00 -0.002390
## 8 0.12325 198 0.13095 186.35 0.12325 -0.007706 0.00e+00 -0.007706
## 9 0.05611 96 0.06349 84.84 0.05611 -0.007380 0.00e+00 -0.007380
## 10 0.04609 74 0.04894 69.69 0.04609 -0.002850 0.00e+00 -0.002850
## 11 0.00601 6 0.00397 9.09 0.00601 0.002044 0.00e+00 0.002044
## 12 0.01002 16 0.01058 15.15 0.01002 -0.000562 0.00e+00 -0.000562
## 13 0.40882 569 0.37632 618.13 0.40882 0.032495 -5.55e-17 0.032495
## 14 0.02305 39 0.02579 34.85 0.02305 -0.002748 0.00e+00 -0.002748
## 15 0.01303 20 0.01323 19.70 0.01303 -0.000201 -3.47e-18 -0.000201
##
## $area
## Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.869 1266 0.837 1314 0.869 0.0317 1.11e-16 0.0317
## 2 0.131 246 0.163 198 0.131 -0.0317 0.00e+00 -0.0317
The procedure reaches good convergence, weights lower than 5, and very small differences between the sample and population distribution. In addition, the design effect is 1.38. This gives us an idea of the weighting loss due to the raking procedure. The weighting loss (\(L_w\)) is the inflation in the variance of sample estimates that can be attributed to weighting (Heeringa et. al. 2010). A simple approximation is:
\[L_w(\bar{y}) \approx \frac{\sigma^{2}(w)}{\bar{w}^2} = \Bigg( \frac{\sum\limits_{i=1}^n w_i^2}{\big(\sum\limits_{i=1}^n w_i\big)^2} \cdot n \Bigg) - 1\]This simple formula was introduced by Kish (1965) in the context of proportionated stratified sampling. It represents the proportional increase in the variances of means and proportions, ignoring any clustering that may be included in the sample plan. It assumes that:
- Proportionate allocation is the optimal stratified design, i.e., variances of \(y\) are approximately equal in all strata.
- Weights are uncorrelated with the values of the random variable \(y\).
Thus, values from this formula only represents an approximation to weighting loss provided that model assumptions are met. As can be seen in the anesrake
output, the design effect is 1.38. That value is equal to \(1 + L_w\).
# add weights to the dataset
dat$weightvec <- unlist(outsave[1])
n <- length(dat$region)
# weighting loss
((sum(dat$weightvec ^ 2) / (sum(dat$weightvec)) ^ 2) * n) - 1
## [1] 0.38
For a more precise estimation of the design effect it would be necessary to use a package such as survey, specifying the characteristics of the sample design. Unfortunately, I do not have enough information to specify the sample design and to estimate the design effect of the presidential approval. There are also no cluster variables or replicate weights in the CEP dataset.
Anyway, the weighting loss approximation denotes an increase in the design effect lower than .5, what meets the recommendations mentioned above. However, it is important to keep in mind that this weighting procedure does not take into account any clustering effect associated with the sample design.
Finally, we can estimate the presidential approval with and without weights:
unweighted <- wpct(dat$approval)
weighted <- wpct(dat$approval, dat$weightvec)
tab <- data.frame(unweighted, weighted)
rownames(tab) <- c("approve", "disapprove", "unsure", "dk")
tab
## unweighted weighted
## approve 0.2864 0.2983
## disapprove 0.5119 0.5206
## unsure 0.1779 0.1609
## dk 0.0238 0.0201
The difference between the weighted and unweighted presidential approval is not big, only 0.041 percentage points.
Raking using given survey weights
The CEP dataset includes its own weights (variable named pond
). Unfortunately, there is no detailed information about the weighting construction in the methodological documentation of this survey (“Survey weighting is a mess”, Gelman 2007). For illustrative purposes, however, I will show how to estimate raking weights using previous weights.
Exploring the pond
variable, we can see that there are some big weights (e.g., 17.6). This makes me think that they are just scaled design weights, probably with some non-response adjustment, but I am not really sure.
summary(dat$pond)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02 0.46 0.79 1.00 1.24 17.60
The weighting loss associated with the pond weights is relatively high, denoting a design effect of 2.09.
# weighting loss
((sum(dat$pond ^ 2) / (sum(dat$pond)) ^ 2) * n) - 1
## [1] 1.09
We can explore the total differences between the sample and population distributions using pond
weights. Only ses and region meet the criterion of total differences greater than 5 percentage points. However, I will keep all the variables used in the previous section and specify that total differences between the population and sample should be greater than .05 (pctlim = .05
).
anesrakefinder(targets, dat, weightvec = dat$pond, choosemethod = "total")
## sex agecat ses region area
## 0.000405 0.014836 0.317979 0.169910 0.001149
In order to get raking weights using previous weights, I only have to add weightvec = dat$pond
to the anesrake
command.
outsave <- anesrake(targets, dat, caseid = dat$caseid,
weightvec = dat$pond, verbose = FALSE, cap = 5,
choosemethod = "total", type = "pctlim",
pctlim = .05 , nlim = 5, iterate = TRUE, force1 = TRUE)
summary(outsave)
## $convergence
## [1] "Complete convergence was achieved after 50 iterations"
##
## $raking.variables
## [1] "ses" "region"
##
## $weight.summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 0.41 0.71 1.00 1.23 5.00
##
## $selection.method
## [1] "variable selection conducted using _pctlim_ - discrepancies selected using _total_."
##
## $general.design.effect
## [1] 1.91
##
## $sex
## Target Old Weights N Old Weights % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.49 741 0.49 763 0.504 0.0147 -0.0145 0.000202
## 2 0.51 771 0.51 749 0.496 -0.0147 0.0145 -0.000202
##
## $agecat
## Target Old Weights N Old Weights % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.163 242 0.160 230 0.152 -0.00768 0.01049 0.002815
## 2 0.203 309 0.204 314 0.208 0.00388 -0.00521 -0.001321
## 3 0.195 289 0.191 297 0.197 0.00565 -0.00193 0.003715
## 4 0.187 281 0.186 267 0.176 -0.00956 0.01041 0.000843
## 5 0.253 391 0.259 403 0.267 0.00771 -0.01376 -0.006053
##
## $ses
## Target Old Weights N Old Weights % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.1089 77.1 0.0510 165 0.1089 0.0579 1.39e-17 0.0579
## 2 0.1838 209.7 0.1387 278 0.1838 0.0451 2.78e-17 0.0451
## 3 0.2607 588.5 0.3892 394 0.2607 -0.1284 5.55e-17 -0.1284
## 4 0.3636 596.3 0.3943 550 0.3636 -0.0307 5.55e-17 -0.0307
## 5 0.0829 40.5 0.0268 125 0.0829 0.0561 1.39e-17 0.0561
##
## $region
## Target Old Weights N Old Weights % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.01503 31.70 0.02097 22.73 0.01503 -0.00594 0.00e+00 -0.00594
## 2 0.03106 99.85 0.06604 46.97 0.03106 -0.03497 3.47e-18 -0.03497
## 3 0.01603 19.56 0.01293 24.24 0.01603 0.00310 0.00e+00 0.00310
## 4 0.03908 72.79 0.04814 59.09 0.03908 -0.00906 6.94e-18 -0.00906
## 5 0.10220 206.06 0.13627 154.53 0.10220 -0.03407 0.00e+00 -0.03407
## 6 0.05110 69.77 0.04614 77.27 0.05110 0.00496 6.94e-18 0.00496
## 7 0.05912 74.20 0.04907 89.39 0.05912 0.01005 0.00e+00 0.01005
## 8 0.12325 163.69 0.10826 186.35 0.12325 0.01499 0.00e+00 0.01499
## 9 0.05611 79.51 0.05258 84.84 0.05611 0.00353 -6.94e-18 0.00353
## 10 0.04609 67.87 0.04489 69.69 0.04609 0.00121 0.00e+00 0.00121
## 11 0.00601 5.33 0.00352 9.09 0.00601 0.00249 8.67e-19 0.00249
## 12 0.01002 11.90 0.00787 15.15 0.01002 0.00215 0.00e+00 0.00215
## 13 0.40882 559.46 0.36999 618.13 0.40882 0.03883 0.00e+00 0.03883
## 14 0.02305 28.42 0.01879 34.85 0.02305 0.00425 0.00e+00 0.00425
## 15 0.01303 21.99 0.01454 19.70 0.01303 -0.00151 0.00e+00 -0.00151
##
## $area
## Target Old Weights N Old Weights % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.869 1315 0.87 1296 0.857 -0.0124 0.0119 -0.000575
## 2 0.131 197 0.13 216 0.143 0.0124 -0.0119 0.000575
Only ses and region are used in the raking procedure. Some weights were truncated to 5, and the design effect decreased from 2.09 to 1.91. The differences between the sample and population distribution are not big.
# add weights to the dataset
dat$weightvec2 <- unlist(outsave[1])
pond <- wpct(dat$approval, dat$pond)
weighted_pond <- wpct(dat$approval, dat$weightvec2)
tab <- data.frame(unweighted, pond, weighted, weighted_pond)
rownames(tab) <- c("approve", "disapprove", "unsure", "dk")
tab
## unweighted pond weighted weighted_pond
## approve 0.2864 0.2739 0.2983 0.2945
## disapprove 0.5119 0.5231 0.5206 0.5310
## unsure 0.1779 0.1758 0.1609 0.1520
## dk 0.0238 0.0272 0.0201 0.0224
There are no great differences between the estimates weighted
and weighted_pond
(0.025). There are greater discrepancies between pond
and weighted
estimates (0.049). This reveals the importance of the socioeconomic status in the presidential approval.
Last Update: 7/30/15
References
Battaglia, M., D. Izrael, D. C. Hoaglin, and M. Frankel. 2004. “Tips and Tricks for Raking Survey Data (Aka Sample Balancing).” American Association of Public Opinion Research.
DeBell, Matthew and Jon A. Krosnick. 2009. “Computing Weights for American National Election Study Survey Data.” Stanford University.
Gelman, Andrew. 2007. “Struggles with Survey Weighting and Regression Modeling.” Statistical Science 22(2):153-64.
Heeringa, Steven, Brady T. West, and Patricia A. Berglund. 2010. Applied Survey Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.
Valliant, Richard, Jill A. Dever, and Frauke Kreuter. 2013. Practical Tools for Designing and Weighting Survey Samples. New York, NY: Springer New York.