Calculating power for complex and uplift models using machine learning can be challenging. I will provide some general guidelines and references at the end of this post.
When conducting a statistical test to estimate an effect, we can have any of the following outcomes:
These errors and success probabilities are associated with standard statistical terms used in making inferences.
By convention, \(\alpha\) is set to 0.05 and power to 0.80. Also, we will generally have a control group (no intervention) and different treatments (variants).
There are some challenges associated with basic power calculations (e.g., web calculators):
When conducting multiple statistical tests or comparisons within a study or analysis, the risk of obtaining false positive results grows. To address this, we need to adjust our \(\alpha\) level based on the number of tests performed. This adjustment increases the sample size required to reach the same power. It is important to consider this issue during experiment design and data analysis, including post hoc analyses.
To gain an understanding of why multiple comparison testing poses a challenge, let’s consider a simple example. Suppose we have a control group and three treatment variants. In this example, we will simulate data where there are no effects or differences between the control and treatment groups. In other words, the null hypothesis (i.e., no effect) is always true. In each iteration, we test the control group against each treatment and count the number of significant results we obtain using a significance level (\(\alpha=0.05\)). It is important to note that we already know there are no differences between the groups.
Probability of at least one Type I error in a batch of 3 tests: 12.6%
After 1000 iterations, the probability of encountering at least one Type I error in a batch of 3 tests is 12.6%. You can also use the formula:
\[1 - (1 - \alpha)^n\]
where \(n\) is the number of comparisons. For this example (\(n = 3\)), the formula gives \(1 - 0.95^3 \approx 14.3\%\). This probability is referred to as the Family-wise Error Rate (FWER): the likelihood of committing at least one Type I error within the entire family (or batch) of comparisons. Both 12.6% and 14.3% are well above the assumed \(\alpha\) of 5% for a single test.
This is the essence of the multiple comparison problem. When conducting multiple tests, even if you maintain an \(\alpha\) level of approximately 0.05 for each individual test, the likelihood of experiencing at least one false discovery among the entire set of tests is significantly higher and increases as the number of tests in the batch grows. Consequently, some of our significant findings may actually be false positives and difficult to replicate.
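To see how quickly the FWER grows, here is a minimal sketch of the formula above. It assumes the tests are independent, which is not quite true when every comparison shares the same control group; that is one plausible reason why a simulated rate can come out below the formula. The function name is only illustrative.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


# FWER for a growing number of comparisons at alpha = 0.05
for n_tests in (1, 3, 5, 10, 20):
    print(n_tests, round(family_wise_error_rate(0.05, n_tests), 3))
```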
The key to adjusting for multiple comparisons is to strike a balance between reducing false positives (incorrectly rejecting the null hypothesis) and maintaining test power (the ability to correctly reject the null hypothesis when it’s false). Some methods, such as Bonferroni or Holm, control the Family-Wise Error Rate (FWER). Others, like Benjamini-Hochberg, control the False Discovery Rate (FDR), which is the expected proportion of Type I errors among all hypotheses declared significant. FDR methods limit the share of false discoveries among significant results rather than trying to avoid any false discovery at all, as FWER methods do. As a result, FDR methods generally offer higher statistical power and are less conservative.
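As an illustration of how these corrections differ, the sketch below computes adjusted p-values for Bonferroni, Holm, and Benjamini-Hochberg in plain Python. The function names and the example p-values are hypothetical, and this is not the code used by the power class; in practice a tested library routine is preferable.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Step-down Holm adjustment (controls the FWER)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvalues):
    """Step-up Benjamini-Hochberg adjustment (controls the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # The largest p-value is left as is, smaller ones are scaled by m / rank
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03]  # hypothetical raw p-values
for method in (bonferroni, holm, benjamini_hochberg):
    print(method.__name__, [round(p, 3) for p in method(pvalues)])
```

For the same raw p-values, the Bonferroni adjustment is never smaller than Holm's, and Holm's is never smaller than Benjamini-Hochberg's, which matches the ordering from most to least conservative.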
A crucial element of experimental design lies in sample allocation. Variations in the distribution of samples among groups can influence the statistical power of our tests. Sometimes, the goal is to limit the potential disruption from an intervention by scaling down the sample size across various variants. Alternatively, we might opt to shrink the control group’s size or allot more units to each variant, contingent on their Minimum Detectable Effect (MDE).
Let’s not forget to consider expected effects. In figuring out the minimum detectable effect (MDE), think about the smallest change that would still make the intervention worthwhile. It’s all about balancing the books: estimate the return on investment (ROI) from the intervention, then determine the tiniest shift in a key metric (like conversion rates) that would still make the effort profitable. The MDE doesn’t have to be identical across all variants; after all, not all interventions or campaigns cost the same.
I’ve developed a set of straightforward simulation methods in Python, primarily to construct and evaluate our experimental data. The choice to use simulation was driven by its flexibility in handling diverse scenarios and metrics. However, it does come with a trade-off, as it demands more computation. We can live with that. Feel free to explore the code for the power class.
These methods are useful, particularly for handling multiple variants, different allocations, or MDE by group. However, it’s important to note that these methods only offer a power estimate based on a given set of parameters. Therefore, we will need to conduct a grid search to evaluate various scenarios and determine the optimal one.
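To make the idea of simulation-based power concrete, here is a minimal sketch in plain Python. It is not the power class itself: the function names, the baseline rate of 0.33, the pooled two-proportion z-test, and the Bonferroni adjustment are all assumptions chosen for illustration. Each simulated experiment draws binary outcomes for every group, tests each variant against the control, and the power is the share of simulations with a significant result.

```python
import math
import random


def two_proportion_pvalue(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x2 / n2 - x1 / n1) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_power(baseline, effects, sample_sizes, nsim=200, alpha=0.05, seed=1):
    """Estimate power of each variant-vs-control test with a Bonferroni adjustment.

    effects are absolute differences from the baseline rate; sample_sizes lists
    the control group first, then one size per variant.
    """
    rng = random.Random(seed)
    rates = [baseline] + [baseline + effect for effect in effects]
    n_tests = len(effects)
    hits = [0] * n_tests
    for _ in range(nsim):
        successes = [
            sum(rng.random() < rate for _ in range(n))
            for rate, n in zip(rates, sample_sizes)
        ]
        for k in range(n_tests):
            pvalue = two_proportion_pvalue(
                successes[0], sample_sizes[0], successes[k + 1], sample_sizes[k + 1]
            )
            if pvalue * n_tests < alpha:  # Bonferroni: compare m * p with alpha
                hits[k] += 1
    return [hit / nsim for hit in hits]


# Hypothetical scenario: baseline 0.33, two variants with a 0.03 effect each
print(simulate_power(0.33, [0.03, 0.03], [3000, 3000, 3000]))
```

Because the estimate is itself a simulation result, it will vary with `nsim` and the seed; more simulations give a more stable number at a higher computational cost.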
Let’s start with a simple example using a proportion as the metric of interest:
comparisons power
0 (0, 1) 0.706
`nsim` represents the number of simulations. Variants are set to 1, so we are only comparing control and treatment: `comparisons = (0,1)`. This example is too simple. Let’s now assume two variants (treatments):
comparisons power
0 (0, 1) 0.531
1 (0, 2) 0.530
2 (1, 2) 0.015
The function calculates the power of group comparisons with a sample size of 3000 for each group. Multiple comparisons result in reduced power. Since the effect of each variant is the same (0.03), comparing variants 1 and 2 is not meaningful here (their true difference is zero). Custom comparisons can be defined using a list of tuples, such as `comparisons=[(0,1), (0,2)]`.
comparisons power
0 (0, 1) 0.558
1 (0, 2) 0.563
Even with only two comparisons (each variant against the control), power is still well below the one-variant example (about 0.56 versus 0.71). I have currently implemented the following p-value corrections:
bonferroni
holm_bonferroni
hochberg
sidak
fdr
You can find an introduction to different methods here. In general, Bonferroni is more conservative, while Holm provides higher power. The `fdr` method controls the false discovery rate.
The function corrects only for the comparisons specified in the `comparisons` parameter. If you conduct additional tests, such as comparing the performance of group A versus group B or examining differences among demographic groups, you need to apply additional multiple comparison corrections in your analysis.
We can introduce greater variability into the experimental design and then plot the expected power for different scenarios. To specify parameters, we’ll use nested lists and the `grid_sim_power` method. However, things can quickly become convoluted.
Let’s consider a more complex example. The effects of the first and second variants relative to the control group vary as follows: `[0.01, 0.03]`, `[0.03, 0.05]`, `[0.03, 0.07]`. The sample sizes are equal for each group, but they increase linearly. We apply the `holm` correction.
As expected, the best scenario occurs when the effect sizes are large enough to be detected. We can also analyze the results of different sample allocations. For example, when we look at the effects `[0.03, 0.05]`, the allocation `[3000, 7000, 7000]` produces fairly good results, although it doesn’t always meet the 0.8 threshold.
We can also use other metrics (e.g., average or counts). For instance, we can design an experiment where the outcome is the number of visits (counts). The simulator will use a Poisson distribution with parameter \(\lambda\) (lambda), the mean number of events. In this example, I set a baseline rate (lambda) of 1.2 visits and a relative increase of 0.05 (\(1.2 \times 1.05 = 1.26\)), with a control group of 3000 users and a treatment group of 5000 users:
comparisons power
0 (0, 1) 0.59
The effect is small, so our test is underpowered (<0.80). We can also use averages (e.g., revenue), but in that case, we need to specify the standard deviation of the groups:
comparisons power
0 (0, 1) 0.6592
When using uplift or mixed models, things become more complicated. As Aleksander Molak put it:
The question of defining a “safe” dataset size for S-Learner and other causal models is difficult to answer. Power calculations for machine learning models are often difficult, if possible at all.
There are some tricks we can apply, though, as Harrell suggests:
If you can afford a pilot study or you have some historical data that represents a problem similar to the one that you’re interested in, you can find a subgroup in your data that is as homogenous as possible. You can estimate the sample size for this group using some of the traditional statistical power tools. Finally, scale your overall sample size so that this subgroup is properly powered relative to the entire sample
There are also some traditional ways to optimize the power of our tests:
For instance, after learning from our uplift models, we can identify the key features associated with users’ responses. We can use those features to design our experiments (blocking). We can also evaluate the results of our uplift models in a new sample, and see if we can replicate the expected uplift.
Here’s a brief and general overview of the key features to keep in mind when designing experiments or improving models. I hope it’s helpful.
This small package extracts notes from a collection and creates a CSV file that can be easily read using Excel. You only need to specify the collection ID. For instance, if the location of my collection is https://www.zotero.org/groups/2406179/csic-echo/collections/M8N2VMAP, the collection ID would be `M8N2VMAP`. We also need the Zotero API credentials.
To create a clean CSV, the notes’ headers need a suitable separator. The default is `#`. In this case, the text between headings mustn’t include `#`. Below is an example:
# Research question
Estimates interaction effects between PGS of obesity and cohorts using HRS.
# Data
HRS
# Methods
Uses a HLM whereby they estimate effects of age and cohorts while making the
intercepts and slopes a function of individual factors.
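To illustrate why the separator matters, here is a minimal sketch of how a note like the one above could be split into columns. This is not the package's implementation; the function `split_note` is hypothetical and only shows the idea that each `#` heading becomes a column and the text below it becomes the cell.

```python
import csv
import io


def split_note(note: str, separator: str = "#") -> dict:
    """Split a note into {heading: text} using lines that start with the separator."""
    sections = {}
    heading = None
    for line in note.splitlines():
        if line.startswith(separator):
            heading = line[len(separator):].strip()
            sections[heading] = []
        elif heading is not None and line.strip():
            sections[heading].append(line.strip())
    # Join wrapped lines so each heading maps to a single cell
    return {name: " ".join(lines) for name, lines in sections.items()}


note = """# Research question
Estimates interaction effects between PGS of obesity and cohorts using HRS.
# Data
HRS
# Methods
Uses a HLM whereby they estimate effects of age and cohorts while making the
intercepts and slopes a function of individual factors."""

row = split_note(note)

# Write one CSV row with a column per heading
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

A `#` inside the body text at the start of a line would be read as a new heading, which is why the text between headings must not include the separator.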
pip install git+https://github.com/sdaza/zotnote.git
You can save your Zotero API credentials in a `config.py` file and load them using `import config`:
library_id = "0000000"
api_key = "key"
library_type = "group"
Let’s try to extract some notes and read them using Pandas:
Notes saved in zotero-notes.csv
The best way to do this in Anylogic would be using a database and then export, read, or connect to the database to process simulation results (although, see the section update below). We can do that easily in Anylogic. Every time an experiment finishes, we can export the data (from a database) to an Excel file manually.
The issue with Excel files is, on the one hand, they are Excel files, and on the other, they are not suitable for big data (more than 1 million rows). We can create a function to save all the simulation tables into an Excel file as our experiment finishes. However, we will still hit the row limit (check the Anylogic file linked below for a function to create Excel files from a database).
Here I follow a different approach by exporting an Anylogic database table to a CSV file within Java. The general setup using Anylogic PLE 8.6 would be:
I define two tables (`data1` and `data2`). Each agent saves its data at a given rate. After 5 years, the simulation will finish.
The tables include a column with the experiment iteration and replicate, in addition to the agent’s index, time, and a random value from a normal distribution.
The key function to export the data to a CSV file is `f_SQLToCSV`. It takes two arguments, a SQL query (`query`) and the path to an output file (`filename`). For instance, we can write:
You can use any query for your data, giving you a lot of flexibility on what to export to a CSV file. The `f_SQLToCSV` method is:
The next step would be to create an experiment and complete the Java actions accordingly. First, we clear our tables.
Then, we collect information on the iteration and replicate of the simulation run:
Finally, at the end of the experiment, we save the data and clear the tables again:
The method `f_exportTables` is just a function that goes through each table and exports it to a CSV file. `v_tables` is a string array with the names of the tables I want to export, `{"data1", "data2"}`:
Remember to import some functions in the imports section:
From there, we can create additional functions to select the tables to be exported. For more details, download the Anylogic File here.
When running several replicates of my simulation, saving the information in a database didn’t work as expected. My simulation just crashed, and I was not able to keep the data. I finally decided to follow my previous approach: create many CSV files – one per iteration and replicate – and read them using an R or Python function. I know you end up with a lot of CSV files, but at least the simulation doesn’t crash, and you can recover the output of your simulation as it goes.
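Reading the resulting files back is straightforward. The sketch below, in Python with only the standard library, collects every CSV file that matches a pattern into a single list of rows and records which file each row came from. The file naming scheme (`data1-<iteration>-<replicate>.csv`) and the column names are assumptions for the demo, not the names used in the Anylogic model.

```python
import csv
import glob
import os
import tempfile


def read_csv_files(pattern):
    """Read all CSV files matching a glob pattern into one list of dict rows."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = os.path.basename(path)
                rows.append(row)
    return rows


# Demo: write two small files (one per replicate), then read them together
with tempfile.TemporaryDirectory() as folder:
    for replicate in (1, 2):
        path = os.path.join(folder, f"data1-1-{replicate}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["iteration", "replicate", "agent", "time", "value"])
            writer.writerow([1, replicate, 0, 1.0, 0.5])
    combined = read_csv_files(os.path.join(folder, "data1-*.csv"))

print(len(combined), combined[0]["source_file"])
```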
Every time I write a paper or report, I need to create descriptive tables using LaTeX. Over and over I create ad hoc tables, and I say to myself: write a general function so you can save time on the next paper! I know there are some solutions out there, but in general, I feel they are not flexible enough.
I introduce a far-from-perfect function to create descriptive tables in LaTeX. The steps and structure are quite simple:
That’s it. You can see the function here. It has some features that might be useful:
Let’s start creating some fake data:
We can define a descriptive function:
Thus, the grouping of rows is defined by the name of each dataset in the list. We can add a note; just remember to add `\usepackage[flushleft]{threeparttable}` to your LaTeX document:
We can also slice the descriptives by group:
It’s just a first version. I will add more features soon.
To read this file, usually with extension `.txt` or `.dat`, I first need to know where each column starts and finishes. What I get from the PDF file is something like this:
The layout is usually a codebook in Word/PDF or just a plain text file. Here, I copy the PDF text and put it in a plain text file. I use a text editor (e.g., Sublime Text) and regular expressions to extract the information I need.
I have to select every row with this pattern: `1-2 2 FIPS State code Numeric`. That is, a number followed by a hyphen (although not always, particularly when the width of the column is one), spaces, another number, spaces, and then any text. I use the following regular expression to get that pattern: `(^[0-9]+).([0-9]+)\s+([0-9])\s+(.+)`. Using the Sublime package Filter Lines I get something like this (you can also just copy the selected lines):
1-2 2 FIPS State code Numeric
3-5 FIPS county code Numeric
6-9 4 Year of death Numeric
11-12 2 Age at death Numeric
13-16 4 ICD code for underlying cause-of-death 3 digits: Numeric
17-19 3 Cause-of-Death Recode Numeric
20-23 4 Number of deaths Numeric
This approach might be particularly useful when you have a long PDF/Word file and you want to extract most of the variables. You would need to adapt the regular expressions I’m using to the particular patterns of your codebook.
To simplify, I format this text as a comma-separated values (CSV) file. Replacing the regular expression `([0-9]+)(-)([0-9]+)(\s)([0-9]+)(\s)(.+)(\s)(Numeric)` with `\1,\3,\5,\7,\9`, I get:
1,2,2,FIPS State code,Numeric
3,5,3,FIPS county code,Numeric
6,9,4,Year of death,Numeric
11,12,2,Age at death,Numeric
13,16,4,ICD code for underlying cause-of-death 3 digits:,Numeric
17,19,3,Cause-of-Death Recode,Numeric
20,23,4,Number of deaths,Numeric
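The same conversion can be scripted instead of done in a text editor. The sketch below is a Python alternative under a few assumptions: the pattern is my own (not the one used above), the end position and the width are treated as optional, and a missing width is computed from the start and end positions. That last choice handles rows such as `3-5 FIPS county code Numeric`, where the width is absent.

```python
import re

# start[-end] [width] label type; end and width are optional
LAYOUT_LINE = re.compile(r"^(\d+)(?:-(\d+))?\s+(?:(\d+)\s+)?(.+?)\s+(\w+)$")


def parse_layout(text):
    """Return (start, end, width, label, type) tuples for lines that look like layout rows."""
    rows = []
    for line in text.splitlines():
        match = LAYOUT_LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a layout row
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        width = int(match.group(3)) if match.group(3) else end - start + 1
        rows.append((start, end, width, match.group(4), match.group(5)))
    return rows


codebook = """1-2 2 FIPS State code Numeric
3-5 FIPS county code Numeric
6-9 4 Year of death Numeric
11-12 2 Age at death Numeric
13-16 4 ICD code for underlying cause-of-death 3 digits: Numeric
17-19 3 Cause-of-Death Recode Numeric
20-23 4 Number of deaths Numeric"""

layout_rows = parse_layout(codebook)
for row in layout_rows:
    print(",".join(str(field) for field in row))
```

A pattern this permissive can also match unrelated codebook lines that happen to start with a number, so the output still deserves a quick visual check.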
Then, I read the layout file:
Now, I can read the fixed-width data file. I use the readr package (in my experience relatively fast for big datasets ~ 1 GB).
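For readers who prefer Python, the logic behind reading a fixed-width file is simple enough to sketch with string slicing. This is not what readr does internally; the column names and the two data lines below are made up, and positions are 1-based and inclusive, as in the layout.

```python
def read_fixed_width(lines, layout):
    """Slice each line into fields.

    layout is a list of (name, start, end) with 1-based, inclusive positions.
    """
    records = []
    for line in lines:
        record = {name: line[start - 1:end].strip() for name, start, end in layout}
        records.append(record)
    return records


# Hypothetical layout and data lines, for illustration only
layout = [("state", 1, 2), ("county", 3, 5), ("year", 6, 9)]
lines = ["550252010", "270532011"]

records = read_fixed_width(lines, layout)
print(records)
```

Values come back as strings, which keeps leading zeros in codes such as FIPS identifiers; numeric columns can be converted afterwards.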
Hopefully, this might save you some time!
Last Update: 06/29/2017
The `acsr` package helps extract variables and compute statistics using the American Community Survey and the Decennial US Census. It was created for the Applied Population Laboratory (APL) at the University of Wisconsin-Madison. The functions depend on the `acs` and `data.table` packages, so it is necessary to install them before using `acsr`. The `acsr` package is hosted on a GitHub repository and can be installed using `devtools`:
Remember to set the ACS API key, and to check the help documentation and the default values of the `acsr` functions.
The default dataset is `acs`, the level is `state` (Wisconsin, `state = "WI"`), the `endyear` is 2014, and the confidence level to compute margins of error (MOEs) is 90%.
The `acsr` functions can extract all the levels available in the `acs` package. The table below shows the summary and required levels when using the `acsdata` and `sumacs` functions:
summary number | levels |
---|---|
010 | us |
020 | region |
030 | division |
040 | state |
050 | state, county |
060 | state, county, county.subdivision |
140 | state, county, tract |
150 | state, county, tract, block.group |
160 | state, place |
250 | american.indian.area |
320 | state, msa |
340 | state, csa |
350 | necta |
400 | urban.area |
500 | state, congressional.district |
610 | state, state.legislative.district.upper |
620 | state, state.legislative.district.lower |
795 | state, puma |
860 | zip.code |
950 | state, school.district.elementary |
960 | state, school.district.secondary |
970 | state, school.district.unified |
We can use the `sumacs` function to extract variables and statistics. We have to specify the corresponding method (e.g., proportion or just variable), and the name of the statistic or variable to be included in the output.
Downloading the data can be slow, especially when many levels are being used (e.g., blockgroup). A better approach in those cases is to first download the data using the function `acsdata`, and then use them as input.
When computing statistics there are two ways to define the standard errors:
The behavior is controlled by the `one.zero` option (`one.zero = FALSE` or `one.zero = TRUE`).
For more details about how standard errors are computed for proportions, ratios and aggregations look at A Compass for Understanding and Using American Community Survey Data.
Below is an example when estimating proportions and using `one.zero = FALSE`:
When `one.zero = TRUE`:
When the square root value in the standard error formula doesn’t exist (e.g., the square root of a negative number), the ratio formula is used instead. The ratio adjustment is done variable by variable.
It can also be that the `one.zero` option makes the square root undefined. In those cases, the function again uses the ratio formula to compute standard errors. There is also a possibility that the standard error estimates using the ratio formula are higher than the proportion estimates without the `one.zero` option.
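The fallback described above can be sketched in a few lines. The code below follows the Census Bureau approximation for the standard error of a proportion, switching to the ratio formula (a plus sign under the square root) when the radicand is negative. It is written in Python for illustration, is not the `acsr` code, ignores the `one.zero` option, and uses made-up estimates and margins of error.

```python
import math


def moe_to_se(moe, z=1.645):
    """Convert a 90% margin of error into a standard error."""
    return moe / z


def proportion_se(numerator, denominator, se_numerator, se_denominator):
    """Proportion and its approximate standard error, with the ratio fallback."""
    p = numerator / denominator
    radicand = se_numerator ** 2 - (p ** 2) * se_denominator ** 2
    if radicand < 0:
        # Square root undefined: use the ratio formula instead
        radicand = se_numerator ** 2 + (p ** 2) * se_denominator ** 2
    return p, math.sqrt(radicand) / denominator


# Hypothetical estimates and 90% MOEs
p, se = proportion_se(400, 1000, moe_to_se(60), moe_to_se(50))
print(round(p, 3), round(se, 4), round(se * 1.645, 4))  # proportion, SE, 90% MOE
```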
Let’s get the African American and Hispanic population by state. In this case, we don’t have any estimation of margin of error.
The output can be formatted using a wide or long format:
And it can also be exported to a csv file:
We can combine geographic levels using two methods: (1) `sumacs` and (2) `combine.output`. The first one allows only single combinations, the second multiple ones.
If I want to combine two states (e.g., Wisconsin and Minnesota) I can use:
If I want to put together multiple combinations (e.g., groups of states):
Let’s color a map using poverty by county:
In sum, the `acsr` package:
Last Update: 02/07/2016
For each scale, I define a number or proportion of items (let’s say p) to create parcels (i.e., the average of some items, although not the whole scale). These parcels are then used as auxiliary variables to impute the original scales. There are different ways to define parcels. I implemented a solution in my R package sdazar; see the function `rowscore` for more details.
The function `rowscore` selects p items with the least missing data. For each case (row), it computes the parcels using the available information of the selected items. If only one item has information, only that one will be used. If more than one item has valid data, it averages all the selected items. If there are no items available in my initial selection, it picks p items from the rest of the unselected items to impute the original scale. In this particular example, I create parcels using half of the items:
The reason for using a proportion of the original items is to include as much information as possible while preventing strong linear dependencies between variables. Ideally, parcels are complete (no missing values). However, in some cases all the items are missing, so parcels can still have missing records (although fewer than the original scales).
Why not just use the average of the available items? That solution would implicitly assume that items perfectly correlate with the scale. We know that’s not a good assumption. That is why we worry about creating scales in the first place, right? Using parcels takes advantage of the available information (items with complete information) and the relationship between a portion of items and the scale.
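The parcel logic can be sketched as follows. This is a Python illustration of the steps described above, not the `rowscore` function from sdazar. In particular, it assumes that the fallback items are the next p items with the least missing data, and it uses `None` for missing responses; both are assumptions of the sketch.

```python
def parcel_scores(rows, p):
    """Average p items per row, preferring the items with the least missing data.

    rows is a list of equal-length lists, with None marking a missing item.
    """
    n_items = len(rows[0])
    missing = [sum(row[j] is None for row in rows) for j in range(n_items)]
    ranked = sorted(range(n_items), key=lambda j: missing[j])
    selected, rest = ranked[:p], ranked[p:]
    scores = []
    for row in rows:
        values = [row[j] for j in selected if row[j] is not None]
        if not values:
            # No selected item available: fall back to the next p items
            values = [row[j] for j in rest[:p] if row[j] is not None]
        scores.append(sum(values) / len(values) if values else None)
    return scores


# Hypothetical responses to a 4-item scale (None = missing)
items = [
    [1, 2, 3, 4],
    [2, None, 3, 5],
    [None, 4, None, None],
    [3, None, None, 1],
]
scores = parcel_scores(items, p=2)
print(scores)
```

In this toy data the first and last items have the least missing data, so they form the parcel; the third row has neither and falls back to the remaining items.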
Here I show a simple example using the National Longitudinal Study of Adolescent to Adult Health (Add Health). First, let’s look at some descriptives of the variables included in the imputation. I am using information from Waves 1 and 2. The scales/scores I am imputing are depression (19 items) and GPA (4 items). Variables ending with `.p` are parcels with 1/2 of the items of the original scale.
As expected, the correlation between the scales and parcels is high. GPA has most of the problems. Note that the `.p` parcels still have missing records, although far fewer than the original scales.
Let’s now impute the scales/scores using the R package MICE.
I adjust the predictor matrix to avoid feedback during the imputation (circularity between variables). The trick is to use only complete variables when imputing parcels.
Here is the adjusted predictor matrix:
Let’s impute the data!
Below are some plots to explore how the imputation goes.
I don’t see any problematic pattern. It seems I get a proper solution. The distribution of the variables also looks right.
Last Update: 06/02/2017
Enders, Craig K. 2010. Applied Missing Data Analysis. The Guilford Press.
Eekhout, Iris, Craig K. Enders, Jos W. R. Twisk, Michiel R. de Boer, Henrica C. W. de Vet, and Martijn W. Heymans. 2015. “Analyzing Incomplete Item Scores in Longitudinal Data by Including Item Score Information as Auxiliary Variables.” Structural Equation Modeling: A Multidisciplinary Journal 22 (4):588-602.
The package contains four functions:
These examples show how to allocate a sample size into strata. Look at ?astrata in R for definitions of the allocation procedures that are available.
We can adjust a bit more:
That’s it. A simple package to do simple calculations.
Note: I created a package with similar functions. See here.
The inputs are:
An example for n = 400 and all inputs at their default values:
The output is rounded to 4 decimals. A more complete example:
The sample size (n) must always be lower than the population (N). It is important to note that the final sample size used to compute the sampling error is:
\[n = \frac{N}{deff} * rr\]
Let’s get a sample size with an error of .03, a population of 1000 elements, a response rate of 0.80, and a design effect of 1.2:
If the sample size is bigger than the population because of low response rates or big design effects, the sample size will be fixed to N:
Finally, we can estimate different sample sizes by strata using vectors or a data frame:
As easy as falling off a log!
The data:
As can be seen, the data have five-year-interval age groups, so each projection forward will involve 5 years. The steps are very simple:
We have to estimate life table survival ratios, that is, proportions of birth cohorts surviving from one age interval to the next in a stationary population. Basically, we are summarizing the mortality experience of different cohorts assuming stationarity. Because census statistics refer to age “last birthday” (rather than exact age), I estimate ratios using $L_x$ (average number of survivors in an age interval) instead of $l_x$.
\[S_x = \frac{_5L_x}{_5L_{x-5}}\]
I compute the survival ratios using a loop in R. The estimation of the open-ended survival ratio is slightly different but still straightforward:
\[\frac{T_{85}}{T_{80}}\]
This is the tricky part. Because census statistics refer to age “last birthday”, and we are projecting every 5 years, the estimation of the number of person-years lived by women in each age group consists of the average number of women alive at the beginning and end of the period (assuming a linear change over the period). To take advantage of the Leslie matrix, I define the births in R using a loop as follows:
1/(1+1.05) corresponds to a transformation of age-specific fertility rates (sons and daughters) to maternity rates (only daughters), assuming that the ratio of male to female births (SBR) is constant across mothers’ ages. The number of births is also adjusted by the corresponding survival ratio from 0 to 5 years old ($\frac{_5L_0}{5 \times l_0}$; the number 5 goes away when simplifying).
I construct a Leslie matrix by replacing specific cells of an 18 x 18 matrix (18 age groups) by the vectors defined above (survival ratios and maternity rates):
Here we have the Leslie matrix:
Note that the last survival ratio is repeated in the last column (0.518). This is because the estimation of the open-ended survival ratio is:
\[(N_{80} + N_{85}) \times \frac{T_{85}}{T_{80}}\]
Using the R multiplication operator for matrices, I do a 5-year projection by simply multiplying the Leslie matrix by the population vector (remember that matrix multiplication is not commutative).
I obtain the same results as the book. By raising the Leslie matrix to a power, I can get the projected population of subsequent periods. Because R doesn’t have a power operator for matrices, I define a function called mp to raise matrices (it is not very efficient, but for this example it’s still useful).
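The projection step is easy to reproduce outside R. The sketch below, in plain Python, multiplies a Leslie matrix by a population vector and repeats the step instead of raising the matrix to a power, which gives the same result. The 3 x 3 matrix and the population are toy values, not the 18 x 18 matrix of this post; note that the last survival ratio is repeated in the last two columns of the last row, as in the open-ended group described above.

```python
def project(leslie, population):
    """One projection step: Leslie matrix times the population (column) vector."""
    return [sum(a * n for a, n in zip(row, population)) for row in leslie]


def project_periods(leslie, population, periods):
    """Apply the projection step repeatedly (equivalent to using a matrix power)."""
    for _ in range(periods):
        population = project(leslie, population)
    return population


# Toy Leslie matrix with three age groups (hypothetical values)
leslie = [
    [0.0, 1.2, 0.4],  # maternity rates
    [0.9, 0.0, 0.0],  # survival from group 1 to group 2
    [0.0, 0.5, 0.5],  # survival into and within the open-ended group
]
population = [100.0, 80.0, 60.0]

one_period = project_periods(leslie, population, 1)
two_periods = project_periods(leslie, population, 2)
print(one_period, two_periods)
```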
Let’s project the initial population for two periods (10 years):
Again, I get the same result as the book. The nice thing about all this is that, by estimating eigenvalues and eigenvectors, I can obtain the intrinsic growth rate and age distribution of the “stable equivalent” population. Using the eigen function in R, I can identify the dominant eigenvalue (the one with the largest absolute value) and the corresponding eigenvector:
The population is growing, but only slightly.
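As a cross-check of the eigenvalue approach, the dominant eigenvalue can also be approximated by power iteration: project repeatedly, rescale the vector to sum to one, and read the growth factor per period. The sketch below uses Python and a toy 3 x 3 matrix (hypothetical values, not the matrix of this post), and swaps R's `eigen` for this simpler technique. With 5-year steps, the intrinsic growth rate is \(r = \ln(\lambda) / 5\).

```python
import math


def dominant_eigen(matrix, iterations=500):
    """Approximate the dominant eigenvalue and eigenvector by power iteration."""
    n = len(matrix)
    vector = [1.0 / n] * n
    value = 0.0
    for _ in range(iterations):
        product = [sum(a * v for a, v in zip(row, vector)) for row in matrix]
        value = sum(product)  # growth factor, because vector sums to one
        vector = [x / value for x in product]
    return value, vector


# Toy Leslie matrix with three age groups (hypothetical values)
leslie = [
    [0.0, 1.2, 0.4],
    [0.9, 0.0, 0.0],
    [0.0, 0.5, 0.5],
]

growth_factor, stable_age_distribution = dominant_eigen(leslie)
intrinsic_rate = math.log(growth_factor) / 5  # each projection step spans 5 years
print(growth_factor, intrinsic_rate, stable_age_distribution)
```

The eigenvector, scaled to sum to one, is the age distribution of the stable equivalent population.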
The population momentum corresponds to the growth of a population after imposing replacement fertility conditions, that is, NRR=1. Thus, the first thing we have to do is to estimate NRR.
We can quickly estimate the intrinsic growth rate using NRR:
Very close to our estimation using cohort component projection. To impose the replacement condition, I just have to divide the first row of the Leslie matrix by NRR.
To get the population momentum we have to project the initial population until the growth is zero (here I raised the matrix 100 times), and then compute the ratio between the initial population and the non-growing (stationary) population.
After imposing the replacement condition, the population grew 1%.