Power Up Your Research: Interaction Effects and Sample Size

Very often, I encounter practitioners interested in estimating an “interaction effect,” also known as a “moderation effect” in management. The goal is to determine whether the effect of a certain intervention is influenced by a third factor. For instance, in a marketing context, the effectiveness of advertising (independent variable) on sales performance (dependent variable) might be influenced by demographics or socioeconomic status (moderator).

Ever since I read Andrew Gelman's blog post on this topic some time ago, I have been reminding researchers that the number of observations required for a well-powered test of an interaction effect is usually much higher than the number needed to detect a main effect. Gelman's post went viral as a "16 times rule," but the actual requirement depends on your assumptions.

A low-power test has a low chance of detecting a true effect, which can mean wasted resources. Worse, when an underpowered test does reach statistical significance, the estimate tends to grossly overestimate the true effect.

To provide a practical understanding of these requirements and to illustrate the issue, my colleague Francisco Sesto and I built a small dashboard based on simulations. This tool allows you to manipulate the magnitudes of the main and interaction effects and visualize the power and the number of observations required.

The dashboard

This dashboard allows you to explore the relationship between sample size and the power to detect: i) a main effect and ii) an interaction effect.

Power is defined as the probability of detecting an effect (i.e., rejecting the null hypothesis) when an actual effect exists.

Generally, the larger the actual effect, the higher the power. Similarly, the larger the sample size, the higher the power. This analysis is typically employed to determine the sample size required to detect a given effect at a specified power. Alternatively, if additional data collection is not possible, understanding the test's power can help identify potential limitations in the analysis.
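
For intuition, here is a minimal, hypothetical sketch (in Python; not the dashboard's actual code, and the function name and defaults are ours) of how power can be estimated by simulation for a simple two-group comparison:

    # Estimate power by simulation: the share of simulated experiments in which
    # a two-sample t-test detects a true standardized effect at the 5% level.
    import numpy as np
    from scipy import stats

    def simulated_power(n_per_arm, effect_in_sd, alpha=0.05, n_sims=1000, seed=0):
        rng = np.random.default_rng(seed)
        rejections = 0
        for _ in range(n_sims):
            control = rng.normal(0.0, 1.0, n_per_arm)
            treated = rng.normal(effect_in_sd, 1.0, n_per_arm)  # effect in SD units
            _, p_value = stats.ttest_ind(treated, control)
            rejections += p_value < alpha
        return rejections / n_sims

    # Both a larger sample and a larger effect raise power:
    print(simulated_power(n_per_arm=50, effect_in_sd=0.5))
    print(simulated_power(n_per_arm=200, effect_in_sd=0.5))
    print(simulated_power(n_per_arm=50, effect_in_sd=1.0))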

Consider a scenario where you want to test the effect of an intervention (T). Define β₁ as the size of the intervention's effect, i.e., the difference in the mean outcome between treated and non-treated units. The dashboard allows you to select the size of this effect. For convenience, the effect is expressed in standardized terms (measured in standard deviations of the outcome). For example, β₁ = 1 means testing whether the effect equals one standard deviation of the outcome.

Additionally, suppose you are interested in testing for a larger effect in a specific subgroup (e.g., a larger effect for women than for men). This interaction effect, β₃, is defined as the difference between the intervention effect in one subgroup and the intervention effect in the other. For convenience, the dashboard lets you select this difference as a percentage of the main effect: for example, with a main effect of one standard deviation, a 50% interaction means the subgroup effects differ by half a standard deviation.
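
As a concrete illustration of this parameterization (a hypothetical sketch with our own variable names, not the dashboard's internals), the following code generates data in which the main effect is expressed in standard deviations of the outcome and the interaction as a share of the main effect:

    # Simulate outcomes for a 2x2 design: treatment T and subgroup W.
    import numpy as np

    def simulate_sample(n, main_effect_sd, interaction_share, seed=None):
        rng = np.random.default_rng(seed)
        T = rng.integers(0, 2, n)  # treated (1) vs. non-treated (0)
        W = rng.integers(0, 2, n)  # subgroup dummy (e.g., 1 = men, 0 = women)
        interaction = interaction_share * main_effect_sd  # e.g., 0.5 = 50% of the main effect
        # Unit-variance noise, so the effects are measured in standard deviations
        Y = main_effect_sd * T + interaction * T * W + rng.normal(0.0, 1.0, n)
        return T, W, Y

    # Example: a main effect of 1 SD and a subgroup difference of half a SD
    T, W, Y = simulate_sample(n=500, main_effect_sd=1.0, interaction_share=0.5, seed=1)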

 

Insights

The reason so many more observations are often needed for interaction effects is that, when measuring such an effect (β₃), you are typically:

  1. Trying to measure a much smaller effect than the main effect (e.g., the expected difference between demographic groups is much smaller than the average effect being studied).
  2. Dividing the data into more groups (to estimate differences-in-differences), leading to larger standard errors.

Additionally, our dashboard shows that these requirements do not scale linearly with the effect sizes involved. The "16 times" rule is a rough guideline and is highly sensitive to the relative sizes of the effects you aim to capture.
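
As a rough back-of-the-envelope sketch of where the factor of 16 can come from (following the argument in Gelman's post, and assuming equal-sized cells and a common outcome standard deviation σ), compare the standard errors of the two estimates for a total sample of size N:

    % Main effect: difference between two group means of N/2 observations each
    \mathrm{SE}_{\text{main}} = \sigma\sqrt{\tfrac{1}{N/2}+\tfrac{1}{N/2}} = \tfrac{2\sigma}{\sqrt{N}}
    % Interaction: difference-in-differences across four cells of N/4 observations each
    \mathrm{SE}_{\text{interaction}} = \sigma\sqrt{\tfrac{4}{N/4}} = \tfrac{4\sigma}{\sqrt{N}} = 2\,\mathrm{SE}_{\text{main}}

The interaction's standard error is twice as large as the main effect's; if the interaction is additionally assumed to be half the size of the main effect, its signal-to-noise ratio is four times smaller, and because standard errors shrink only with the square root of the sample size, recovering that ratio requires roughly 4² = 16 times as many observations. Assume a different ratio between the two effects and the multiplier changes accordingly.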

By understanding these nuances and using tools like our dashboard, practitioners can better plan their studies to ensure sufficient power for detecting interaction effects. This not only improves the reliability of their findings but also optimizes resource allocation.

 

Technical Notes

The objective of this dashboard is to provide an approximation of sample size requirements under a set of basic assumptions:

  1. The outcome variable is assumed to be normally distributed.

  2. For simplicity, the sizes of the two subgroups in the population are assumed to be equal.

  3. The parameters are defined in the following regression model:

    Y = β₀ + β₁·T + β₂·W + β₃·(T × W) + ε

    where W is a dummy variable indicating the subgroup (for example, 1 for men and 0 for women), T is the treatment indicator, β₁ is the main effect, and β₃ is the interaction effect. Power was computed at a 5% significance level, with 1,000 simulated datasets per scenario, as illustrated in the sketch below.
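
A minimal sketch of that procedure under the assumptions above (our own re-implementation for illustration, not the dashboard's actual code; the function and parameter names are ours):

    # Simulate from Y = b0 + b1*T + b2*W + b3*(T*W) + e with standard-normal errors,
    # fit the regression by OLS, and record how often each coefficient is
    # significant at the 5% level across 1,000 simulated datasets.
    import numpy as np
    import statsmodels.api as sm

    def power_main_and_interaction(n, b1, b3, alpha=0.05, n_sims=1000, seed=0):
        rng = np.random.default_rng(seed)
        reject_main = reject_interaction = 0
        for _ in range(n_sims):
            T = rng.integers(0, 2, n)  # treatment assignment
            W = rng.integers(0, 2, n)  # subgroup dummy (approximately equal-sized groups)
            # b0 and b2 are set to 0 here; they do not affect the power of these tests
            Y = b1 * T + b3 * T * W + rng.normal(0.0, 1.0, n)
            X = sm.add_constant(np.column_stack([T, W, T * W]).astype(float))
            pvals = sm.OLS(Y, X).fit().pvalues  # order: constant, T, W, T*W
            reject_main += pvals[1] < alpha
            reject_interaction += pvals[3] < alpha
        return reject_main / n_sims, reject_interaction / n_sims

    # Example: main effect of 0.2 SD with an interaction equal to 50% of it
    print(power_main_and_interaction(n=1000, b1=0.2, b3=0.1))

With a configuration like this, the interaction test ends up with far lower power than the main-effect test, which is exactly the gap the dashboard is meant to make visible.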

 

References

This dashboard was inspired by Andrew Gelman's "You need 16 times the sample size to estimate an interaction than to estimate a main effect" blog post (dated March 15, 2018), which is also discussed in the book "Regression and Other Stories" by Gelman, A., Hill, J., & Vehtari, A. (2021).