What is statistical power and precision?
This post is going to give you some practical tips to increase statistical power in your research. Before going there though, let’s make sure everyone is on the same page by starting with some definitions.
Statistical power is the probability that the test will reject the null hypothesis when the null hypothesis is false. Many authors suggest a statistical power rate of at least .80, which corresponds to an 80% probability of not committing a Type II error.
Precision refers to the width of the confidence interval for an effect size. The smaller this width, the more precise your results are. For 80% power, the confidence interval width will be roughly plus or minus 70% of the population effect size (Goodman & Berlin, 1994). Studies that have low precision have a greater probability of both Type I and Type II errors (Button et al., 2013).
To get an idea of how this works, here are a few examples of the sample size required to achieve .80 power for small, medium, and large (Cohen, 1992) correlations as well as the expected confidence intervals
|Population Effect Size||Sample Size for 80% Power||Estimated Precision|
|r = .10||782||95% CI [.03, .17]|
|r = .30||84||95% CI [.09, .51]|
|r = .50||29||95% CI [.15, .85]|
Studies in psychology are grossly underpowered
Okay, so now you know what power is. But why should you care? Fifty years ago, Cohen (1962) estimated the statistical power to detect a medium effect size in abnormal psychology was about .48. That’s a false negative rate of 52%, which is no better than a coin-flip! The situation has improved slightly, but it’s still a serious problem today. For instance, one review suggested only 52% of articles in the applied psychology literature achieved .80 power for a medium effect size (Mone et al., 1996). This is in part because psychologists are studying small effects. One massive review of 322 meta-analyses including 8 million participants (Richard et al., 2003) suggested that the average effect size in social psychology is relatively small (r = .21). To put this into perspective, you’d need 175 participants to have .80 power for a simple correlation between two variables at this effect size. This gets even worse when we’re studying interaction effects. One review suggests that the average effect size for interaction effects is even smaller (f2 = .009), which means that sample sizes of around 875 people would be needed to achieve .80 power (Aguinis et al., 2005). Odds are, if you took the time to design a research study and collect data, you want to find a relationship if one really exists. You don’t want to "miss" something that is really there. More than this, you probably want to have a reasonably precise estimate of the effect size (it’s not that impressive to just say a relationship is positive and probably non-zero). Below, I discuss concrete strategies for improving power and precision.
What can we do to increase power?
It is well-known that increasing sample size increases statistical power and precision. Increasing the population effect size increases statistical power, but has no effect on precision (Maxwell et al., 2008). Increasing sample size improves power and precision by reducing standard error of the effect size. Take a look at this formula for the confidence interval of a linear regression coefficient (McClelland, 2000):
MSE is the mean square error, n is the sample size, Vx is the variance of X, and (1-R2) is the proportion of the variance in X not shared by any other variables in the model. Okay, hopefully you didn’t nod off there. There’s a reason I’m showing you this formula. In this formula, decreasing any value in the numerator (MSE) or increasing anything in the denominator (n, Vx, 1-R2) will decrease the standard error of the effect size, and will thus increase power and precision. This formula demonstrates that there are at least three other ways to increase statistical power aside from sample size: (a) Decreasing the mean square error; (b) increasing the variance of x; and (c) increasing the proportion of the variance in X not shared by any other predictors in the model. Below, I’ll give you a few ways to do just that.
Recommendation 1: Decrease the mean square error
Referring to the formula above, you can see that decreasing the mean square error will have about the same impact as increasing sample size. Okay. You’ve probably heard the term "mean square error" before, but the definition might be kind of fuzzy. Basically, your model makes a prediction for what the outcome variable (Y) should be, given certain values of the predictor (X). Naturally, it’s not a perfect prediction because you have measurement error, and because there are other important variables you probably didn’t measure. The mean square error is the difference between what your model predicts, and what the true values of the data actually are. So, anything that improves the quality of your measurement or accounts for potential confounding variables will reduce the mean square error, and thus improve statistical power. Let’s make this concrete. Here are three specific techniques you can use:
a) Reduce measurement error by using more reliable measures(i.e., better internal consistency, test-retest reliability, inter-rater reliability, etc.). You’ve probably read that .70 is the "rule-of-thumb" for acceptable reliability. Okay, sure. That’s publishable. But consider this: Let’s say you want to test a correlation coefficient. Assuming both measures have a reliability of .70, your observed correlation will be about 1.43 times smaller than the true population parameter (I got this using Spearman’s correlation attenuation formula). Because you have a smaller observed effect size, you end up with less statistical power. Why do this to yourself? Reduce measurement error. If you’re an experimentalist, make sure you execute your experimental manipulations exactly the same way each time, preferably by automating them. Slight variations in the manipulation (e.g., different locations, slight variations in timing) might reduce the reliability of the manipulation, and thus reduce power.
b) Control for confounding variables. With correlational research, this means including control variables that predict the outcome variable, but are relatively uncorrelated with other predictor variables. In experimental designs, this means taking great care to control for as many possible confounds as possible. In both cases, this reduces the mean square error and improves the overall predictive power of the model – and thus, improves statistical power. Be careful when adding control variables into your models though: There are diminishing returns for adding covariates. Adding a couple of good covariates is bound to improve your model, but you always have to balance predictive power against model complexity. Adding a large number of predictors can sometimes lead to overfitting (i.e., the model is just describing noise or random error) when there are too many predictors in the model relative to the sample size. So, controlling for a couple of good covariates is generally a good idea, but too many covariates will probably make your model worse, not better, especially if the sample is small.
c) Use repeated-measures designs. Repeated measures designs are where participants are measured multiple times (e.g., once a day surveys, multiple trials in an experiment, etc.). Repeated measures designs reduce the mean square error by partitioning out the variance due to individual participants. Depending on the kind of analysis you do, it can also increase the degrees of freedom for the analysis substantially. For example, you might only have 100 participants, but if you measured them once a day for 21 days, you’ll actually have 2100 data points to analyze. The data analysis can get tricky and the interpretation of the data may change, but many multilevel and structural equation models can take advantage of these designs by examining each measurement occasion (i.e., each day, each trial, etc.) as the unit of interest, instead of each individual participant. Increasing the degrees of freedom is much like increasing the sample size in terms of increasing statistical power. I’m a big fan of repeated measures designs, because they allow researchers to collect a lot of data from fewer participants.
Recommendation 2: Increase the variance of your predictor variable
Another less-known way to increase statistical power and precision is to increase the variance of your predictor variables (X). The formula listed above shows that doubling the variance of X is has the same impact on increasing statistical precision as doubling the sample size does! So it’s worth figuring this out.
a) In correlational research, use more comprehensive continuous measures. That is, there should be a large possible range of values endorsed by participants. However, the measure should also capture many different aspects of the construct of interest; artificially increasing the range of X by adding redundant items (i.e., simply re-phrasing existing items to ask the same question) will actually hurt the validity of the analysis. Also, avoid dichotomizing your measures (e.g., median splits), because this reduces the variance and typically reduces power (MacCallum et al., 2002).
b) In experimental research, unequally allocating participants to each condition can improve statistical power. For example, if you were designing an experiment with 3 conditions (let’s say low, medium, or high self-esteem). Most of us would equally assign participants to all three groups, right? Well, as it turns out, assigning participants equally across groups usually reduces statistical power. The idea behind assigning participants unequally to conditions is to maximize the variance of X for the particular kind of relationship under study -- which, according the formula I gave earlier, will increase power and precision. For example, the optimal design for a linear relationship would be 50% low, 50% high, and omit the medium condition. The optimal design for a quadratic relationship would be 25% low, 50% medium, and 25% high. The proportions vary widely depending on the design and the kind of relationship you expect, but I recommend you check out McClelland (1997) to get more information on efficient experimental designs. You might be surprised.
Recommendation 3: Make sure predictor variables are uncorrelated with each other
A final way to increase statistical power is to increase the proportion of the variance in X not shared with other variables in the model. When predictor variables are correlated with each other, this is known as colinearity. For example, depression and anxiety are positively correlated with each other; including both as simultaneous predictors (say, in multiple regression) means that statistical power will be reduced, especially if one of the two variables actually doesn’t predict the outcome variable. Lots of textbooks suggest that we should only be worried about this when colinearity is extremely high (e.g., correlations around > .70). However, studies have shown that even modest intercorrlations among predictor variables will reduce statistical power (Mason et al., 1991). Bottom line: If you can design a model where your predictor variables are relatively uncorrelated with each other, you can improve statistical power.
Increasing statistical power is one of the rare times where what is good for science, and what is good for your career actually coincides. It increases the accuracy and replicability of results, so it’s good for science. It also increases your likelihood of finding a statistically significant result (assuming the effect actually exists), making it more likely to get something published. You don’t need to torture your data with obsessive re-analysis until you get p < .05. Instead, put more thought into research design in order to maximize statistical power. Everyone wins, and you can use that time you used to spend sweating over p-values to do something more productive. Like volunteering with the Open Science Collaboration.
Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect Size and Power in Assessing Moderating Effects of Categorical Variables Using Multiple Regression: A 30-Year Review. Journal of Applied Psychology, 90, 94-107. doi:10.1037/0021-9010.90.1.94
Button, K. S., Ioannidis, J. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376. doi: 10.1038/nrn3475
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65, 145-153. doi:10.1037/h0045186
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159. doi:10.1037/0033-2909.112.1.155
Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Annals of Internal Medicine, 121, 200-206.
Hansen, W. B., & Collins, L. M. (1994). Seven ways to increase power without increasing N. In L. M. Collins & L. A. Seitz (Eds.), Advances in data analysis for prevention intervention research (NIDA Research Monograph 142, NIH Publication No. 94-3599, pp. 184–195). Rockville, MD: National Institutes of Health.
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40. doi:10.1037/1082-989X.7.1.19
Mason, C. H., & Perreault, W. D. (1991). Collinearity, power, and interpretation of multiple regression analysis. Journal of Marketing Research, 28, 268-280. doi:10.2307/3172863
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537-563. doi:10.1146/annurev.psych.59.103006.093735
McClelland, G. H. (1997). Optimal design in psychological research. Psychological Methods, 2, 3-19. doi:10.1037/1082-989X.2.1.3
McClelland, G. H. (2000). Increasing statistical power without increasing sample size. American Psychologist, 55, 963-964. doi:10.1037/0003-066X.55.8.963
Mone, M. A., Mueller, G. C., & Mauland, W. (1996). The perceptions and usage of statistical power in applied psychology and management research. Personnel Psychology, 49, 103-120. doi:10.1111/j.1744-6570.1996.tb01793.x
Open Science Collaboration. (in press). The Reproducibility Project: A model of large-scale collaboration for empirical research on reproducibility. In V. Stodden, F. Leisch, & R. Peng (Eds.), Implementing Reproducible Computational Research (A Volume in The R Series). New York, NY: Taylor & Francis. doi:10.2139/ssrn.2195999
Richard, F. D., Bond, C. r., & Stokes-Zoota, J. J. (2003). One Hundred Years of Social Psychology Quantitatively Described. Review of General Psychology, 7, 331-363. doi:10.1037/1089-2622.214.171.1241