Neuroimaging Power Guide
Purpose
Statistical power in neuroimaging is fundamentally different from power in behavioral research. The massive multiple comparisons problem (testing ~100,000 voxels simultaneously), spatial correlation structure, and non-standard test statistics mean that standard power formulas underestimate required sample sizes. Meanwhile, the field has historically been severely underpowered: the median fMRI study has only ~20% power to detect a typical effect (Button et al., 2013).
A competent programmer without neuroimaging training would apply standard power calculations (e.g., G*Power for a t-test) without accounting for multiple comparison correction, would not know typical effect sizes in neuroimaging, and would dramatically underestimate the sample sizes needed. This skill encodes the domain-specific knowledge for neuroimaging power analysis.
When to Use This Skill
- Planning sample size for a new fMRI, EEG, or MEG study
- Estimating power for grant applications or registered reports
- Determining whether a published study was adequately powered
- Choosing between ROI-based and whole-brain analysis based on power constraints
- Evaluating the reliability implications of sample size choices
Research Planning Protocol
Before executing the domain-specific steps below, you MUST:
- State the research question — What specific question is this analysis/paradigm addressing?
- Justify the method choice — Why is this approach appropriate? What alternatives were considered?
- Declare expected outcomes — What results would support vs. refute the hypothesis?
- Note assumptions and limitations — What does this method assume? Where could it mislead?
- Present the plan to the user and WAIT for confirmation before proceeding.
For detailed methodology guidance, see the research-literacy skill.
⚠️ Verification Notice
This skill was generated by AI from academic literature. All parameters, thresholds, and citations require independent verification before use in research. If you find errors, please open an issue.
Why Neuroimaging Power Is Different
Standard power analysis assumes a single statistical test. Neuroimaging involves:
| Challenge | Impact on Power | Source |
|---|---|---|
| Massive multiple comparisons | ~100,000 voxels tested; correction reduces sensitivity by orders of magnitude | Nichols & Hayasaka, 2003 |
| Spatial smoothness | Adjacent voxels are correlated, reducing effective number of independent tests but complicating power calculation | Worsley et al., 1996 |
| Multi-level inference | Subject-level estimation + group-level test; both levels contribute noise | Mumford & Nichols, 2008 |
| Effect size variability | Effects vary across voxels, regions, and subjects; no single "effect size" characterizes a study | Poldrack et al., 2017 |
| Threshold-dependent power | Power depends heavily on the statistical threshold (corrected vs. uncorrected) and correction method | Hayasaka et al., 2007 |
Key implication: A standard G*Power calculation for a two-sample t-test will dramatically overestimate the power of a whole-brain fMRI analysis because it ignores multiple comparison correction (Mumford & Nichols, 2008).
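To make the gap concrete, here is a minimal sketch comparing the N required for 80% power at d = 0.5 for a single uncorrected test versus a Bonferroni correction over ~100,000 voxels. It uses a normal approximation to one-sample t-test power (exact t-based calculations give slightly larger N, and RFT or cluster corrections are less strict than Bonferroni), so treat the numbers as illustrative, not as a substitute for the simulation tools described below:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection (ample precision here)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def required_n(d, alpha, power=0.80):
    """Smallest n for a two-sided one-sample z-test of effect size d."""
    z_alpha = norm_ppf(1.0 - alpha / 2.0)
    z_beta = norm_ppf(power)
    return math.ceil(((z_alpha + z_beta) / d) ** 2)

d = 0.5
n_single = required_n(d, alpha=0.05)          # one test, as in G*Power
n_bonf = required_n(d, alpha=0.05 / 100_000)  # Bonferroni over ~100k voxels
print(n_single, n_bonf)
```

Under these assumptions the corrected analysis needs roughly four times the sample of the single-test calculation, which is the overestimation the text warns about.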
Typical Effect Sizes in Neuroimaging
fMRI Effect Sizes
| Analysis Type | Typical Effect Size | Unit | Source |
|---|---|---|---|
| Task activation (voxel-level) | Cohen's d = 0.5-1.0 | Standardized mean difference | Poldrack et al., 2017 |
| Task activation (ROI-level) | Cohen's d = 0.5-1.5 | Standardized mean difference | Poldrack et al., 2017 |
| Between-group difference (voxel) | Cohen's d = 0.3-0.8 | Standardized mean difference | Poldrack et al., 2017 |
| Functional connectivity (correlation) | r = 0.2-0.5 | Pearson correlation | Marek et al., 2022 |
| Brain-behavior association | r = 0.1-0.3 | Pearson correlation | Marek et al., 2022 |
| Brain-wide association (replicable) | r < 0.05 at N < 1000 | Pearson correlation | Marek et al., 2022 |
Critical finding: Marek et al. (2022) demonstrated that brain-behavior correlations in typical neuroimaging samples (N < 100) are severely inflated. Replicable brain-behavior associations require N > 2,000 for whole-brain analyses.
EEG/ERP Effect Sizes
| Analysis Type | Typical Effect Size | Source |
|---|---|---|
| ERP component amplitude (e.g., N400, P300) | Cohen's d = 0.3-0.8 | Boudewyn et al., 2018 |
| ERP latency differences | Cohen's d = 0.2-0.5 | Luck, 2014 |
| EEG oscillatory power | Cohen's d = 0.3-0.6 | Cohen, 2014 |
| EEG connectivity (coherence/PLV) | Cohen's d = 0.2-0.5 | Cohen, 2014 |
Sample Size Benchmarks
fMRI Sample Size Recommendations
| Design | Minimum N | Recommended N | Assumptions | Source |
|---|---|---|---|---|
| Within-subject task activation | 20 | 25-30 | Large effect (d > 0.8), lenient correction | Desmond & Glover, 2002 |
| Between-group comparison (large effect, d = 0.8) | 20 per group | 25-30 per group | Whole-brain, cluster-corrected | Thirion et al., 2007 |
| Between-group comparison (medium effect, d = 0.5) | 40 per group | 50+ per group | Whole-brain, cluster-corrected | Thirion et al., 2007; Poldrack et al., 2017 |
| Resting-state individual differences | 25+ | 50+ (much more for replicability) | Depends on reliability of measure | Marek et al., 2022 |
| Brain-behavior correlations | 100+ | N > 2,000 for replicable whole-brain | Large-scale only | Marek et al., 2022 |
| ROI-based analysis (a priori) | 15-20 | 25+ | Single ROI, no whole-brain correction | Desmond & Glover, 2002 |
EEG/ERP Sample Size Recommendations
| Design | Minimum per Condition | Recommended per Condition | Source |
|---|---|---|---|
| ERP trials per condition per subject | 30 | 40-60 | Boudewyn et al., 2018 |
| ERP between-group (medium d = 0.5) | 34 per group | 50+ per group | Boudewyn et al., 2018 |
| ERP within-subject (medium d = 0.5) | 25 subjects | 30+ subjects | Luck, 2014 |
| Time-frequency analysis | 40 trials | 60+ trials | Cohen, 2014 |
Power at Common Sample Sizes
| N (per group) | Power for d = 0.5 (uncorrected) | Power for d = 0.5 (corrected, whole-brain) | Power for d = 0.8 (corrected) |
|---|---|---|---|
| 10 | ~26% | < 10% | ~25% |
| 20 | ~50% | ~20% | ~50% |
| 30 | ~70% | ~35% | ~70% |
| 40 | ~82% | ~50% | ~85% |
| 60 | ~94% | ~70% | ~95% |
Values are approximate, based on simulations from Mumford & Nichols (2008) and Desmond & Glover (2002). Exact power depends on design, smoothness, effect spatial extent, and correction method.
Power Decision Tree
What type of analysis are you planning?
|
+-- Whole-brain voxelwise analysis
| |
| +-- Within-subject (one-sample t-test)
| | --> Minimum N = 20; aim for N = 25-30
| | (Desmond & Glover, 2002)
| |
| +-- Between-group comparison
| | |
| | +-- Large expected effect (d > 0.8)
| | | --> N = 20-25 per group (Thirion et al., 2007)
| | |
| | +-- Medium expected effect (d = 0.5)
| | | --> N = 40-50 per group (Poldrack et al., 2017)
| | |
| | +-- Small expected effect (d = 0.3)
| | --> N = 80+ per group; consider ROI approach
| |
| +-- Brain-behavior correlation
| --> N = 100+ minimum; N > 2,000 for replicability
| (Marek et al., 2022)
|
+-- ROI-based analysis (a priori regions)
| --> Use standard power formulas (G*Power) with expected
| effect size from literature or pilot data.
| No multiple comparison correction needed for single ROI.
| N = 15-30 typical for medium-large effects.
|
+-- ERP analysis
|
+-- Between-group
| --> 30-50 per group for medium effects
| (Boudewyn et al., 2018)
|
+-- Within-subject
--> 25-30 subjects, 30+ trials per condition
(Boudewyn et al., 2018; Luck, 2014)
Simulation-Based Power Approaches
fMRIpower (Mumford & Nichols, 2008)
Estimates power using pilot group-level activation maps:
- Run a pilot study (or use published results) to obtain group-level statistical maps
- Estimate effect sizes at each voxel from the pilot data
- Simulate new datasets with varying N by resampling from the estimated effect size and variance
- Apply the full statistical pipeline (including multiple comparison correction) to each simulation
- Power = proportion of simulations that detect the effect at a given ROI or voxel
Requirements: Pilot data from at least 10-15 subjects for stable variance estimates (Mumford & Nichols, 2008)
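The resampling logic of these steps can be sketched for a single ROI. This is a toy version under stated assumptions — subject-level effects drawn from a pilot mean and SD in standardized units, an uncorrected one-sample test with a normal critical value approximating the t critical value — whereas fMRIpower itself works voxelwise with the full correction pipeline:

```python
import math
import random

def simulate_power(pilot_mean, pilot_sd, n, n_sims=2000, seed=0):
    """Monte Carlo power for a one-sample test on a single ROI.

    Each simulated study draws n subject-level effects from the pilot
    mean/SD, then tests the group mean against zero.
    """
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided alpha = 0.05, normal approximation to t
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(pilot_mean, pilot_sd) for _ in range(n)]
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        t = mean / math.sqrt(var / n)
        if abs(t) > z_crit:
            rejections += 1
    return rejections / n_sims

# Pilot ROI effect of d = 0.5 (mean 0.5, SD 1.0 in standardized units)
print(simulate_power(0.5, 1.0, n=10))
print(simulate_power(0.5, 1.0, n=40))
```

Power is simply the fraction of simulated studies that reject the null; step 4 of the real procedure replaces the z threshold with the full corrected pipeline.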
NeuroPowerTools (Durnez et al., 2016)
Web-based tool for peak-based power estimation:
- Upload an unthresholded statistical map from a pilot or published study
- The tool fits a mixture model to the peak distribution (null + alternative)
- Estimates the proportion of truly active voxels and their average effect size
- Computes power for new studies with varying N and thresholds
Advantage: Does not require individual subject data; can use published group maps.
URL: https://neuropowertools.org
Permutation-Based Power (Hayasaka et al., 2007)
- Generate simulated datasets under the alternative hypothesis using effect size maps from pilot data
- For each simulated dataset, run a full permutation test (5,000+ permutations)
- Compute power as the proportion of simulations in which the permutation test rejects the null
Advantage: Fully nonparametric; accounts for the exact multiple comparison correction used.
Disadvantage: Computationally expensive (requires running thousands of permutation tests per power estimate).
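The inner loop of this approach — a one-sample sign-flip permutation test — can be sketched as follows. This is a toy single-ROI version; the real whole-brain procedure runs the flip over the entire image and uses the maximum statistic across voxels for FWE correction:

```python
import random

def sign_flip_pvalue(data, n_perm=5000, seed=0):
    """One-sample permutation test: under H0 the distribution is
    symmetric around zero, so randomly flipping signs generates the
    null distribution of the group mean."""
    rng = random.Random(seed)
    observed = abs(sum(data) / len(data))
    count = 0
    for _ in range(n_perm):
        perm_mean = sum(x * rng.choice((-1, 1)) for x in data) / len(data)
        if abs(perm_mean) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one correction for validity

# Strong, consistent effect across 8 subjects
print(sign_flip_pvalue([0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.6]))
```

Power estimation then wraps this test in the Monte Carlo loop above: simulate many datasets under the alternative and count how often the permutation p-value falls below alpha — which is why the approach is so expensive.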
PowerMap (Joyce & Hayasaka, 2012)
Simulation-based power using parametric assumptions:
- Specify effect size map (from pilot data or assumed values)
- Specify noise model (based on residuals from pilot data)
- Simulate datasets with varying N
- Apply parametric statistical testing with specified correction method
- Estimate power at each voxel
Multiple Comparison Correction Impact on Power
The choice of correction method dramatically affects required sample size:
| Correction Method | Effective Alpha per Voxel | Relative Power | Source |
|---|---|---|---|
| None (p < 0.001 uncorrected) | 0.001 | Highest (but invalid inference) | -- |
| FDR q < 0.05 | ~0.0001-0.001 (data-dependent) | Moderate-High | Genovese et al., 2002 |
| Cluster-based (CDT p < 0.001) | Depends on cluster size | Moderate-High for large effects | Eklund et al., 2016 |
| Voxelwise FWE (RFT, p < 0.05) | ~0.0000005 (order of 0.05 / 100,000 voxels) | Low | Worsley et al., 1996 |
| TFCE + permutation | Varies | Moderate | Smith & Nichols, 2009 |
Domain insight: Switching from voxelwise FWE to cluster-based or FDR correction can increase power by 50-200% for the same sample size, because these methods exploit the spatial extent of true activations (Nichols & Hayasaka, 2003).
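The power difference between correction methods can be seen directly on a list of p-values: the Benjamini-Hochberg step-up procedure (the FDR method of Genovese et al., 2002) adapts its threshold to the data, so when many tests carry true effects it declares more discoveries than Bonferroni at the same nominal level. A minimal sketch with illustrative p-values:

```python
def bonferroni(pvals, alpha=0.05):
    """Indices significant under Bonferroni correction."""
    m = len(pvals)
    return [i for i, p in enumerate(pvals) if p <= alpha / m]

def benjamini_hochberg(pvals, q=0.05):
    """Indices significant under the BH step-up FDR procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank  # largest rank passing the step-up criterion
    return sorted(order[:k_max])

# 10 "active" tests with small p-values plus 90 null tests
pvals = [0.0001 * (i + 1) for i in range(10)] + [0.5] * 90
print(len(bonferroni(pvals)))          # Bonferroni threshold: 0.05/100
print(len(benjamini_hochberg(pvals)))  # BH recovers all 10 active tests
```

Here Bonferroni passes only 5 of the 10 active tests while BH passes all 10 — the same data-adaptive behavior that makes FDR more powerful than voxelwise FWE in practice.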
Test-Retest Reliability and Power
For individual differences designs (correlating brain measures with behavior), reliability of the brain measure is critical (Elliott et al., 2020):
| Measure | Typical ICC | Implication | Source |
|---|---|---|---|
| Task fMRI activation (ROI) | 0.3-0.6 | Poor to moderate reliability | Elliott et al., 2020 |
| Resting-state connectivity | 0.3-0.7 | Moderate reliability; depends on scan duration | Elliott et al., 2020 |
| ERP amplitude | 0.5-0.8 | Moderate to good | Cassidy et al., 2012 |
| EEG oscillatory power | 0.6-0.9 | Good to excellent | Cohen, 2014 |
Critical formula: The maximum detectable correlation between brain and behavior is bounded by the reliabilities of both measures:
r_observed_max = r_true * sqrt(reliability_brain * reliability_behavior)
With brain ICC = 0.5 and behavior reliability = 0.8, even a true correlation of r = 0.5 would appear as r = 0.5 * sqrt(0.5 * 0.8) = 0.32 on average (Elliott et al., 2020). This attenuation means far larger samples are needed.
Recommendation: For individual differences designs, collect longer scan sessions (at least 20-30 minutes of resting-state data; Birn et al., 2013) or use multi-session data to improve reliability.
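The attenuation formula translates directly into sample-size inflation. The sketch below uses the standard Fisher z-transform approximation for the N needed to detect a correlation at 80% power; the inverse-CDF helper is a simple bisection routine (any standard implementation works), and the reliability values are the ones from the worked example above:

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate N to detect correlation r via the Fisher z-transform."""
    z_alpha = norm_ppf(1.0 - alpha / 2.0)
    z_beta = norm_ppf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

r_true = 0.5
r_attenuated = r_true * math.sqrt(0.5 * 0.8)  # brain ICC 0.5, behavior 0.8
print(n_for_correlation(r_true))        # N for the unattenuated effect
print(n_for_correlation(r_attenuated))  # N for what you actually observe
```

With these reliabilities, detecting the attenuated correlation takes well over twice the sample that the unattenuated effect would suggest — the practical cost of measurement unreliability.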
Practical Power Calculation Workflow
For a New fMRI Study
- Define the primary analysis: Whole-brain voxelwise or ROI-based?
- Estimate effect size:
- From pilot data (preferred): extract effect sizes from pilot activation maps
- From literature: find the most comparable published study; correct for publication bias by assuming the true effect is ~50-75% of the published estimate (Button et al., 2013)
- From meta-analysis: use NeuroSynth or BrainMap to estimate typical activation strength
- Choose the power analysis tool:
- ROI-based: Standard power calculation (G*Power) using the estimated effect size at the ROI
- Whole-brain: fMRIpower, NeuroPowerTools, or simulation
- Set target power: 80% (conventional) or 90% (recommended for costly neuroimaging studies)
- Account for attrition: Add 10-20% to planned N for participant exclusions due to excessive motion, incomplete data, or technical failures
- Report: Effect size source, power tool used, correction method, target power, final N
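For the ROI-based branch of this workflow, steps 2 through 5 can be chained in a short planning script. This is a sketch under stated assumptions: a single a priori ROI (so no multiple comparison correction), normal-approximation power, and illustrative mid-range choices of 60% shrinkage for the pilot effect and 15% attrition:

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def plan_sample_size(pilot_d, shrinkage=0.6, alpha=0.05, power=0.80,
                     attrition=0.15):
    """ROI-level planning: shrink the pilot effect (winner's curse),
    compute N for target power, then inflate for expected exclusions."""
    d = pilot_d * shrinkage                  # assume the pilot is inflated
    z = norm_ppf(1.0 - alpha / 2.0) + norm_ppf(power)
    n_analysis = math.ceil((z / d) ** 2)     # one-sample z-approximation
    n_recruit = math.ceil(n_analysis / (1.0 - attrition))
    return n_analysis, n_recruit

print(plan_sample_size(pilot_d=0.8))  # (analyzable N, N to recruit)
```

A pilot d of 0.8 shrinks to an assumed true d of 0.48, and the recruitment target exceeds the analyzable N by the attrition buffer — both adjustments the pitfalls section below warns against skipping.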
For a New EEG/ERP Study
- Estimate effect size: From pilot data or published ERP studies (see effect size table above)
- Determine trial count: At least 30 trials per condition post-rejection (Boudewyn et al., 2018)
- Plan for trial attrition: Assume 20-30% trial rejection rate; collect accordingly
- Subject-level power: Use G*Power with the estimated within- or between-subject effect size
- Account for subject attrition: Add 15-20% for exclusions due to excessive artifacts
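The trial-count arithmetic in steps 2 and 3 is simple but worth making explicit: to end with the post-rejection minimum, divide the target by the expected retention rate and round up.

```python
import math

def trials_to_collect(target_clean, rejection_rate):
    """Trials to record so the expected post-rejection count meets
    the target (e.g., 30+ clean trials per condition)."""
    return math.ceil(target_clean / (1.0 - rejection_rate))

# 40 clean trials per condition with a 25% expected rejection rate
print(trials_to_collect(40, 0.25))  # collect 54 trials per condition
```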
Common Pitfalls
- Using uncorrected power estimates for whole-brain analyses: A study with 80% power at p < 0.001 uncorrected has far less than 80% power after FWE or FDR correction (Mumford & Nichols, 2008)
- Ignoring effect size inflation in pilot studies: Small pilot studies produce inflated effect sizes due to the "winner's curse." Assume the true effect is 50-75% of the pilot estimate (Button et al., 2013)
- Applying behavioral power formulas to neuroimaging: Standard t-test power calculations dramatically overestimate power for whole-brain analyses because they ignore multiple comparison correction
- Not accounting for participant attrition: In fMRI, 10-20% of participants may be excluded due to motion, scanner artifacts, or incomplete data. Over-recruit accordingly
- Ignoring reliability for individual differences: Brain measures with ICC < 0.5 attenuate correlations, requiring much larger samples than traditional power analysis suggests (Elliott et al., 2020)
- Assuming published sample sizes are adequate: Most published fMRI studies are underpowered (median power ~20%; Button et al., 2013). Do not use published N as a benchmark
- Neglecting the impact of design efficiency: An optimized event-related design can be 2-3x more efficient than a suboptimal one (Dale, 1999), effectively increasing power without adding subjects
Minimum Reporting Checklist
- Target effect size and its source (pilot data, literature, meta-analysis)
- Effect size metric used (Cohen's d, r, partial eta-squared)
- Power analysis method (analytical, simulation-based, tool used)
- Target power level (typically 80% or 90%)
- Statistical test assumed (one-sample t, two-sample t, correlation, ANOVA)
- Multiple comparison correction method and parameters
- Planned N and justification
- Attrition allowance (expected exclusion rate)
- For simulation-based: number of simulations, pilot data source, software
- For reliability-dependent designs: reliability estimates and their source
References
- Birn, R. M., Molloy, E. K., Patriat, R., et al. (2013). The effect of scan length on the reliability of resting-state fMRI connectivity estimates. NeuroImage, 83, 550-558.
- Boudewyn, M. A., Luck, S. J., Farrens, J. L., & Kappenman, E. S. (2018). How many trials does it take to get a significant ERP effect? Psychophysiology, 55(6), e13049.
- Button, K. S., Ioannidis, J. P. A., Mokrysz, C., et al. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.
- Cassidy, S. M., Robertson, I. H., & O'Connell, R. G. (2012). Retest reliability of event-related potentials: Evidence from a variety of paradigms. Psychophysiology, 49(5), 659-664.
- Cohen, M. X. (2014). Analyzing Neural Time Series Data: Theory and Practice. MIT Press.
- Dale, A. M. (1999). Optimal experimental design for event-related fMRI. Human Brain Mapping, 8(2-3), 109-114.
- Desmond, J. E., & Glover, G. H. (2002). Estimating sample size in functional MRI (fMRI) neuroimaging studies. Journal of Neuroscience Methods, 118(2), 115-128.
- Durnez, J., Degryse, J., Moerkerke, B., et al. (2016). Power and sample size calculations for fMRI studies based on the prevalence of active peaks. bioRxiv, 049429.
- Eklund, A., Nichols, T. E., & Knutsson, H. (2016). Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. PNAS, 113(28), 7900-7905.
- Elliott, M. L., Knodt, A. R., Ireland, D., et al. (2020). What is the test-retest reliability of common task-functional MRI measures? Biological Psychiatry, 87(11), 934-948.
- Genovese, C. R., Lazar, N. A., & Nichols, T. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage, 15(4), 870-878.
- Hayasaka, S., Peiffer, A. M., Hugenschmidt, C. E., & Laurienti, P. J. (2007). Power and sample size calculation for neuroimaging studies by non-central random field theory. NeuroImage, 37(3), 721-730.
- Joyce, K. E., & Hayasaka, S. (2012). Development of PowerMap: A software package for statistical power calculation in neuroimaging studies. Neuroinformatics, 10(4), 351-365.
- Luck, S. J. (2014). An Introduction to the Event-Related Potential Technique (2nd ed.). MIT Press.
- Marek, S., Tervo-Clemmens, B., Calabro, F. J., et al. (2022). Reproducible brain-wide association studies require thousands of individuals. Nature, 603(7902), 654-660.
- Mumford, J. A., & Nichols, T. E. (2008). Power calculation for group fMRI studies accounting for arbitrary design and temporal autocorrelation. NeuroImage, 39(1), 261-268.
- Nichols, T. E., & Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: A comparative review. Statistical Methods in Medical Research, 12(5), 419-446.
- Poldrack, R. A., Baker, C. I., Durnez, J., et al. (2017). Scanning the horizon: Towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18(2), 115-126.
- Smith, S. M., & Nichols, T. E. (2009). Threshold-free cluster enhancement. NeuroImage, 44(1), 83-98.
- Thirion, B., Pinel, P., Meriaux, S., et al. (2007). Analysis of a large fMRI cohort: Statistical and methodological issues for group analyses. NeuroImage, 35(1), 105-120.
- Worsley, K. J., Marrett, S., Neelin, P., et al. (1996). A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4(1), 58-73.
See references/ for detailed simulation examples and effect size lookup tables.