Comments on Sample Size and Power of an Experiment

Apr 7, 2023

By L. Zack Florence

Motivation

Research experiments are motivated by many factors, including the researcher’s objectives, funding, and physical and staff resources. Planning an experiment and creating the experimental design are the preferred ways to manage, as well as possible, the experimental errors, e.g. sampling error, instrument error, human error, measurement error, and decision errors. The design of an experiment is the best form of due diligence because the design defines the statistical model to which the data will be fit. The estimates of the treatment effects with respect to the null hypothesis (Ho: no effect) defined during planning can be judged by setting levels for the two major decision criteria, the Type 1 and Type 2 error rates, when assessing an experiment’s result. Table 1 shows the decision matrix, which defines the relationships among four statistical outcomes.

When an experiment uses animals the Three Rs should be applied: replacement, reduction, and refinement (Canadian Council on Animal Care: https://www.ccac.ca/).

Table 1. Type 1, Type 2 errors and power of an experiment.

Adapted from: Zar, J.H. Biostatistical Analysis (1999).

In Table 1, if Ho is rejected at P ≤ 0.05, the treatment effect is deemed significant with a 5% or less chance of a Type 1 error (a false positive). Therefore, I can conclude that I have made the correct decision, with a 1 in 20 (5%) chance that I was wrong. My confidence in my decision is ≥ 95%. A Type 2 error, a “false negative”, occurs when I should have rejected Ho but failed to do so; its probability is β. The power of the experiment is 1 − β. Customarily, β is set at 0.20 (20%), which gives a power of 0.80. Murphy and Myors (2004) define power as “the probability that a study will reject the null hypothesis when it is in fact false”. Stated another way, if the effect is real, I am 80% certain I will not make a Type 2 error.

The figure above, using standard normal deviates on the X-axis, demonstrates the concept of the null hypothesis (Ho, at 0) vs the alternative (Ha, at 2), which contains the effect of treatment(s) equal to or greater than the “effect size”; effects may be positive or negative.

How many samples should be used in an experiment? If there is previous knowledge or a rich literature, the need for detailed analysis to determine the number of samples is less of an issue. However, there will be times when there is little information, and planning an experiment then needs more attention and reliance upon certain assumptions. With today’s computing power, it is very easy to simulate almost any experiment to perform “what if” analyses. One analytical approach that is often recommended is power analysis. The following four quantities are closely related in that exercise; fixing any three determines the fourth:

  1. Sample size: n, the number of experimental subjects/units.
  2. Effect size: for mean effects, the basic form is the standardized mean difference, (x̅1 − x̅2) / s, where s is the standard deviation among the experimental subjects, simple or pooled.
  3. Significance level = P(Type I error) = α = the probability of detecting an effect that is not true, i.e. rejecting the null hypothesis (Ho) when in fact it is true.
  4. Power = 1 − P(Type II error) = 1 − β, where β is the probability of failing to detect an effect that is true.
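The trade-off among these four quantities can be sketched with the usual normal approximation for a test of a single mean, n ≈ ((z_α + z_β) / d)², where d is the standardized effect size. This is only an illustration: the function name and defaults are assumptions made here, and because it ignores the t-distribution correction it tends to give answers slightly below those of dedicated software when n is small.

```python
from math import ceil
from statistics import NormalDist


def approx_sample_size(effect_size, alpha=0.05, power=0.80, two_sided=True):
    """Approximate n for a one-sample (or paired) test of a mean.

    Normal approximation: n = ((z_alpha + z_beta) / d)^2.
    effect_size is the standardized difference d = (mean shift) / sd.
    """
    z = NormalDist()                      # standard normal distribution
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)         # critical value for the Type 1 error
    z_beta = z.inv_cdf(power)             # power = 1 - beta
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)


# "What if" table: smaller standardized effects need many more subjects.
for d in (0.25, 0.50, 0.80, 1.00):
    print(f"d = {d:.2f} -> n ~ {approx_sample_size(d)}")
```

Halving the effect size roughly quadruples the required n, which is why the choice of effect size usually dominates the planning.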

If there is an archive of work by a scientist using similar strains of animals, experimental plots, habitats and so forth, then the historical variation, sizes of expected effects and sample sizes make planning much easier, and power analysis is less critical. Simulations with historical data serve well for “what-if” experiments during planning. However, because effect size and the number of samples needed are often the most critical factors in design, power analysis can be extremely worthwhile. Cohen (1988) is often cited as a good starting point for power analysis; more recently, Zar (1999) and Murphy and Myors (2004) also provide guidance with examples.

Computing Resources

There is no lack of software resources to perform power analysis and/or simulations during project planning. For example:

· Methods in R statistical software may be found here: https://www.statmethods.net/stats/power.html (also includes notes and formulae).

· http://www.biostathandbook.com/power.html

· G*Power free software: http://www.gpower.hhu.de/

· Code in Microsoft Excel is online to perform power analysis: http://www.real-statistics.com/sampling-distributions/statistical-power-sample/

· G*Power: https://en.wikipedia.org/wiki/G*Power

· One source that this author has used for many years is MINITAB. Originally developed in the early 1970s, it has over time developed a strong reputation in QA/QC, statistical analysis and experimental planning.

Examples

1. One-sample t-test: I wish to find the number of male animals (n) required to detect a shift ≥ 2 White Blood Cell (WBC) units (1 WBC unit = 1 × 10⁹/L) from the population mean when α = 0.05 and β = 0.20, i.e. power ≥ 0.80.

The mean (x̅) and standard deviation (sd) of white blood cells in untreated male Sprague-Dawley (S-D) rats were found in the literature: Lillie et al. (1996) published data for 16 hematological and 22 clinical chemistry parameters. For males, the mean and sd for WBC were x̅ ± sd = 7.84 ± 2.95. Figure 1 displays the results for the 2-sided outcomes. It seems I would need at least 20 males to detect a shift of 2 WBC units, or an effect of ± 26% from the mean, at P ≤ 0.05 with β ≤ 0.20 (power ≥ 0.80). With n = 20, the test has the power to detect a shift of less than one sd (2/2.95 ≈ 0.68 sd).

Figure 1. Power curve for detecting a shift of at least ± 2 WBC units (2 × 10⁹/L) from the mean in untreated S-D rats. Analysis in MINITAB.
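As a rough cross-check of the n = 20 read from the power curve, the experiment can also be simulated, in the spirit of the “what if” analyses mentioned earlier. The sketch below is not the MINITAB calculation. It assumes normally distributed WBC values with the literature mean and sd, a true shift of exactly 2 units, and a hard-coded two-sided 5% critical value of t for 19 degrees of freedom (2.093); all names are illustrative.

```python
import random
from math import sqrt

MU0, SD = 7.84, 2.95   # WBC mean and sd for males, Lillie et al. (1996)
SHIFT = 2.0            # true shift assumed in the simulation, in WBC units
N = 20                 # sample size suggested by the power curve
T_CRIT = 2.093         # two-sided 5% critical value of t, 19 df (valid for N = 20 only)


def simulated_power(n_sims=10_000, seed=1):
    """Fraction of simulated experiments that reject Ho: mean = MU0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(MU0 + SHIFT, SD) for _ in range(N)]
        m = sum(sample) / N
        s = sqrt(sum((x - m) ** 2 for x in sample) / (N - 1))
        t = (m - MU0) / (s / sqrt(N))      # one-sample t statistic
        if abs(t) > T_CRIT:
            rejections += 1
    return rejections / n_sims


print(f"Estimated power with n = {N}: {simulated_power():.2f}")
```

Under these assumptions the estimate should land close to the 0.80 target; changing N also requires changing T_CRIT to the critical value for N − 1 degrees of freedom.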

2. Paired t-test: Untreated male S-D rats were to be assayed for their WBC. Each was then to be challenged with an experimental antibiotic and resampled following 24 hr. How many males will be needed to detect an effect, up or down, of 0.5 sd (0.5 x 39) in WBC?

Figure 2 provides the results. I will need at least 39 male S-D rats to attain a power ≥ 0.80 for the 2-sided outcomes.

Figure 2. Power curves for paired-t samples required to detect an average difference in WBC of at least ± 0.5 x 39 with power ≥ 0.80.

3. Testing means of two independent samples: 10 S-D male rats and 10 S-D females are assayed for lymphocytes (absolute, × 10⁹ L⁻¹; see Lillie et al. 1996). The means were estimated to be 6.88 ± 2.63 and 4.45 ± 2.29, respectively. I compared the means and found the difference to be 2.43: males had higher concentrations than the females.

The power of the test was only 0.60 when using 10 + 10 = 20 animals, so the Type 2 error was 1 − 0.60 = 0.40. Figure 3 summarizes the results, showing that in order to achieve a power ≥ 0.80 I would have needed at least 17 animals of each sex, a total of 17 + 17 = 34 S-D rats.

Figure 3. Having a mean difference of ± 2.43 units of absolute lymphocytes would require at least N=17 males +17 females= 34, to attain a power of my test of 0.80 or more.
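The figures in this example can be roughly reproduced with a normal approximation. This is a sketch rather than the MINITAB procedure: it assumes equal group sizes, pools the two standard deviations, and ignores the small-sample t correction, so exact software may differ slightly. The function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()                # standard normal distribution

MEAN_M, SD_M = 6.88, 2.63       # males, absolute lymphocytes
MEAN_F, SD_F = 4.45, 2.29       # females
ALPHA = 0.05                    # two-sided significance level

pooled_sd = sqrt((SD_M ** 2 + SD_F ** 2) / 2)   # equal n in both groups
d = (MEAN_M - MEAN_F) / pooled_sd               # standardized effect size
z_alpha = Z.inv_cdf(1 - ALPHA / 2)


def approx_power(n_per_group):
    """Approximate power of the two-sample test with n animals per group."""
    return Z.cdf(d * sqrt(n_per_group / 2) - z_alpha)


def approx_n_per_group(power=0.80):
    """Approximate animals needed per group to reach the requested power."""
    return ceil(2 * ((z_alpha + Z.inv_cdf(power)) / d) ** 2)


print(f"effect size d = {d:.2f}")
print(f"power with 10 per group ~ {approx_power(10):.2f}")
print(f"n per group for power 0.80 ~ {approx_n_per_group()}")
```

With the values quoted from Lillie et al. (1996), this approximation gives a power of about 0.60 for 10 animals per sex and 17 per sex for a power of 0.80.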

4. Estimating frequencies or proportions: The Offshore Killer Whale (OKW) population in BC waters is at risk for several reasons (DFO 2014). The Potential Biological Removal (PBR) of 0.55 animals/year indicates that anthropogenic mortality could be a limiting factor toward sustaining the population, even given survival rates of 0.98 (95% CI= 0.92–0.99). I am interested in asking this question: How many animals would I need to sample to detect a decline in survival below 0.92 (92%) with power=0.80 or greater? The resident population is estimated at ~300 animals.

The results shown in Figure 4 tell us that if I retraced the usual transects during a pre-determined time period in the offshore waters, I would need to detect at least 68 whales to feel comfortable that the population is not decreasing below 92% of the total. I could then conclude that the population is likely maintaining itself at ≥ 276 animals. That means sampling ~23% (68 of ~300) of the total known population in each survey.

Figure 4. Results of power analysis of a 1-sided test to determine the number of BC whales to sample and detect a decrease in numbers below 92% of the known population.

5. More complex statistical designs. Power analysis can be applied to all experimental designs, e.g. 1-way ANOVA, factorial, correlation, regression and so forth, although multivariate analysis may present some special challenges. The main ingredients are choosing an effect size and a reasonable “guesstimate” of the experimental error. An example for a regression analysis would simply be: what strength of association (the correlation coefficient, r) do I want to detect? You can begin by asking: how much of the total variation do I want explained by my model, i.e. r² (the coefficient of determination)? For example, if r = 0.50, then r² = 0.25 or 25%.

In the simple bivariate case, the number of samples would hold for both regression and correlation.
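For the simple bivariate case, one common approximation uses Fisher's z transformation of r: n ≈ ((z_α + z_β) / atanh(r))² + 3. The sketch below applies it to a two-sided test of Ho: ρ = 0. It is an approximation chosen here for illustration, not a method taken from the references, and exact software may differ by a sample or so; the function name is illustrative.

```python
from math import atanh, ceil
from statistics import NormalDist


def approx_n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n to detect a correlation of size r (two-sided test of rho = 0)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # Fisher's z = atanh(r) has standard error 1 / sqrt(n - 3)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)


for r in (0.30, 0.50, 0.70):
    print(f"r = {r:.2f}, r^2 = {r * r:.2f} -> n ~ {approx_n_for_correlation(r)}")
```

For r = 0.50 (r² = 0.25), the approximation suggests about 30 samples; weaker associations require considerably more.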

More about the author can be found here: https://sites.google.com/site/zfconsulting99/home .

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Pub. Lawrence Erlbaum Associates, USA.

DFO. 2014. Recovery Potential Assessment of Offshore Killer Whales off the Pacific Coast of Canada. DFO Can. Sci. Advis. Sec. Sci. Advis. Rep. 2014/047.

Lillie, L.E., N.J. Temple and L.Z. Florence (1996). Reference values for young normal Sprague-Dawley rats: weight gain, hematology and clinical chemistry. Human and Experimental Toxicology 15: 612–616.

Minitab Statistical Software: http://www.minitab.com/en-us/

Murphy, K.R. and B. Myors (2004). Statistical Power Analysis, 2nd ed. Pub. Lawrence Erlbaum Associates, USA.

Zar, J.H. (1999). Biostatistical Analysis, 4th ed. Pub. Prentice Hall, USA.

Graph: https://en.wikipedia.org/wiki/File:Statistical_test,_significance_level,_power.png ; Lisa Sullivan, PhD, Professor of Biostatistics, Boston University School of Public Health: https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_power/bs704_power_print.html
