
3 The theory
3.1 Types of data
Data can be qualitative or quantitative. Examples of qualitative data include
nominal: these are categories whose order does not matter, e.g., wild type, knockout, heterozygous, treated, untreated, young, old etc.
ordinal: these are categories where the order matters (i.e., ordered nominal). These are used in surveys (very good, good, bad, very bad), pain scales (1–5), happiness scales, etc.
binary: where only two outcomes are possible, e.g., live or dead, wild type or genetically altered, infected or uninfected. In microbiological MIC (minimum inhibitory concentration) determination, we typically record growth/no growth endpoints at defined antibiotic concentrations; even though antibiotics are tested at 2-fold concentration gradients, the measured outcome is binary (resistant or sensitive based on antibiotic breakpoints).
The examples in Figure 3.1, from the beautiful art of Allison Horst, make the point very well!

Quantitative data can be discrete (integers, e.g., number of mice) or continuous (e.g., cytokine level in serum). Quantitative data can also be intervals (e.g., difference between two temperatures) and ratios (e.g., fold-changes, percentages). This book deals mainly with continuous variables and the normal or log-normal distributions. These are the kinds of data we collect in most laboratory experiments and are the focus of this document. I only cover analyses of quantitative data using parametric tests. I plan to add other kinds of analyses in the future. Almost all parametric tests have non-parametric variants or alternatives, which I don’t discuss here. If you use non-parametric tests, it may be more appropriate to plot data as median and interquartile range rather than mean and SD.
3.2 Probability distributions
A probability distribution is the theoretical or expected set of probabilities or likelihoods of all possible outcomes of a variable being measured. The sum of these values equals 1, i.e., the area under the probability distribution curve is 1. The smooth probability distribution curve is defined by an equation. It can also be depicted as bar graphs or histograms of the frequency (absolute number) of outcomes grouped into “bins” or categories/ranges for convenience. Just as data can be of different types, probability distributions that model them are also of different types.
Continuous variables that we come across in our experiments are often modelled on the normal distribution. But let us start with a simpler distribution – the binomial distribution – to understand the properties and uses of probability distributions in general.
3.2.1 Binomial distribution
Binomial distributions model variables with only 2 outcomes. The likelihood of one outcome versus the other can be expressed as a proportion of successes (e.g., in the case of a fair coin toss, the probability of getting a head is 0.5). Some examples in microbiology and immunology with binary outcomes include:
classification of bacterial strains as resistant/sensitive based on antibiotic breakpoints
susceptibility of a host to a pathogen, expressed as the proportion of individuals that become infected versus those that remain uninfected
proportional reduction of infected cases after administration of an effective vaccine
mating of heterozygous mice carrying a wild type and knockout allele of a gene, where we theoretically expect 25 % of the litter to be homozygous knockouts and 75 % to not be knockouts.
We use the framework of the binomial distribution to perform calculations and predictions when encountering such data, for planning experiments or answering questions such as: Is the proportion of resistant strains changing over time? Has the vaccine reduced the proportion of infected individuals? Does the loss of a gene affect the proportion of homozygous knockout pups that are born?
To understand the distribution of binomial outcomes, let us take a simple example of head/tail (H/T) outcomes in the toss of a fair coin (probability of H or T = 0.5). This is an example where each outcome has an equal probability, but this is not a requirement for binomial outcomes. Let’s say we toss a coin 5 times, which has 2^5 = 32 possible outcomes with combinations of H or T. We can plot a histogram of the probabilities of observing exactly 0, 1, 2, 3, 4 or 5 heads to draw the binomial distribution.
These likelihoods are plotted in the histogram shown in Figure 3.2. The X axis shows every outcome of this experiment; these are also called quantiles of the distribution. The Y axis is the likelihood or probability – the area under the probability distribution curve adds up to 1. A binomial distribution with equal outcome probabilities (p = 0.5) is symmetric, and not surprisingly in this case, 2 or 3 heads are the most likely outcomes!
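These probabilities can be reproduced with R’s `dbinom` function; a minimal sketch (the variable names are mine):

```r
# Probabilities of observing exactly 0-5 heads in 5 tosses of a fair coin
probs <- dbinom(0:5, size = 5, prob = 0.5)
names(probs) <- paste0(0:5, "H")
probs        # 1/32, 5/32, 10/32, 10/32, 5/32, 1/32
sum(probs)   # 1: the whole area under the distribution
```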
For example, there is only 1 combination that gives us exactly 0 heads (i.e., when all tosses give tails; likelihood of 1/32 = 0.03125) or 5 heads (likelihood 1/32 = 0.03125). More generally, the number of combinations in which exactly k heads occur in n tosses is given by the binomial coefficient (n choose k).
It is easy to see a couple of useful features of probability distributions:
Calculating the likelihood of combined outcomes: the probability of obtaining up to 2 heads is the sum of the probabilities of obtaining 0, 1 or 2 heads = 0.03125 + 0.15625 + 0.3125 = 0.5. This is the area under the lower tail of the distribution up to the quantile 2. Because of the symmetry of the curve, the chance of obtaining more than 2 heads is the same as 1 - the probability of up to 2 heads (1 - 0.5); this is the same as the area under the upper tail for the quantile 3 or higher.
Features of the tails of a distribution: the tails contain the less likely or rare outcomes or outliers of the distribution. For example, it is possible, but quite unlikely, to get 0 heads when you toss a coin 5 times. However, if you do get such an outcome, could you conclude that the coin may be biased, i.e., for that coin the chance of H/T is not 0.5?
The framework of probability distributions helps us define criteria that help answer these kinds of questions. But first we need to define the significance level (α), which by convention is often set at 5 % (0.05).
With 5 coin tosses, we see that the chance of obtaining 0 or 5 heads is 0.03125 + 0.03125 = 0.0625 (6.25 % chance of 0 or 5 heads). From the distribution we see that values outside of the tails have a much higher probability: 1 - 0.0625 = 0.9375, or a 93.75 % chance. In other words, the quantiles of 0 and 5 mark the thresholds at a 6.25 % significance cut-off (3.125 % equally in each tail). We could call our coin fair at this significance level if we observe between 1-4 heads in 5 tosses. This is a process we routinely use in null-hypothesis significance testing (NHST) – assume the null distribution and find outliers at a pre-defined significance threshold.
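The same decision can be reached with R’s exact binomial test; a quick check using the numbers above:

```r
# Observing 0 heads in 5 tosses of a putatively fair coin
res <- binom.test(x = 0, n = 5, p = 0.5)
res$p.value   # 0.0625, the combined area of both tails
# 0.0625 > 0.05, so at the conventional 5 % level we could not
# call the coin biased from 5 tosses alone
```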
As we will see in the section on power analysis, the sample size (here, the number of tosses) determines how small a deviation from the null we can reliably detect.
3.2.2 Examples of binary outcomes
Let us now consider examples of binomial outcomes in microbiology and immunology.
3.2.2.1 Antibiotic resistance
Consider results from surveillance that found that 30 % of E. coli strains isolated at a hospital in 2020 were multi-drug resistant and the rest (70 %) were sensitive to antibiotics. This year, 38 out of 90 strains were multi-drug resistant. We want to know whether the rate of resistance has changed year-over-year at a 5 % significance cut-off.
We start by assuming the rate is the same (the null hypothesis) – if the rate were the same, we would expect 27 resistant strains (30 % of 90). This is a two-tailed hypothesis, i.e., the proportion of resistant strains could increase or decrease from 30 %. The critical quantiles at 5 % significance for a binomial distribution of size 90 and an expected proportion of resistant strains of 0.3 are 19 and 36 (obtained with the `qbinom` function in R). Figure 3.3 shows the areas in the tails (0.025 probability each), marked at one less than the critical values.
We note that the quantile 38 lies in the upper tail of the null distribution, beyond the critical value of 36, so we reject the null hypothesis at the 5 % significance level: the rate of resistance appears to have changed.
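A sketch of these calculations in R, using the numbers from the example (90 strains, null proportion 0.3, 38 observed resistant):

```r
# Critical quantiles of the null binomial distribution at 5 % significance
crit <- qbinom(c(0.025, 0.975), size = 90, prob = 0.3)
crit                 # the critical values quoted above
# Exact two-tailed test of the observed count against the null proportion
res <- binom.test(x = 38, n = 90, p = 0.3)
res$p.value < 0.05   # TRUE: 38 is an outlier of the null distribution
```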
3.2.2.2 Mating of heterozygous mice
Next consider this mating question we may encounter in immunology: when mice heterozygous for a new allele of a gene are mated, we expect 25 % of pups to be homozygous for the new allele. The lower critical quantile at 5 % significance for the corresponding binomial distribution is obtained with `qbinom`. As 7 is less than this quantile, we could reject the null hypothesis that the observed proportion of homozygous pups is consistent with the expected 25 %.
Just like we used the framework of binomial probability distributions to make an inference, in sections below we will discuss other distributions, but the principles remain similar.
The thresholds (quantiles) for binomial distributions are obtained in R using the function `qbinom`, and the related functions `qbeta` (from the beta distribution), `qnorm` (normal distribution) and `qt` (t distribution). The inverse, i.e., finding the area under the curve up to a given quantile, is done with the corresponding functions: `pbinom`, `pbeta`, `pnorm` and `pt`.
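The `q` and `p` functions are inverses of each other, which is easy to verify:

```r
# qnorm finds the quantile (critical value) for a given lower-tail area;
# pnorm does the reverse and recovers the area from the quantile
q975 <- qnorm(0.975)
pnorm(q975)                  # 0.975
# pbinom gives the area up to a quantile of a binomial distribution,
# e.g. the chance of up to 2 heads in 5 tosses of a fair coin
p_upto_2 <- pbinom(2, size = 5, prob = 0.5)
p_upto_2                     # 0.5
```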
3.3 The normal distribution
The smooth ideal normal distribution – also called the Gaussian distribution – of probability densities is defined by an equation that we (thankfully!) rarely encounter (you can find the equation here). The mean and standard deviation (SD) are the parameters of the normal distribution that determine its properties. Statistical tests that depend on these parameters of our data (mean and SD) utilise features of the normal distribution we discuss below, and are called parametric tests!
The equation of the normal probability distribution does not take into account the number of observations – it is an ideal smooth curve of the likelihoods of all possible outcomes, assuming a large number of observations (this is analogous to the theoretical chance of H or T in a toss of a fair coin being 0.5; in reality, this can be measured only when a coin is tossed many, many times).
A corollary is that the SEM (standard error of the mean), which takes sample size into account, is not a parameter of the distribution. Let’s first start with a microbiology example to dig deeper into the normal distribution.
The ideal normal distribution is perfectly symmetric around the mean (which is also its median and mode) and its shape is determined by the standard deviation. Most likely values of the variable are distributed near the mean, less likely values are further away and the least likely values are in the tails.
In general, parametric tests are robust and perform well even with approximately normal distributions, which are more common in the real world. They are also generally more powerful than their non-parametric counterparts.
3.3.1 The E. coli doubling time example
I’ll use a simple microbiology experiment – finding the doubling time (also called generation time) of Escherichia coli – to illustrate how, by knowing just the mean and SD and using the equation of the normal probability distribution, we can predict the likelihood of observing any range of theoretical values in this population. This theoretical population can be imagined as the doubling times of all E. coli that ever lived – for this example, we will assume the doubling time in Luria broth is exactly 20 min with an SD of exactly 2 min. In the real world, we will never actually know these parameters for the population of all E. coli. This is often also the case for many variables we measure in the laboratory! To approximate the ideal normal distribution, below I provide an example of 5000 simulated doubling times (a fairly large number of observations – note that large clinical trials can include thousands of individuals!).
The graph on top in Figure 3.4 shows 5000 doubling times of our population of E. coli. The histogram (middle graph), with the frequency distribution on the Y axis and bins (not labelled) of doubling times on the X axis, reveals the central tendency of the population. The graph at the bottom is a smooth fit to the perfect normal distribution.
Most values of the doubling time are near the mean of 20 min (the most frequent or most likely observation), but we do have outliers in the tails. Just as in the simple case of the binomial distribution we discussed above, the probability or P values for a range of doubling times on the X axis are calculated as the area under the curve, for example, the probability of observing doubling times up to 15 min or higher than 25 min.
By knowing (just) the mean and SD, the shape of the probability distribution can be plotted using an equation. These two numbers are the parameters of the distribution, and tests based on these and related distributions are called parametric tests.
The normal distribution has key features that are useful for comparisons and predictions: ~68 % of all values are within ±1SD of the mean, ~95 % within ±2SD (more precisely, ±1.96SD) and ~99.7 % within ±3SD. For our doubling-time example (mean 20 min, SD 2 min), ~95 % of values therefore lie between 16 and 24 min.
So, values at least as low as 16 or lower have a likelihood of 0.025 (area under the lower tail), and values at least as high as 24 or higher have a likelihood of 0.025 (area under the upper tail). Here, 16 and 24 are the critical values at the 5 % significance level.
By convention, the 5 % of samples that lie outside the ~±2SD range are considered unlikely enough to be treated as significant outliers.
This property of distributions and the likelihood of rare values, i.e., values that are in the “tails” of the distribution, are the bases for statistical inferences.
We can also choose to be more stringent and consider the 1 % of samples outside the ~±2.6SD range as our threshold for rare values.
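These cut-offs can be computed with `qnorm` and `pnorm` for the doubling-time population (mean 20 min, SD 2 min):

```r
# Critical values at the 5 % level (2.5 % in each tail): ~16.1 and ~23.9 min
crit5 <- qnorm(c(0.025, 0.975), mean = 20, sd = 2)
# Stricter 1 % level (0.5 % in each tail): ~14.8 and ~25.2 min
crit1 <- qnorm(c(0.005, 0.995), mean = 20, sd = 2)
# Fraction of the population within ~2SD of the mean
pnorm(24, mean = 20, sd = 2) - pnorm(16, mean = 20, sd = 2)   # ~0.954
```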
3.3.2 Sample (experimental) means
In the above section we were discussing the mean and SD of the population of all E. coli. But in reality, we will never know these parameters for E. coli doubling times (or any other quantitative variable in general). The experiments we perform in the laboratory to obtain doubling times are samples from the theoretical population of E. coli. Our goal is to obtain representative, accurate and independent samples to describe the true population parameters. Therefore, in statistical terminology, the experimental data we collect give us sampling distributions rather than population distributions. Well-designed experiments executed with accuracy will give us values of doubling times – these are sample means, which we hope are close to the true population mean (i.e., ~20 min) and SD (~2 min).
To reiterate, we will never know the true doubling time of all E. coli ever grown in laboratory conditions (or any other variable we are measuring) – we can only estimate it by performing experiments that sample the population and are truly representative of the population! The principles of distributions of samples will apply to our experimental data. Sample sizes determine experimental reproducibility, which we discuss in the section on power analysis.
Here is another way to think of the example of experiments and samples: let’s imagine that numbers corresponding to all doubling times of the population are written on balls and put in a big bag from which you draw independent, accurate and representative balls – the number of balls drawn is the sample size
Statistical methods based on principles of the normal distribution above can help answer the questions on probabilities, such as:
What is the likelihood that the three numbers you draw have a mean and SD same as those of the population?
What is the chance of drawing numbers at least as extreme as 10 or 30 min (~5SD away)?
What could you infer if 3 independent draws give you a sample mean of 30 and SD of 2?
If 10 different students also drew 3 independent samples, will they get the same result as you?
Should your sample size be larger than 3?
It is often impossible to know the mean and SD of the entire population. We carry out experiments to find these parameters by drawing random, representative, independent samples from the population.
3.3.3 Real-world example of sample means
In this section I want to demonstrate the variability arising from the underlying mathematics of probabilities that affect sampling (i.e., experiments), and how this affects reproducibility and accuracy of our findings.
Let’s say a student measures E. coli doubling times on 3 separate occasions, i.e., in statistically independent, accurate, representative experiments using different batches of E. coli cultures performed on three separate days. This is an ideal student – someone who measures with perfect accuracy and makes no mistakes (here we only want to account for randomness from the underlying variability inherent in the variable we are measuring – the doubling time and its mean and SD; we will ignore operator/experimenter or instrumental inaccuracies). Let’s also say that 9 more ideal students perform the same 3 independent experiments.
From the “drawing balls from a bag” analogy, each of the 10 students will draw 3 independent numbers from the big bag containing the population of doubling times. We expect all 10 studies to find the true mean of 20 min and SD of 2 min. Let’s see if this actually happens in our simulations!
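Here is a minimal sketch of such a simulation (the seed and variable names are my choices, not the book’s code):

```r
set.seed(42)   # arbitrary seed, so the sketch is reproducible
# 10 students, each drawing 3 independent doubling times from the population
draws        <- replicate(10, rnorm(3, mean = 20, sd = 2))
sample_means <- colMeans(draws)
sample_sds   <- apply(draws, 2, sd)
round(sample_means, 1)   # scatter around the true mean of 20 min
round(sample_sds, 1)     # scatter around the true SD of 2 min
```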
The results of the simulations are shown in Figure 3.6 (simulated with `rnorm`, which generates random numbers from a distribution of a given mean and SD). The results are plotted as the 3 sample values (circles) and their sample mean (square), with error bars representing the 2SD range of the sampling distribution; the horizontal line at 20 min is the expected population mean. Recall that in a normal distribution, 95 out of 100 values are within ~2SD of the mean.
What is also evident is the large variability in the 2SD intervals. This was not because our students performed experiments badly or made inaccurate measurements - this happened because of the underlying mathematics of probabilities and sample sizes. Unlike this ideal scenario, laboratory-based variability from inaccuracies in measurements will add more uncertainty to our sample (experimental) means & SD.
This example illustrates that the lack of reproducibility can occur even without operator error because of the small sample size and large noise (SD).
As you might have guessed, the 10 students might have had more similar means and SDs if they had performed more than 3 independent experiments each, i.e., used larger sample sizes.
3.3.4 Independent sampling
In statistics “independent sampling” or “independent experiments” are mathematical concepts. We often say ‘biologically independent’ but do not fully appreciate what we actually need when performing experiments.
Let’s take an example of a clinical study testing a new drug. There are several instances where two individuals in a study may not be representative independent units of the population (e.g., all humans for whom the study results should apply).
The example of election polling is perhaps the easiest way to illustrate the importance of accurate, representative, independent sampling. Polling surveys of voting intentions should include people from different groups, communities, cities etc. at proportions that reflect the composition of the whole country; otherwise surveys will be biased and their predictions will be inaccurate. In other words, the survey will not represent the true population.
What about clinical experiments involving human participants or laboratory experiments with mice? Two different individuals may be considered biologically different or distinct, but if these individuals are too similar to each other (e.g. similar age, sex, bodyweight, ethnicity, race, diets etc.) they may not be representative samples of the whole population. Similarly, inbred mice sourced from the same company, housed together in a cage, fed the same chow, water etc. may not necessarily be independent experimental units in every kind of experimental design.
In tissue culture-based experiments, replicate wells in a multi-well tissue culture plate containing a cell line passaged from the same culture dish may be biologically different wells, but they are not independent in most experimental scenarios.
The independence of observations/measurements we make is very important and you must be careful to recognise truly independent measurements and technical replicates in your experiments (see section on technical and biological replicates below).
Based on the “polling” example, two flasks of LB inoculated from the same pre-inoculum of E. coli and measured on the same day will not be statistically independent – these two flasks would be “individuals from the same household” or “too similar to each other” to represent the true population.
To summarise, most experiments we perform in the laboratory are samples drawn from a hypothetical ideal population. We must ensure our experiments give us independent, representative, and accurate samples from the population. For statistical comparisons, even though we typically think of two individuals as biologically different or distinct or independent, what we need are samples that are truly independent and representative.
3.4 Other distributions – Z, t, F
Everything we discussed above related to the mean and SD, and likelihood of outliers, of a single parameter we are interested in, e.g., the doubling time of a strain of E. coli. However, in general, we are interested in comparing groups, e.g., using Student’s t tests or ANOVAs. For parametric tests, which are generally more powerful, we rely on the same principle of probability distributions. The main difference is that we will first calculate a test statistic from our data, and assess whether this statistic is an outlier of the null distribution.
We will consider three test statistics and their distributions:
Z (related to the normal distribution): the Z distribution is a variant of the normal distribution. It assumes a large number of observations, and its shape does not change with sample size.
t: this is a family of distributions used for Student’s t tests at various sample sizes (degrees of freedom). It is symmetric with a mean of 0 (which is the null hypothesis). These are used to infer the probability of observing a value at least as extreme as the t statistic calculated from data, assuming the null hypothesis is true.
F: these are a larger family of distributions, used in ANOVA, that are based on variances. F values are therefore always positive.
3.4.1 Z distribution
The Z distribution is the standard normal distribution, with a mean of 0 and an SD of 1. For a given data set, the `scale` function in R generates Z scores. By subtracting the overall mean from all values and dividing by the SD, we get a Z score for each value. We can generate a standardised distribution of the population of E. coli doubling times by subtracting the mean (20 min) from each value in Figure 3.4 and dividing by the SD (2 min). This is shown in Figure 3.7. These Z scores have a mean of 0 and an SD of 1.
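A small sketch of standardisation with `scale`, which always yields Z scores with a mean of 0 and an SD of 1 (simulated data and variable names are mine):

```r
set.seed(1)   # arbitrary seed for reproducibility
doubling_times <- rnorm(5000, mean = 20, sd = 2)
# scale() subtracts the mean and divides by the SD
z <- as.numeric(scale(doubling_times))
# identical to doing it by hand
z_manual <- (doubling_times - mean(doubling_times)) / sd(doubling_times)
c(mean = mean(z), sd = sd(z))   # ~0 and 1
```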
3.4.1.1 Applications of Z scores
The Z distribution assumes a large number of observations, so that the sample mean and SD are good estimates of the population parameters.
If we use a library of mutants of every gene in E. coli and measure their doubling times (e.g., in a high-throughput plate-reader assay), we will have lots of independent values for each mutant strain that we want to compare with the wild-type strain. We can calculate Z scores for every strain and flag those in the tails of the distribution as mutants with altered doubling times.
Another advantage of using Z scores is that they are on a standardised, unit-free scale, so values from different scales or experiments can be compared directly.
A variant of the Z score calculated with the median and the median absolute deviation (a robust Z score) is less sensitive to outliers.
3.4.2 Degrees of freedom (Df)
The distributions we’ve encountered so far assume a large sample size. We next consider two distributions that take sample and group sizes into account (the t and F distributions). To understand these, we first need to know about degrees of freedom (Df).
Df is the number of values a variable is theoretically free to take, and it depends on the sample size and the number of parameters you wish to estimate from that sample. The easiest example is finding a mean from 10 samples – once 9 numbers and the mean are fixed, the 10th number is already determined and cannot be freely chosen – so the Df is 9, or sample size – 1 = 10 – 1 = 9.
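The “last value is fixed” logic is easy to check numerically (hypothetical numbers):

```r
# A sample of 10 with a known mean of 12
first_nine <- c(10, 11, 13, 9, 14, 12, 12, 11, 13)   # freely chosen
fixed_mean <- 12
# the 10th value has no freedom left - it is fully determined
tenth <- 10 * fixed_mean - sum(first_nine)
tenth                        # 15
mean(c(first_nine, tenth))   # 12, as required; Df = 10 - 1 = 9
```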
When comparing two groups, e.g., with a Student’s t test, you would subtract 1 from the sample size in each group, i.e., total sample size – 2. More generally, Df = number of samples (n) – number of parameters estimated (k).
Df are easy to understand in simple situations, but things can get complicated really quickly – complicated enough that for ANOVAs with different numbers of samples in different groups, Df calculations are approximate.
The reason Df are important is that they determine which family of distributions are looked up by t and F tests, and this is central to finding whether the statistic is an outlier at a given significance threshold. As a side note, this is also why the t and F values and the Df from the test are more important to note than just the P value (see extreme values).
Do not include technical replicates for statistical comparisons as they artificially inflate the degrees of freedom and cause software to look up the wrong distribution. Also see the section below on pseudo-replicates to avoid this mistake.
This tutorial and this description on the Prism website are great. If you are interested in more lengthy discussions on Df, look here and here.
3.4.3 t distributions
When performing a Student’s t test on two groups, we calculate a t statistic from the data and ask: if the group means were the same, what is the likelihood of obtaining a t statistic at least as extreme as the one we have calculated?
There are some similarities between the t and Z distributions: both are symmetric, with a mean of 0.
The difference is that t distributions have heavier tails at small sample sizes, so their critical values are more extreme than those of the Z distribution.
Let’s go through the above in the case of the doubling time experiment: let’s say you want to compare the doubling times of two strains of E. coli – the wild type and a mutant. The critical t values at 5 % significance depend on the Df and are obtained with the `qt` function in R. Note that these are more extreme than the ±1.96 critical values of the Z distribution, especially at small sample sizes.
The shape of the t distribution approaches that of the Z distribution as the sample size (and therefore the Df) increases.
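This convergence is easy to see with `qt`:

```r
# Two-tailed 5 % critical values of t distributions for increasing Df
round(qt(0.975, df = c(2, 4, 10, 30, 1000)), 3)
# compared with the Z critical value of ~1.96
round(qnorm(0.975), 3)
```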
3.4.3.1 Student’s t test
Having seen how the principle of a distribution and extreme values is used, let’s briefly go through the principles of a Student’s t test and how a t statistic is calculated – no complicated formulae below, just a description in lay terms.
The Student’s t test is used to compare the means of exactly two groups (for more than 2 groups, we must use ANOVA). The null hypothesis (H0) is that the means of the two groups are the same.
The t test calculates the difference between the two means (the signal) and divides it by the pooled variability of the two groups, adjusted for sample size (the noise), to give the t statistic. If the means are similar or the pooled SD is large, the t statistic will be small and will not be an outlier of the null t distribution.
On the other hand, if the means of the two groups are different and the pooled SD is small (i.e., there is a large signal-to-noise ratio), the t statistic will be large and fall in the tails of the null distribution, giving a small P value.
The above example was of a two-tailed hypothesis (i.e., the difference between means could be positive or negative). A one-tailed hypothesis is one where the alternative hypothesis only tests for change in one direction, i.e., that the mean of one group is only greater than (or only less than) that of the other.
Pay attention to the t statistic and degrees of freedom in addition to the P value from the test.
A very large or small t value is less likely if the null hypothesis is true. For small sample sizes, a t value would have to be outside a range wider than the ±1.96 of the Z distribution to reach the 5 % significance cut-off.
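Here is a sketch of a two-sample Student’s t test on simulated doubling times (the numbers and seed are hypothetical; `var.equal = TRUE` requests the classic Student’s test):

```r
set.seed(7)   # arbitrary seed for reproducibility
wild_type <- rnorm(3, mean = 20, sd = 2)   # 3 independent measurements each
mutant    <- rnorm(3, mean = 26, sd = 2)
res <- t.test(wild_type, mutant, var.equal = TRUE)
res$statistic   # the t value (a signal-to-noise ratio)
res$parameter   # Df = 3 + 3 - 2 = 4
res$p.value     # area under both tails beyond +/- t
```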
3.4.4 F distributions
When comparing more than two groups we use analysis of variance (ANOVA). The null hypothesis of an ANOVA is that all groups have the same mean. This “omnibus” test of all groups at once is much better than multiple t tests between groups; the latter can lead to false-positive results. If the F test is “significant”, we proceed to post-hoc comparisons (see Chapter 5). ANOVAs can involve one or more factors (also called the nominal, categorical or independent variables), each of which can have multiple levels or groups (see examples below). The ANOVA table gives us the F value for comparisons of groups within a factor.
The F value for a factor is the ratio of the variance between groups (the signal) to the variance within groups (the noise).
Because F values are calculated based on ratios of sums of variances (which are squares of SD), they are always positive.
A one-way ANOVA has one factor, which can have multiple levels or groups. In our E. coli example, the factor is Genotype, and the levels could be the wild type and one or more mutant strains.
Two-way ANOVAs have two factors, and each may have different numbers of levels. For example, we could measure the doubling times of three strains of E. coli (i.e., Factor 1: Genotype, with three levels) in different types of broth media (Factor 2: Medium).
The shape of an F distribution depends on both the number of groups being compared and the total number of samples.
Thus, F distributions belong to an even larger family of probability distributions and adjust for the real-world number of groups being compared and the total number of samples. Pay attention to the denDf (at least approximately), because the larger the Df, the less conservative the critical values for the 95 % cut-off will be.
In summary, there are two Dfs that determine the shape of the F distribution: one for the numerator (numDf; number of groups – 1) and one for the denominator (denDf; total number of samples – number of groups).
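A one-way ANOVA sketch showing where the two Dfs come from (hypothetical data: 3 genotypes, 5 measurements each):

```r
set.seed(11)   # arbitrary seed; data simulated under the null
genotype <- factor(rep(c("WT", "mut1", "mut2"), each = 5))
doubling <- rnorm(15, mean = 20, sd = 2)
fit <- aov(doubling ~ genotype)
anova_table <- summary(fit)[[1]]
anova_table[["Df"]]   # numDf = 3 - 1 = 2; denDf = 15 - 3 = 12
```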
3.5 Signal to noise ratios
A recurring theme in null hypothesis significance testing (NHST) is the signal-to-noise ratio.
The difference between means is the signal – the effect we are trying to detect.
The variability within groups (e.g., the pooled SD or, accounting for sample size, the SEM) is the noise.
3.5.1 Extreme values in distributions
The t and F statistics are signal-to-noise ratios, so extreme values indicate a signal that stands out above the noise.
Like Z scores, extreme t and F values lie in the tails of their null distributions, where the area under the curve (the P value) is small.
3.6 P values
A P value is the likelihood or chance of obtaining a test statistic at least as extreme as the one calculated from the data, provided the assumptions about data collection and the fit of model residuals to the distribution are met. (To understand model residuals, we need to go further, to Linear Models. Remember that in a statistical test, what matters is that the residuals are approximately normally distributed.)
The P value is the area under the curve of the relevant probability distribution beyond the test statistic calculated from the data. Therefore, given a test statistic such as a Z, t or F value and its Df, the P value is obtained from the corresponding distribution (with the functions `pnorm`, `pt` and `pf` in R). As you can see, it is the test statistic, together with its Df, that determines the P value.
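For example, converting a test statistic into a P value is just an area-under-the-curve calculation (the statistic values below are made up for illustration):

```r
# Two-tailed P value for a t statistic of 2.5 with 4 degrees of freedom
p_t <- 2 * pt(-abs(2.5), df = 4)
p_t   # between 0.05 and 0.1, so not significant at the 5 % level
# F statistics use only the upper tail (F is always positive)
p_f <- pf(6.9, df1 = 2, df2 = 12, lower.tail = FALSE)
```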
P values do not tell you anything about how reproducible your findings are. P values only tell you how likely data at least as extreme as yours are if the null hypothesis is true. If you perform the entire study again, there is no guarantee based on this P value (however small it is) that you will get the same or even a low P value again. Power analysis tells us how to improve the reproducibility of a study.
3.6.1 Correcting for multiple comparisons
When making multiple comparisons, e.g., in post-hoc tests after ANOVA, a correction or adjustment of the P values is required. This correction for multiple comparisons is needed to avoid false-positive results, i.e., getting a small P value just by chance.
Multiple comparisons correction is easy to understand conceptually, but often easy to forget to implement. Let’s say you want to compare the doubling times of 3 strains of E. coli at 4 temperatures in 3 types of broth media. You will have a lot of comparisons (3 x 4 x 3 conditions) and lots of P values (NOTE: you should never do multiple t tests in any scenario where you are comparing more than 2 groups from the same experiment! In this example, we would do an ANOVA). When comparing many, many groups, it is not that unlikely that some of these P values will be less than 0.05 just by chance! To avoid such false-positive results (i.e., rejecting the null hypothesis when it is actually true), our P value cut-offs should be lower than 0.05 – but by how much? Instead, we adjust the P values we obtained from the tests to correct them for multiple comparisons. Bonferroni, Tukey, Sidak and Benjamini-Hochberg are commonly used methods.
Bonferroni’s correction is too strict and you lose power when comparing more than a few groups, so this is less frequently used. Any of the others will do: Tukey & Dunnett are very common and so is the false discovery rate (FDR)-based method. You should describe which one you used in the methods section of reports/manuscripts.
The FDR-based correction sets the proportion of “false discoveries” that we are prepared to accept, typically at 0.05 (called Q).
As a rule of thumb, when making more than 3 comparisons from the same data set, P values should be corrected or adjusted for multiple comparisons. The `emmeans` package for post-hoc comparisons allows various correction methods.
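In base R, `p.adjust` applies several of these corrections to a vector of P values (the values below are hypothetical):

```r
# Hypothetical unadjusted P values from several comparisons
p_raw <- c(0.001, 0.01, 0.03, 0.04, 0.2)
# Bonferroni multiplies each P by the number of comparisons (capped at 1)
p.adjust(p_raw, method = "bonferroni")
# Benjamini-Hochberg controls the false discovery rate (less strict)
p.adjust(p_raw, method = "BH")
```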
3.6.2 Reporting results
In technical reports, the results of t and F tests are reported with all key parameters, which helps the reader make better inferences. The convention is as follows:
test statistic (numDf, denDf) = t or F value from the data; P = value from the test.
For t tests, numDf is always 1, so the result is reported as t(Df) = value from test; P = value. Also see the results for the independent group test and paired-ratio t tests in Chapter 4. State whether it is a one-sample or two-sample test, whether the P value is two-tailed or one-tailed, and whether the t test was independent, paired-ratio or paired-difference.
For ANOVAs, the F value for each factor has two Dfs and it is reported as follows: F(numDf, denDF) = value from ANOVA table; P = value from ANOVA table. See results for a one-way ANOVA here in Chapter 4 and here in Chapter 5.
Explicitly reporting the test statistic (t or F) and its Dfs, rather than the P value alone, lets readers judge the strength of the result for themselves.
3.7 SD, SEM and confidence intervals
Mean and SD are the two parameters that describe the normal probability distribution, but there are other ways of describing scatter with error bars. See Appendix for example on calculating these in R for a long format table.
3.7.1 SEM
SEM (standard error of the mean) is often used to depict the variability of our sampling estimate. With a sample standard deviation SD and sample size n, SEM = SD/√n; the SEM therefore shrinks as the sample size grows.
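SEM is not a built-in R function, but it is one line to compute (hypothetical sample):

```r
# SEM = sample SD / square root of the sample size
doubling <- c(19.1, 21.3, 20.2)   # hypothetical sample of 3 doubling times
sem <- sd(doubling) / sqrt(length(doubling))
sem   # smaller than the SD, and shrinks further as n grows
```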
3.8.1 Confidence intervals
Confidence intervals (CI) express the uncertainty of our estimate of the population mean as a range around the sample mean.
With large sample sizes that meet the assumptions of a Z distribution, the 95 % CI is the sample mean ± 1.96 × SEM.
If we have a small sample size, the critical values come from the t distribution with the appropriate Df instead (use `qt` in R to get critical values), and the 95 % CI is the sample mean ± t critical value × SEM.
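A 95 % CI for a small sample therefore uses `qt` rather than 1.96; this matches what `t.test` reports (hypothetical sample):

```r
doubling <- c(19.1, 21.3, 20.2)   # hypothetical sample of 3 doubling times
n   <- length(doubling)
sem <- sd(doubling) / sqrt(n)
# 95 % CI using the t critical values for Df = n - 1
ci <- mean(doubling) + qt(c(0.025, 0.975), df = n - 1) * sem
ci
# the one-sample t test reports the same interval
t.test(doubling)$conf.int
```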
But what does the 95 % CI mean? It does not mean that there is a 95 % probability that the population mean lies within this particular interval. Rather, if we repeated the sampling many times and calculated a CI each time, ~95 % of those intervals would include the true population mean.
3.9 Confidence intervals
A given interval, calculated from one sample, either includes the true population mean or it does not.
A given experiment produces just one such interval; repeated experiments produce different intervals, and the 95 % refers to the long-run proportion of these intervals that include the population mean.
The higher the probability level (greater chance of including the mean), the wider the interval will be.
Like the SEM, the width of the CI depends on sample size and shrinks as n increases.
3.9.1 Confidence intervals and the normal distribution
Figure 3.8 shows the plot of experiments by 10 students with mean and 95% CI error bars.
3.9.2 Confidence intervals and the t distribution
For smaller sample sizes, the critical value comes from the t distribution rather than the normal (z) distribution; because the t distribution has fatter tails, the critical value is larger and the interval wider.
Let’s see this with an example.
Compare Figure 3.6 (2SD error bars) with Figure 3.8 (95% CI error bars).
3.9.3 Confidence intervals and estimation statistics
Another application of confidence intervals is in estimation statistics, which emphasises effect sizes and their CIs over P values and binary significance calls.
Confidence intervals are well described in the Handbook of Biological Statistics by John McDonald. Find out more in the section on Effect Size & confidence intervals. This excellent website dedicated to Estimation Statistics can be used for analysis and graphing. Also read this article for an explanation and a formal definition.
As discussed below, it is also easier to understand power analysis from confidence intervals. In this section we assumed large sample sizes, even though we should not have (our students only performed a handful of experiments each).
3.10 Linear models
In the sections above we discussed probability distributions of continuous variables and test statistics, such as the t and F statistics.
A “linear model” is a technical term for an equation that models or fits or explains observed data. It also helps us make predictions and understand patterns in data and their scatter. The simplest example of a linear model is a straight line fit to observed data, which is what we begin with and move to how a Student’s t test (and ANOVAs) can be solved with linear models.
Some useful terminologies of a line in XY coordinate system before we go further:
slope or coefficient – the slope of the line is also called the coefficient in linear models
intercept – the value of Y when X = 0.
outcome variable or dependent variable – the quantitative variable plotted on the Y axis. Typically, Y is the continuous variable we measure in experiments (e.g., the doubling time of E. coli).
predictor variable or independent variable – the categorical or continuous variable plotted on the X axis. In Student’s t tests and ANOVAs, these are categorical variables, also called fixed factors. Examples include Genotype, Sex, Treatment.
Factors and their levels – levels are the categorical groups within a factor, e.g., the factor Genotype with the groups or levels WT and mutant within it.
Next we consider the simple case of fitting a straight line, and then how a comparison of two groups (Student’s t test) can be performed with linear models.
3.10.1 Fitting a line to data
The equation for a straight line is Y = β0 + β1·X, where β0 is the intercept and β1 the slope.
Let’s start by plotting a straight line through a set of observed data, e.g., increasing values of Y with increasing X as shown in Figure 3.10. We use this equation often when plotting standard curves and predicting unknown values, for example, in ELISAs to determine concentrations of test samples.
With data like those in Figure 3.10, several lines with slightly different slopes and intercepts could be fit to them. How do we decide which line is the best fit to the observed data? A common way is to find the ordinary least squares (OLS) fit.
For each line fit to data we calculate the difference between the actual observed value (i.e., observed data) and the value predicted by the equation of the line (i.e., the data point on the line). This difference is called the residual.
The residuals for all data points for a line are squared and added to calculate the sum of squares of the residuals. The line with the smallest sum of squared residuals is the best fit – hence the name ordinary least squares (OLS) fit.
In a good fit the observed data are spread evenly above and below the line, so the sum of the residuals is zero. This is why the OLS fit uses the “sum of squares of residuals”: without squaring, positive and negative residuals would cancel each other out. Even though this is mathematics 101, biologists are rarely taught it: this website has a simple explanation of how software fits straight lines to data.
A key feature of a good OLS fit is that the residuals have a mean of 0 and are normally distributed. Figure 3.10 shows a scatter graph of the residuals for the fit, which have mean = 0 and SD = 0.3.
So we can get the observed data from the estimated value on the line by adding or subtracting a non-constant residual term – also called an error term, denoted ε. So the updated equation of the linear model is: Y = β0 + β1·X + ε.
Slopes of linear models are also called coefficients. The term error is often used to refer to the residual (but strictly speaking the two are slightly different).
Residuals from a good linear model are normally distributed.
In the next section we see how t tests (and ANOVAs) are just linear models fit to data. Therefore, these tests perform better (i.e., are more reliable) if the residuals are small, normally distributed, and independent. Note that it is the residuals of the test – not the raw data – that should be at least approximately normally distributed. Therefore, after doing an ANOVA or Student’s t test it is good to check residuals on a QQ (observed Quantile vs predicted Quantile) plot. Conversely, if the residuals are skewed, this is a sign that the linear model (or the result of the Student’s t test or ANOVA) is a bad fit to the data. Examples of approximately normal and skewed residuals are shown below in Figure 3.11.
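A short simulated example of fitting an OLS line with lm() and checking the residuals (the true slope, intercept and noise SD are invented for the simulation):

```r
# fit an OLS line to simulated data and check the residuals
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, mean = 0, sd = 0.3)   # known line + random noise

fit <- lm(y ~ x)        # ordinary least squares fit
coef(fit)               # estimated intercept and slope

mean(residuals(fit))    # ~0 for an OLS fit
qqnorm(residuals(fit)); qqline(residuals(fit))     # QQ plot of residuals
```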
3.10.2 t test & the straight line
In this section, we will see how a t test can be solved by fitting an OLS line to the data from two groups. The same principles apply when comparing more than two groups in one- or two-way ANOVAs (or factorial ANOVA in general) using linear mixed models discussed in the next section.
Recall that the t statistic is the ratio of the difference between the group means (the signal) to the pooled standard error of that difference (the noise).
Now let’s compare the doubling times of the WT and Δmfg strains of E. coli.
The data in Table 3.1 are plotted as boxplots in the graph on the left in Figure 3.12. This graph has a numeric Y axis and a categorical X axis, which is the correct way of plotting factors. The t.test() function in R gives the output below, which we will encounter again in Chapter 4.
Two Sample t-test
data: Doubling_time_min by Genotype
t = -8.717, df = 8, p-value = 2.342e-05
alternative hypothesis: true difference in means between group WT and group Δ mfg is not equal to 0
95 percent confidence interval:
-13.960534 -8.119466
sample estimates:
mean in group WT mean in group Δ mfg
19.22 30.26
In the results, we see that the t statistic = -8.717, degrees of freedom = 8, and the P value is 2.342e-05. If the null hypothesis were true, the expected value of the t statistic would be 0; -8.717 lies far out in the tails of a t distribution with Df = 8. The output also includes the 95% confidence interval and the means of the two groups.
The way to obtain the t statistic with a linear model is shown in the graph on the right in Figure 3.12. To fit an OLS straight line through these data points, we do a small trick – we arbitrarily allocate X = 0 to one group (WT) and X = 1 to the other (Δmfg).
This linear fit lets us do some tricks. The slope of a line is calculated as the change in Y divided by the change in X; because the change in X is exactly 1, the slope of this line equals the difference between the two group means.
For these data, residuals are calculated by subtracting the respective group mean from each value; the sum of squares of the residuals is then used to find the OLS fit. Recall that the calculation of SD and variance also involves the same process of subtracting means from observed values and adding up the squares!
We can also calculate the standard error (SE) of the slope of the OLS line (here’s the formula), which is the same as the pooled standard error of the difference calculated for a t test. This is the denominator, or “noise” term, of the t statistic.
Technically, the linear model is the equation of the line that models (or predicts or fits or explains) our observed doubling time data. The linear equation explains the change in doubling time with Genotype, and predicts the doubling time when Genotype changes from one level (X = 0) to the other (X = 1).
As with linear fits, for t tests and ANOVAs to work well, residuals should be normally or nearly normally distributed.

Taken together, by fitting a line through two group means, where categorical/nominal factors were converted into dummy 0 and 1 for convenience, we got the slope and the SE of the slope, the ratio of which is the t statistic. The degrees of freedom are calculated from total numbers of samples as we discussed above, i.e., total number of samples – 2. We’ve just performed the t test as a linear model! R does this kind of stuff in a jiffy. You can see this in action in the result of t test as a linear model in here in Chapter 4.
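As a sketch (with made-up doubling times, not the values from Table 3.1), the equivalence can be seen by running both t.test() and lm() on the same two-group data:

```r
# the two-group t test as a linear model (invented values)
df <- data.frame(
  Genotype = rep(c("WT", "KO"), each = 5),
  Doubling_time_min = c(19, 20, 18.5, 19.7, 19.9,   # "WT" group
                        30, 31, 29.5, 30.2, 30.6)   # "KO" group
)

# classic two-sample t test (equal variances)
t.test(Doubling_time_min ~ Genotype, data = df, var.equal = TRUE)

# the slope row of the linear model gives the same t statistic
# (up to sign) and the same P value
summary(lm(Doubling_time_min ~ Genotype, data = df))
```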
The residuals of linear models should be nearly or approximately normally distributed for the test result to be reliable.
Residuals can be normally distributed even when the original data are not normally distributed!
Skewed distribution of residuals suggests the model does not fit the data well and the result of the test will be unreliable.
Diagnostic plots of data and residuals, for example QQ or histogram plots, are therefore very useful. Simple data transformations can often improve the distribution of residuals and increase power of the test.
Like the conventional Student’s t test, conventional ANOVA calculations are done differently from linear models. However, linear models will give the same results for simple designs. ANOVAs as linear models are calculated with dummy 0 and 1 coding on X axis and sequentially finding slopes and SE of slopes. We see this in the next section, and Chapter 5.
In R, straight lines through X and Y points are fit using the Y ~ X formula. Similarly, linear models for t tests and ANOVAs are defined by formulae of the type Y ~ X, where Y is the quantitative (dependent) variable and X is the categorical (independent) variable. This is “read” as: Y is predicted by X. The functions lm() (base R) and lmer() (from the lme4 package) are popular for linear models and linear mixed effects models, respectively; the data are passed as a Y ~ X formula with Y and X replaced by the names of variables from your data table.
3.10.3 Further reading on linear models
I am surprised we aren’t taught the close relation between t tests and ANOVAs and straight lines/linear models. There are excellent explanations of this on Stack: here, here, here and here.
This website by Jonas Kristoffer Lindeløv also has excellent explanations of linear models.
3.11 Linear mixed effects models
A simple linear model is used when comparing independent groups where there is no hierarchy or matching in the data. In linear mixed effects models, intercepts and/or slopes of linked or matched or repeated observations can be analysed separately. Without getting into the mathematical details – this reduces the unexplained (residual) variance and thereby increases the power of the comparisons we care about.
Linear mixed models are very useful for the analysis of repeated measures (also called within-subject designs) and linked or matched data. As we will see in the examples below, matched data do not necessarily come from the same individual or subject in all experimental scenarios. These kinds of data are often handled with the lme4 package in R, a version of which is also implemented in the grafify package (see Chapters 4, 5, 6 and 7).
So what does mixed stand for in mixed effects? It indicates that we have two kinds of factors that predict the outcome variable Y – fixed factors and random factors. Fixed factors are what we are interested in, e.g. “Genotype”, within which we have two or more levels; simple linear models (e.g. with lm()) only analyse fixed factors. Random factors are unique to mixed models, which we discuss next.
3.11.1 Random factors
What is a random factor? The answer to this question can vary and is best illustrated with examples. Random factors are systematic ‘nuisance’ factors that do have an effect on observed values within groups, but whose levels we are not interested in comparing. Grouping this variability into a separate random factor enables us to reduce the total unexplained variance, and thereby gain power.
A common example of a random factor is “Experiment” – for example, if doubling times are measured for two Genotypes of E. coli in several independent experiments, each “Experiment” is a level of the random factor.
We saw in the above example of independent groups that the linear model had a single intercept for the whole data set; in a mixed model, each experiment can be allowed its own intercept (baseline).
Other examples of random factors are “Subjects” or “Individuals” or “Mice” in an experiment where we take repeated measurements of the same individual, e.g., measurements over time or before and after treatment with a drug. In such repeated-measures designs the “participants” are levels within the random factor “Subjects”. Each Subject can be allowed its own intercept in the linear model.
The syntax in mixed models has an additional type of term: Y ~ X + (1|R), where X is the fixed factor and R is the random factor. (1|R) indicates we allow random intercepts (I won’t be discussing more complex models that allow random slopes, e.g. (X|R), in this document). The syntax is simpler in grafify.
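A minimal sketch of the random-intercept syntax with lme4 (the data frame `my_data` and its variable names are hypothetical; lmerTest is optional but adds P values to the summary):

```r
# random-intercept mixed model: Genotype is the fixed factor,
# Experiment the random factor
library(lme4)
library(lmerTest)   # optional: adds P values to lmer() summaries

mixed_fit <- lmer(Doubling_time_min ~ Genotype + (1 | Experiment),
                  data = my_data)

summary(mixed_fit)  # fixed-effect slopes plus variance of the random intercepts
```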
3.11.2 E. coli example for linear mixed effects
Let’s say we want to compare the doubling times of three strains of E. coli measured in several independent experiments.
However, when we plot each experiment separately, as shown in the panel of graphs below in Figure 3.14 (note the different Y scales for each graph), we can see that in all experiments the strains differ in the same direction, even though the absolute doubling times vary considerably between experiments.
The net result is a marked reduction in the residual (unexplained) variance, and therefore greater power to detect differences between strains.
This nicely exemplifies the power of linear mixed models in detecting small differences despite large experiment-to-experiment variability. Also see the section on Experimental Design on how to plan experiments to benefit from this kind of analysis.
3.11.3 Example of an experiment with mice
Let’s say we measure the level of a cytokine in the serum of 5 mice, each given placebo or a drug, in 5 experiments (5 x 5 x 2 = 50 mice). “Experiment” can be a random factor here too, as different “batches” of mice or drug preparations may behave differently. Instead of ‘pooling’ the data of all 25 mice per group, analysis with mixed effects will be more powerful – in fact, powerful enough that we could reduce the number of mice per experiment! Indeed, with back-crossed strains of mice, given exactly the same chow and kept in super-clean environments, we must ask ourselves whether each mouse is a biologically independent unit. Would you rather use a lot of mice in one experiment, or fewer mice spread across multiple experiments? Analysis of data from mouse experiments with mixed effects models is shown in Chapter 6.
3.11.4 Example of repeated measures
Variability within a group is easier to see from a clinical example involving studies on human subjects. Let’s say you want to find out whether a new statin-like drug reduces serum cholesterol levels over time. You choose 25 random subjects, administer a placebo and measure cholesterol levels at week 1. Four weeks later the exact same individuals are given the drug for 1 week, after which we measure serum cholesterol levels for all individuals (more commonly called “Subjects”). Repeated-measures ANOVAs are also called within-subjects ANOVAs.
Here, “Treatment” is a fixed factor with two levels (placebo & drug). If we plot cholesterol levels for all 25 individuals in the placebo and drug group, the graph will look like the one on left in Figure 3.15.
“Subject” is a random factor as we have randomly sampled some of the many possible subjects. Each “Subject” is likely to have a different baseline level of serum cholesterol (i.e. a different Y-intercept). The variance between Subjects is assigned to the random factor, leaving less unexplained variance when comparing the fixed factor “Treatment”.
We can also allow each subject to have different slopes (i.e. magnitude of difference between placebo & drug on X axis). This may be necessary if there is evidence that individuals with a low baseline may have a “smaller” change with the drug and those with larger levels who may see “larger” change (or vice versa). Linear mixed effects offer a lot of flexibility in modelling observed scenarios. In this document we only consider models with random intercepts (different baselines). Refer to Further Reading for random slopes analyses.
Repeated sampling over time is quite common – whether with experiments with mice, model organisms or even in vitro experiments in multi-well cell culture plates, flasks of bacteria etc. The experimental unit (i.e. “Subject” or “Individual”) is assigned a random factor for repeated-measures analysis and “Time” is assigned as the fixed factor.
Extending the above example further, let’s say our Subjects were recruited at 5 different hospitals (a-e) for “placebo” and “drug” treatments. This kind of hierarchical or multi-level design can also be handled by mixed models, as shown in the graphs on the right and bottom in Figure 3.15. It appears that Subjects at two hospitals do not show as much of an effect with the drug as compared to the other three hospitals. Models can have as many random and fixed factors as supported by the data and experimental design – more complex linear mixed effects models are reliable only when there are sufficient data to fit them.
If values are missing (for instance, if one patient doesn’t turn up for a check-up!) the conventional repeated-measures ANOVA calculation will fail unless subjects with missing values are completely excluded. This loss of data is avoided in linear mixed effects analysis, which can still fit the model despite missing values.
3.11.5 Further reading on linear mixed effects
Here are websites by Gabriela Hajduk and Michael Clark with further descriptions of mixed effects analyses. This visualisation of hierarchical data by Michael Freeman is also great.
3.12 Experimental design
The design of an experiment determines which statistical analyses are appropriate and how powerful they will be. Reducing unexplained within-group variability through good design (e.g., blocking) means smaller differences can be detected with fewer experimental units.
3.12.1 Randomised block design
Here randomised means that experimental units (e.g. Subjects, mice, LB flasks, culture wells etc.) are randomly assigned to treatment groups. “Block” denotes the random factor. These designs are also called split-plot designs. Randomisation ensures there is no bias in whether a particular mouse or patient is given placebo or drug, or in which treatment any given LB flask receives.
The idea of split-plot designs is easy to understand from an agriculture-related example. If you are testing the impact of 5 fertilizers on a crop and you have 5 plots of land in geographically different locations to do it on, the best design is to split each piece of land into 5 split-plots to administer each fertilizer. Which fertilizer is applied to a split-plot should be decided randomly. This design lets you control for unknown factors that affect plant growth (e.g. soil quality, hydration, sunlight, wind etc. at the geographical locations). The geographical locations could affect the baseline level of growth of the crop, which we are not interested in – we want to know whether a fertilizer makes a difference irrespective of baseline growth rates across plots of land. Variability across plots is assigned to the random or “blocking” factor, and we focus on comparing differences between fertilizers in split-plots at each location. By testing all 5 fertilizers on all 5 plots of land, we reduce within-group variance and gain power.
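Random assignment of units to treatments within a block can be done in R with sample(); a tiny sketch with invented flask and treatment names:

```r
# randomly assign experimental units (e.g., flasks) to treatments within a block
set.seed(42)                                   # for a reproducible allocation
treatments <- c("control", "drugA", "drugB", "drugC")
flasks     <- paste0("flask_", 1:4)

# one block: each flask gets one treatment, chosen at random
data.frame(flask = flasks, treatment = sample(treatments))
```

Repeating this per block (plot, experiment day, plate) gives an unbiased allocation for a randomised block design.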
Examples of block designs are shown in Figure 3.16. Panel A shows an example of side-by-side experiments on 3 drugs and a control, set up as technical quadruplicates (4 wells per treatment) placed “randomly” (although this could be done better than my depiction of it!) in a 96-well plate. Each experiment will generate a mean of the technical replicates, and the 5 means for each treatment from 5 experiments will be compared statistically. If we instead use all technical replicates in our analyses, this leads to pseudoreplication, with 4 x 5 = 20 values that artificially increase denDf – this is wrong!
In B, one mouse each of four strains of mice is either “control” or “treated” in two experimental blocks. By reducing the number of mice per group per experiment, this satisfies the principles of 3R, as described further in this manuscript by Michael Festing [1]. Analysis of data from this paper is shown in Chapter 6 (and is not possible in Prism as far as I know).
Returning to microbiology: analysing our E. coli doubling times, measured side-by-side, with mixed models reduced within-Genotype variability. Here “Experiment” is the block or plot, each plot is split into 3 for our 3 strains, and flasks of LB should be assigned randomly to each strain. We may also choose to measure each strain in technical replicates, the flasks for which should also be randomly assigned. Unknowns in a given experiment (e.g., autoclave cycle, shaker speed, temperature etc.) will affect all 3 strains on a given day similarly, and mixed models analysis will help identify trends between Genotypes across experiments.
It is quite common to perform experiments side-by-side in block designs, but few of us are familiar with mixed effects analyses that are powerful for such designs. The R packages lme4 and nlme, among others, can be used for this. An easy version of lmer() is implemented in grafify with a simple user interface.
3.13 Technical versus Biologically independent replicates
Typically, we are interested in determining the inherent biological variability in our system and not our technical accuracy in making measurements. This is where the term biologically independent arises from – remember the examples of polling surveys (same households?) and clinical trials (diverse patients and controls?).
The framework of probability distributions underlying statistical comparisons requires that our sampling from the true population be independent and representative. Technical replicates improve the accuracy of our measurements, but they do not reflect the scatter of the true population and are not independent samples. Biological independence is typically seen when laboratory experiments are set up on completely different occasions with independent sources of biological materials. These could be bacterial colonies, starting cultures, cell lines, primary cells, mice, PBMC donors, and so on. Simply because you have two bacterial cultures or individuals or mice does not mean that they are representative and independent samples of the population!
We previously thought of independent E. coli doubling time experiments as drawing numbers from a big bag – technical replicates would be like reading each number 3 times; re-reading a number 3 times does not mean you have drawn 3 independent numbers from the bag.
The average (e.g., mean) of technical replicates from independent experiments should be used for statistical comparisons. Statistics on technical replicates within an experiment is not valid because such observations do not satisfy the assumptions of distributions. Consider the examples listed below that may appear to be independent replicates but may require further thought based on the quantity/variable being measured:
Two flasks of LB inoculated with the same preinoculum of E. coli to measure doubling times – even though there are two separate flasks, they are technical replicates because they received the same starting bacterial culture. The two flasks are not independent, representative samples of the true population.
Same bacterial culture used for MIC experiments in duplicate or more replicates.
Tissue culture cell lines plated in multiple wells in duplicates/triplicates – the two/three wells are technical replicates in the experiment. They originated from the same culture dish, were plated in the same cell culture medium and will respond similarly to any perturbations such as infection, cytokine or drug treatments etc. that are measured at the level of the whole well (“same household” in the polling survey). A completely independent source of cells i.e., typically an experiment performed on a different day, different passage of cells, different culture dish, would be independent and reflect the underlying biological variability that captures the true population value we are estimating by accurate, independent sampling. See Figure 3.17.
Primary cells prepared from the same mouse and plated in multi-well culture plates would also be technical replicates depending on the quantity being measured. Cells prepared from different mice or different human donors/patients could be independent.
If your experiment involves two biological systems, e.g. infection of eukaryotic cells or mice with a bacterial pathogen, independent experiments should use completely independent batches/sources of both the host (cell lines, mice, patients etc.) and pathogen on different occasions (bacteria from different colonies, newly streaked out agar plates etc.).
Laboratory-based in vitro experiments are typically independent when the whole experiment is performed on different occasions with completely different starting biological materials, including bacterial cultures, cell lines, primary cells, mice etc.
Depending on the experimental setup, two individuals, mice, bacterial cultures may not necessarily be independent samples of the true population.
The average of all technical replicates is the measured value of the parameter from that experiment.
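A sketch of this workflow in R with invented values: collapse technical replicates to one mean per independent experiment, then test only those means:

```r
# 3 technical replicates per treatment in each of 3 independent experiments
raw <- data.frame(
  Experiment = rep(paste0("Exp", 1:3), each = 6),
  Treatment  = rep(rep(c("untreated", "treated"), each = 3), times = 3),
  Value      = c(10, 11, 9, 20, 22, 21,    # Exp1 (invented values)
                 12, 13, 11, 25, 24, 26,   # Exp2
                 9, 10, 8, 19, 18, 20)     # Exp3
)

# one mean per Treatment per Experiment: these are the independent values
means <- aggregate(Value ~ Experiment + Treatment, data = raw, FUN = mean)

# paired t test on n = 3 independent experiment means, not n = 9 wells
untreated <- means$Value[means$Treatment == "untreated"]
treated   <- means$Value[means$Treatment == "treated"]
t.test(treated, untreated, paired = TRUE)
```

Note the degrees of freedom in the output reflect the number of independent experiments, not the number of wells.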

When writing Methods, you should clearly describe how many times you performed the experiment independently on different occasions, because “biologically independent” is often a confusing term. Statistical tests should not be performed on technical replicates of a single experiment because they violate the requirement of independent sampling – technical replicates should only be used to calculate average values for that experiment, which are then compared to averages from independent experiments (see Figure 3.17 above). Technical replicates used in statistical comparisons lead to what is called pseudo-replication, which we discuss next.
3.13.1 Pseudo-replicates
The t and F distributions used to look up a given statistical comparison depend on the number of independent samples drawn from the true populations (numDf and denDf). The critical cut-off values for P < 0.05 will change if the software looks up the wrong distribution! The higher the Df (more samples), the lower the cut-off values will be (e.g., critical t values approach 1.96 as Df increases) – so inflating Df with non-independent values makes tests spuriously ‘significant’.
Pseudo-replicates are non-independent values that have been incorrectly used in statistical comparisons when they shouldn’t be – they are not mathematically or statistically independent and do not satisfy the assumptions of sample probability distributions.
Going by the polling analogy, this would be akin to asking the opinions of all members of the same family or close friends who share common opinions that are not representative of the true population.
Only you – the experimentalist who designed the study – will be in a position to definitively answer the questions below that can help you figure out whether you are dealing with pseudo-replicates. Ask yourself what is the experimental unit? Are you measuring a parameter at the level of a population in a tissue culture well or bacterial culture flask or tube? Are you measuring a pooled measure in a field of view in microscopy or populations in flow cytometry? Are you measuring a quantity at the level of every bacterium individually (e.g., single-cell measurements)? Are you measuring single molecules?
Microscopy: are individual cells on a cover-slip “independent” in microscopy experiments? This could depend on the quantity being measured – are you measuring hundreds of cells in a field and generating a summary value such as the “percentage” of cells +ve for an ‘event’? If so, you have already “used up” the data on hundreds of cells to generate one quantity and should lose degrees of freedom. The 100s of cells are not really providing you with 100s of different numbers! On the other hand, if you are measuring the fluorescence intensity of a marker on 100s of cells and presenting each intensity separately, you could consider each cell as an experimental unit. However, you should ask yourself whether it is enough to measure 100s of cells in one experiment, or whether the experiment should be repeated independently multiple times. You should indeed measure 100s of cells in biologically independent experiments. Hierarchical data like these can be analysed using mixed models, which is what most of this document focuses on and the package grafify can do.
Flow cytometry: in flow cytometry, even though parameters for each cell are counted individually, quantities are often expressed as % populations. You’ve “used up” values from many cells to generate one % value and lost degrees of freedom. Percentages of populations from independent repeats should be generated for comparisons (i.e., from experiments on different days, different mice, different batches of cells etc.).
Subcellular organelles/bacteria: these may or may not be independent, too. Is each bacterium within a host cell an experimental unit? It could be, depending on what quantity you are measuring and how you express it. Heterogeneity in gene expression could result in different bacterial cells interacting differently within a single host cell – with the right experimental design you can measure cell-to-cell differences within a population of bacteria, or between subcellular organelles within a cell.
Single-molecule experiments: are observations on individual molecules in the same cell independent? How many should you measure? How many biologically/biochemically independent experiments should you perform?
Tissues from mice: are multiple measurements on the same tissue within a mouse independent? If you make a measurement on each lobe of the lung or liver, are they independent? How? Why? If you took 3 measurements on the livers of 3 mice, how many independent observations do you have? If you repeated this on another day with another 3 mice, how many independent observations do you now have?
Hierarchical experimental designs, for example where we have 50 single-cell measurements in each of five independent experiments, can be analysed using the mixed effects models we discussed above.
3.14 Data analysis pointers
If you have collected data, just go ahead and plot the data! Plotting raw data, eye-balling it, exploratory data-fitting and analyses are highly recommended regardless of what you think has happened while collecting data in your laboratory experiments.
3.14.1 Normalisation
Normalisation within an experiment is usually done to calculate fold-change over a “control”. This may involve dividing all groups by the “control” or “untreated” group and obtaining ratios as fold-changes or percentages. Once this is done, all control or untreated values will be the same (i.e. 1 or 100%) and have an SD of 0! No other group should be compared to this group, as it has no SD and violates the assumptions of the normal probability distribution. Instead, normalised data should be analysed with the one-sample t test. This test asks whether the observed values (e.g. your ratios) differ from a hypothetical value (e.g. 1 for ratios, or 0 on the log scale).
Alternatively, you could use linear mixed models, which allow different baselines, and avoid normalisation.
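A minimal sketch of the one-sample t test on normalised data (fold-change values are invented):

```r
# fold-changes (treated/control) from 4 independent experiments
fold_change <- c(2.1, 1.8, 2.5, 2.2)   # invented values

# test whether the mean log2 fold-change differs from 0
# (equivalent to testing whether the ratio differs from 1)
t.test(log2(fold_change), mu = 0)
```

Testing on the log scale is preferable because ratios are bounded below by 0 and tend to be log-normally distributed.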
3.14.2 Data transformation
After performing a Student’s t test or ANOVA, if the residuals are very skewed and not even approximately normally distributed, simple transformations of the raw data may be useful. Data transformations may make the residuals normally or nearly normally distributed. Commonly used transformations are: log (to any base), square root and reciprocal.
Residual distribution can be checked with the Shapiro-Wilk test; however, this can be unreliable when the number of samples is small (too little power to detect deviations) or very large (trivial deviations flagged as significant), so visual inspection of QQ plots is often more useful.
3.14.2.1 Percentages and probabilities
Percentages and probabilities have hard lower and upper limits: 0 and 100 for percentages and 0 and 1 for probabilities. Observed values close to the lower (0) or upper (1 or 100) limits will not be symmetrically distributed about the mean, because at these “walls” the data are ‘clipped’ (i.e., only one side of the normal distribution is possible). For example, values with a mean of 95% and an SD of 5% are unlikely to be normally distributed, because >100% is not meaningful. The same is true near 0. This can be remedied with the logit or arcsine transformations.
3.14.2.2 Log-normal distribution
Values in microbiology and immunology, and biology in general, can often vary as fold-changes, proportions or log-orders of magnitude. For example, when measuring the kill-curve of an antibiotic over time, we may observe 10-, 100-, 1000-fold killing over time. These colony forming units (CFU) are changing exponentially, by powers of 10. This just means that the logs of these numbers are normally distributed, and we should perform statistical comparisons on log-transformed data.
Exponential changes in values are also common in gene expression data such as RT-qPCR, which essentially involves a doubling of DNA per cycle, i.e., an increase by powers of 2. Log to the base 2 is often calculated for differential gene expression to bring the distribution closer to the normal distribution. Statistical comparisons are then performed on log-transformed data. Obviously, if you need to report mean fold-changes, back-transform the mean of the log values (which gives the geometric mean).
Paired-ratio tests, or t tests on log-transformed data more generally, are appropriate when values vary as fold-changes.
Results will be on the log10 scale and need to be back-transformed (e.g., 10^x in R) for reporting on the original scale.
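A sketch with invented CFU counts: compare on the log10 scale, then back-transform the difference in means for reporting:

```r
# CFU counts from 4 independent experiments (invented values)
cfu_control <- c(2.1e7, 3.5e7, 1.8e7, 2.9e7)
cfu_drug    <- c(2.0e4, 4.1e4, 1.5e4, 3.3e4)

# compare on the log10 scale, where the data are closer to normal
t.test(log10(cfu_control), log10(cfu_drug), var.equal = TRUE)

# back-transform: difference of log10 means -> fold-change
# (a ratio of geometric means)
10^(mean(log10(cfu_control)) - mean(log10(cfu_drug)))
```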
3.14.2.3 Ratios versus differences
Exploratory plots of data, such as the differences between group means or their ratios, can tell us how groups differ from each other. This is why exploring and eye-balling data is very important. Sometimes we may know this from previous experiments.
It is common in biology for parameters to vary as fold-changes, i.e., as multiples of various factors (some known and some unknown). For example, the level of cytokines in serum upon infection of a mouse may depend on dose of pathogen times the virulence gene expression times the influx of myeloid cells times microbiome times unknown factors, and so on. Similar effects operate in vitro, where the confluency of a cell line, multiplicity of infection, cell culture medium and various other parameters may influence cytokine production. Such values will be log-normally distributed.
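This can be shown with a quick simulation (the factor names and their sizes are invented for illustration): a product of several independent positive factors is log-normal, because its log is a sum of independent terms.

```r
set.seed(42)
n <- 10000
# five hypothetical multiplicative factors, each varying around 1
dose       <- exp(rnorm(n, 0, 0.4))
virulence  <- exp(rnorm(n, 0, 0.4))
influx     <- exp(rnorm(n, 0, 0.4))
microbiome <- exp(rnorm(n, 0, 0.4))
unknown    <- exp(rnorm(n, 0, 0.4))
cytokine <- dose * virulence * influx * microbiome * unknown
# raw values are right-skewed; log(cytokine) is normally distributed
# hist(cytokine); hist(log(cytokine))
```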
Let’s consider the following two scenarios with a new antimicrobial drug. In the first case we measured the doubling time of E. coli in the absence or presence of the drug on 4 different occasions. Our doubling times varied markedly between experiments – 20, 40, 50 and 80 min – but in all experiments the presence of the drug led to slower growth and an increase in doubling time of approximately 20 min. Despite the variability, the difference between control and antibiotic-treated samples is consistently ~20 min. Recall that a Student’s t test calculates differences by subtracting group means (and an F test calculates the differences by subtracting values from overall means to calculate variances between groups). So these are only appropriate when the difference between groups is consistent (i.e., the subtraction of two means is consistent).
Now consider a second antibiotic which we find increases the doubling time in all four experiments by 5-fold. That is, the doubling times in the presence of the drug are approximately 100, 200, 250 and 400 min, respectively (again, we would suspect this by plotting graphs of differences and ratios of the means in the presence and absence of the antibiotic). You’ll note that now the proportion is consistent and not the difference: the raw differences are roughly 80, 160, 200 and 320 min, so a t or F statistic calculated on the raw scale will have a large error term and low power.
In such cases, where fold-changes are involved, we can use the properties of logarithms to perform statistical tests. Mathematically, the log of a ratio of two numbers is the same as the difference between the logs of the two numbers: log(a/b) = log(a) − log(b).
Whether the ratio 100 min/ 20 min = 5 is consistent can be tested by performing the statistical tests on log-transformed values!
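A sketch of the second scenario in R (the treated values are jittered slightly so the fold-change is approximate, as it would be in a real experiment):

```r
control <- c(20, 40, 50, 80)       # doubling times (min) without drug
treated <- c(98, 205, 252, 395)    # ~5-fold longer with the drug

# the log of a ratio equals the difference of the logs:
all.equal(log(treated / control), log(treated) - log(control))  # TRUE

# a paired t test on log10 values asks whether the *ratio* is consistent
res <- t.test(log10(treated), log10(control), paired = TRUE)
10^res$estimate                    # back-transformed mean fold-change, ~5
```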
So when data vary as fold-changes, ratios or exponents, consider statistical comparisons of log-transformed data. Remember to back-transform log values to the original scale when reporting results. However, also keep in mind that log(mean of the data values) is not the same as the mean of the log-transformed values: back-transforming the mean of the logs gives the geometric mean, not the arithmetic mean.
3.15 Power analysis
Power analysis is used to determine the minimum sample size for a study so that we can detect a difference in a variable of interest of a magnitude that is biologically meaningful, at a defined significance threshold (typically α = 0.05) and power (typically 80 %).
3.15.1 Sample sizes and distributions
We previously considered the theoretical population of all E. coli doubling times with a mean of 20 min and an SD of 2 min for one strain; suppose a second strain has a slightly longer doubling time.
To design the experiment to compare the doubling times of these two strains with sufficient power – that is, to ensure it is reproducible – let’s first see how sample size affects our estimates of the population parameters. We can draw repeated random samples with rnorm in R (which just needs the mean and SD to provide random samples from the normal distribution). The means and standard deviations of 25 such simulations, each at a given sample size, are shown in Figure 3.18.
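The simulation just described can be sketched as follows (nothing here beyond the text’s population mean of 20 min and SD of 2 min):

```r
set.seed(7)
# means of repeated samples drawn with rnorm() at a given sample size
sample_means <- function(n, nsim = 25) {
  replicate(nsim, mean(rnorm(n, mean = 20, sd = 2)))
}
sd(sample_means(n = 5))    # sample means scatter widely at small n
sd(sample_means(n = 50))   # and cluster tightly around 20 min at larger n
```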
The finding depicted in Figure 3.18 is somewhat intuitive: if we draw a larger number of independent, accurate samples we are more likely to find the true population parameters. Power analysis allows us to calculate the minimum sample size so that 80 % (the typical level of power) of repeated studies will find the same result at the defined significance threshold (typically α = 0.05).
3.15.2 Power & error rates
The power of a study is linked to the significance threshold (α), the Type II error rate (β), the effect size and the sample size.
A false-positive or Type I error occurs when we assume something is true when in reality it is false (e.g., when we reject the null hypothesis when in reality it is true; we think the two group means are different when in reality they are not; we tell a patient something is wrong with them when everything is OK). Typically, the acceptable Type I error rate (α) is set at 0.05, i.e., a 1 in 20 chance of a false positive.
A false-negative or Type II error (β) occurs when we assume something is false when in reality it is true (e.g., when we fail to reject the null hypothesis when in reality it is false; we think the two group means are the same when in reality they differ). Typically, β is set at 0.2, giving a power (1 − β) of 80 %.
The connection between power, α and β is easiest to see from the sampling distributions of the two groups being compared.
The graphs in Figure 3.19 show the sampling distributions of the doubling times of the two strains.
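Power can also be estimated directly from such sampling distributions by simulation. As an illustration (the second strain’s mean of 22 min is an assumption for this sketch, not a value from the text):

```r
set.seed(11)
# proportion of simulated experiments in which a t test reaches P < 0.05,
# comparing assumed populations N(20, 2) and N(22, 2), n per group
sim_power <- function(n, nsim = 2000) {
  mean(replicate(nsim,
    t.test(rnorm(n, 20, 2), rnorm(n, 22, 2))$p.value < 0.05))
}
sim_power(5)    # well below 80 % power
sim_power(17)   # close to the conventional 80 %
```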
3.15.3 Sample size & confidence intervals
The connection between power and sample size can also be seen from confidence intervals. We saw above that confidence intervals become narrower as the sample size increases, because the standard error of the mean shrinks as 1/√n.
The error margin can be defined as:

E = z × SD / √n

where z is the critical value of the normal distribution (1.96 for a 95 % confidence interval). Rearranging this we can get the sample size:

n = (z × SD / E)²
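The rearranged formula is easy to compute in R, with z = 1.96 for a 95 % confidence interval:

```r
# minimum n to estimate a mean to within a margin of error E
n_for_margin <- function(E, SD, z = qnorm(0.975)) {
  ceiling((z * SD / E)^2)   # always round up to a whole sample
}
n_for_margin(E = 1, SD = 2)  # 16: n needed to pin the mean to within ±1 unit
```

Halving the acceptable margin E quadruples the required sample size, which is why the choice of E matters so much.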
The following page from Lisa Sullivan at Boston University School of Public Health is easy to follow, with simple worked-out examples.
As in all power calculations, the value of E to use is up to the scientist – this is not a statistical question. The difference we want to be able to detect between groups is based on what is clinically or biologically meaningful and feasible (e.g., time, cost, resources etc.).
3.15.4 Calculating power
For simple cases such as two groups (various types of t tests), the sample size can be calculated with the R package pwr [2] or the software G*Power. Examples are shown in Chapter 8.
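For example, with the pwr package for a two-sample t test at an effect size of Cohen’s d = 1 (an illustrative value; d is the difference in means divided by the SD):

```r
library(pwr)  # install.packages("pwr") if needed

# n per group for 80 % power at alpha = 0.05 and d = 1
pwr.t.test(d = 1, sig.level = 0.05, power = 0.80, type = "two.sample")
# n comes out at about 17 per group; always round up
```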
Power analysis should be performed before starting large experiments. Post-hoc power analysis should not be done to find the power of a study that’s been completed.
G*Power is a free programme that will carry out power analysis for most simple designs.
When analysing more than two groups, and especially mixed models with random factors, power is more accurately calculated by simulation than from effect sizes. What this means is the following: once we have an estimated model (e.g., from pilot data or published results), we simulate many data sets from it at candidate sample sizes and count the proportion of simulations in which the effect of interest reaches significance; that proportion is the simulated power. In R this can be done with the simr package [3].
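The recipe that simr automates can be sketched by hand for a simple design (all parameter values below are assumptions for illustration): simulate data from the assumed model many times, analyse each simulated data set, and report the proportion reaching significance.

```r
set.seed(3)
# assumed model: 2 groups measured in several experiments, with a random
# day-to-day shift shared by both groups; analysed here by a paired t test
# on per-experiment means
sim_mixed_power <- function(n_expt = 4, n_per = 6, effect = 2,
                            sd_day = 1.5, sd_resid = 2, nsim = 500) {
  mean(replicate(nsim, {
    day  <- rnorm(n_expt, 0, sd_day)   # random day effect
    ctrl <- sapply(day, function(d) mean(rnorm(n_per, 20 + d, sd_resid)))
    trt  <- sapply(day, function(d) mean(rnorm(n_per, 20 + effect + d, sd_resid)))
    t.test(trt, ctrl, paired = TRUE)$p.value < 0.05
  }))
}
sim_mixed_power()   # simulated power at the assumed settings
```

Re-running this at different `n_expt` or `n_per` shows which design reaches the desired power.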
Work with live animals requires power analysis for planning reproducible experiments. The idea is that findings from work with animals should be statistically reproducible at least 80 % of the time to avoid wastage of animals, resources, time etc. Designs should use the fewest necessary animals that achieve statistically significant (typically P < 0.05) and reproducible results.
There’s another requirement in frequentist analysis: once the sample size is decided, P values should not be calculated until the whole study is finished (i.e., all participants are recruited and analysed, or all experiments are completed). Repeatedly testing as the data accumulate (“peeking”) inflates the false-positive rate.