This is an Observable notebook I created to explain statistical t-tests.
In statistics, t-tests are a type of hypothesis test that allows you to compare sample means. They are called t-tests because they boil your sample data down to one number, the t-value. In this explorable, we will develop an intuitive understanding of the t-test. I do assume that you're familiar with the normal distribution, standard deviation and the standard error of the mean. If not, I recommend checking out my article on the standard error of the mean first.
There are three types of t-tests, but we'll focus on the one-sample t-test, used to compare the mean of a population with a theoretical value. Once we understand it, I promise that the other types will be a breeze. Shall we get started then?
It's recommended that you view this page on a wide screen. The plots don't look very good on smaller screens.
- A t-test is a statistical test used to determine if there is a significant difference between the means of two groups.
- The one-sample t-test is used to compare the mean of a population with a theoretical value.
- The t-value can be interpreted as the difference between the means expressed as units of standard error.
- The shape of the t-distribution, and thus the probability of getting a particular t-value, depends on the number of degrees of freedom, ν.
- A hypothesis test uses sample data to determine whether to reject the null hypothesis.
- Critical t-value approach: Compare the calculated t-value to the critical t-value for the hypothesis for a given degrees of freedom and significance level.
- P-value approach: Compare the P-value corresponding to the calculated t-value to the significance level.
- T-test demo
- A t-test on a larger sample is more likely to achieve statistical significance.
- T-tests can result in Type I and Type II errors.
- There are three main types of t-tests.
- Conclusion
- References
- Appendix
A t-test is a statistical test used to determine if there is a significant difference between the means of two groups.

[Comic: xkcd.com]
Say we want to compare the mileage of two different types of cars, A and B. We could measure it on a sample of cars from both populations and then calculate a sample mean and standard deviation for each sample. Wouldn't it be just fantastic if we could then 'infer' how the mean of the entire population of A compares to B, based on just the two small samples of data we have?
A t-test lets us do exactly that. It was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Ireland. He devised the t-test as a cheap way to monitor the quality of stout (an awful type of beer :P). He published the test in a journal (as scientists do), but was forced to use a pen name by his employer, who wanted to keep their use of statistics a secret from their competitors. The test he devised is known as Student's t-test because he published it under the pseudonym Student.

[Photo: William Sealy Gosset]
The one-sample t-test is used to compare the mean of a population with a theoretical value.
The plot below shows two samples of data and their means. We want to compare the sample means with a hypothesised mean. For the sample on the left, we can see that there is a difference between its mean and the hypothesised mean, but the sample data is so spread out that we can't easily tell if the difference is real. For the sample on the right, the data is much more tightly clustered, and the gap between the sample mean and the hypothesised mean is visually quite obvious. A t-test allows us to quantify the statistical significance of a difference between the means.
The t-value can be interpreted as the difference between the means expressed as units of standard error.
The first step in a t-test is to calculate the t-value, a type of inferential statistic:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where:
- $\bar{x}$ is the sample mean,
- $\mu_0$ is the hypothesised mean,
- $s$ is the sample standard deviation,
- $n$ is the sample size.
Feeling a little overwhelmed by the above formula? Don't worry! It was the same for me. It might help if you interpret the t-value like this:

$$t = \frac{\text{signal}}{\text{noise}} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

The numerator, the signal, is the difference between the sample mean and the hypothesised mean. The denominator, the noise, is the standard error of the mean. Therefore, the t-value is simply the difference between the means expressed as units of standard error. We can infer that:
- A large t-value indicates that the means are likely to be different.
- A small t-value suggests that the means are similar.
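To make the formula concrete, here's a minimal sketch in plain JavaScript (the sample values and hypothesised mean are made up purely for illustration):

```js
// One-sample t-value from scratch. Sample data is made up for illustration.
const sample = [72, 68, 75, 80, 71, 69, 74, 77, 70, 73];
const mu0 = 75; // hypothesised mean

const n = sample.length;
const mean = sample.reduce((a, b) => a + b, 0) / n;

// Sample standard deviation (divide by n - 1, Bessel's correction).
const s = Math.sqrt(
  sample.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1)
);

const standardError = s / Math.sqrt(n); // the "noise"
const t = (mean - mu0) / standardError; // the "signal" in units of noise

console.log({ mean, s, standardError, t }); // t ≈ -1.78 for this sample
```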
The shape of the t-distribution, and thus the probability of getting a particular t-value, depends on the number of degrees of freedom, ν.
The t-distribution is a probability distribution that is used to estimate population parameters when the sample size is small or when the population variance is unknown. If we took multiple random samples of data from the same population and compared the sample means to the population mean, we'd get slightly different t-values each time due to random sampling error. Most of the time, we'd expect to get a t-value close to zero.
The shape of the t-distribution, and thus the probability of getting a particular t-value, depends on the number of degrees of freedom, ν. For a one-sample t-test:

$$\nu = n - 1$$
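For the curious, the t-distribution's probability density with ν degrees of freedom has a closed form (you won't need it for anything that follows):

$$f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$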
The t-distribution is plotted below, alongside the normal z-distribution for comparison.
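If you'd like to see where this curve comes from, here's a rough simulation sketch in plain JavaScript: it draws many samples from a normal population whose true mean equals the hypothesised mean, computes a t-value for each, and collects them. (The sample size and population parameters are arbitrary choices.)

```js
// Simulate the sampling distribution of the t-value under the null hypothesis.
function randNormal(mean = 0, sd = 1) {
  // Box-Muller transform; 1 - Math.random() avoids log(0).
  const u1 = 1 - Math.random();
  const u2 = Math.random();
  return mean + sd * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

function tValueUnderNull(n, mu0 = 0, sd = 1) {
  // Sample from a population whose true mean IS the hypothesised mean.
  const sample = Array.from({ length: n }, () => randNormal(mu0, sd));
  const mean = sample.reduce((a, b) => a + b, 0) / n;
  const s = Math.sqrt(
    sample.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1)
  );
  return (mean - mu0) / (s / Math.sqrt(n));
}

// 100,000 t-values from samples of size 5, i.e. nu = 4 degrees of freedom.
// A histogram of these traces out the t-distribution: centred on zero,
// bell-shaped, with noticeably heavier tails than the normal curve.
const tValues = Array.from({ length: 100000 }, () => tValueUnderNull(5));
```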
There are a few subtleties I'd like you to appreciate here.
t-values follow a distribution with μ = 0 and σ = 1 because the formula for the t-value 'standardises' the sample means. If $s_{\bar{x}} = s / \sqrt{n}$ denotes the standard error of the mean, the formula for the t-value becomes:

$$t = \frac{\bar{x} - \mu_0}{s_{\bar{x}}}$$

A random variable is standardised by subtracting the mean of the distribution from the value being standardised, and then dividing by the standard deviation of the distribution. Under the null hypothesis, the sampling distribution of $\bar{x}$ has mean $\mu_0$ and standard deviation $s_{\bar{x}}$, which is exactly what the formula subtracts and divides by.
t-values follow a t-distribution rather than a normal distribution (also called the z-distribution) because, for small samples, observations are more likely to fall beyond two standard deviations from the mean than under the normal distribution. In other words, the standard error is less accurate for a small sample. The thicker tails of the t-distribution are a correction to account for the poorly estimated standard error. As the sample size is increased, the t-distribution approaches a normal distribution.
Why is the degrees of freedom 'n-1' and not just 'n'?
Imagine we have n = 3, and so we have three numbers a, b, c. Their mean is 10.
- Let's assume that a = 5 => mean of (5, b, c) = 10.
- Then we assume that b = 15 => mean of (5, 15, c) = 10.
- Can we assume a value for c?
No, we can't. c has to be 10 to satisfy the condition that the mean of (a, b, c) = 10. Therefore, for n = 3, we only have the freedom to choose values for two observations. Thus, n-1 is the number of degrees of freedom for measuring the sample mean.
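The same bookkeeping in a few lines of JavaScript:

```js
// With the mean fixed, only n - 1 observations are free; the last is forced.
const n = 3;
const mean = 10;
const freelyChosen = [5, 15]; // a and b, picked at will
const sumSoFar = freelyChosen.reduce((a, b) => a + b, 0);
const c = n * mean - sumSoFar; // c is forced: 3 * 10 - 20 = 10
console.log(c); // 10
```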
A hypothesis test uses sample data to determine whether to reject the null hypothesis.
The null hypothesis, $H_0$, and the alternative hypothesis, $H_1$, are two mutually exclusive statements. The null hypothesis states that a population parameter, such as the mean, is equal to a hypothesised value. The null hypothesis is often an initial claim based on previous analyses or specialised knowledge. The alternative hypothesis states that a population parameter is smaller, greater, or different than the hypothesised value in the null hypothesis. The alternative hypothesis is what you might believe to be true or hope to prove true by rejecting the null hypothesis.
The default null hypothesis is that the means are equal.
For example:
$H_0$: The average marks for the 8th grade = 75%.
$H_1$: The average marks for the 8th grade ≠ 75%.
We never “accept” the null hypothesis. It is safer to say that we “fail to reject” the null hypothesis, i.e., that there is insufficient statistical evidence to reject it. It turns out it's much easier to disprove a hypothesis than to positively prove one. We live in a sad little world, don't we? :P
Critical t-value approach: Compare the calculated t-value to the critical t-value for the hypothesis for a given degrees of freedom and significance level.
If we get a sample t-value close to 0, then we fail to reject the null hypothesis. However, if the magnitude of the t-value is greater than a critical t-value, then the sample likely came from a population with a mean different from the hypothesised mean.
To determine the critical t-value, we need first to determine the significance level we're after. The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. It is also equivalent to a 95% confidence level.
The next step is to determine whether we are doing a two-tailed test or one-tailed test.
- A two-tailed test is non-directional; it allows us to determine if the sample mean is different from the hypothesised mean, i.e., either greater or less than it.
- A one-tailed test is directional; it allows us to determine if the sample mean is greater than or less than the hypothesised mean, but not both. One-tailed tests are only used when we are not worried about missing an effect in the untested direction. The advantage is that a one-tailed test reaches significance with a smaller t-value.
For example, say we devised a training program to help students better understand maths. We want to understand if there is a statistically significant improvement in their marks after the training program. However, we're not worried about them being worse off after the training program. So, a one-tailed test may be used.
We can now use an online calculator to work out the critical t-value for the given degrees of freedom and significance level.
Finally, we compare the calculated t-value for the sample to the critical t-value.
- If the calculated t-value for the sample is greater than the critical t-value from the table, we can conclude that the difference between the means for the two groups is significantly different. We reject the null hypothesis in favour of the alternative hypothesis.
- On the other hand, if the calculated t-value for the sample is lower than the critical t-value, we can conclude that the difference between the means for the two groups is not significantly different. We fail to reject the null hypothesis.
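Putting the steps together, here's a sketch of the critical-value approach for a two-tailed test at α = 0.05. The lookup table below holds a handful of standard t-table entries; a real notebook would use a full table or a statistics library.

```js
// Two-tailed critical t-values at alpha = 0.05, keyed by degrees of freedom.
// A toy subset of the standard t-table; missing keys return undefined, so
// extend the table as needed.
const criticalT = { 1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
                    9: 2.262, 10: 2.228, 20: 2.086, 30: 2.042 };

function oneSampleTTest(sample, mu0) {
  const n = sample.length;
  const mean = sample.reduce((a, b) => a + b, 0) / n;
  const s = Math.sqrt(
    sample.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1)
  );
  const t = (mean - mu0) / (s / Math.sqrt(n));
  const tCrit = criticalT[n - 1]; // look up by degrees of freedom
  return { t, tCrit, rejectNull: Math.abs(t) > tCrit }; // two-tailed: use |t|
}

// For the made-up sample from earlier: |t| ≈ 1.78 < 2.262, so we fail to
// reject the null hypothesis at the 5% significance level.
console.log(oneSampleTTest([72, 68, 75, 80, 71, 69, 74, 77, 70, 73], 75));
```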
Let's see this in action. :)
If you don't have access to a computer, it's possible to look up t-values required for a given confidence level in a t-table as shown below. Why don't you try working out the critical t-value in the plot above using the t-table?

P-value approach: Compare the P-value corresponding to the calculated t-value to the significance level.
The P-value approach is sort of the opposite of the critical t-value approach.
We determine the probability, assuming the null hypothesis were true, of observing a more extreme test statistic in the direction of the alternative hypothesis than the one observed. This probability is called the P-value.
We then compare the P-value to the significance level, α:
- If P-value ≤ α, reject the null hypothesis in favour of the alternative hypothesis.
- If P-value > α, do not reject the null hypothesis.
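Without a statistics library, one honest way to approximate a P-value is simulation: generate many t-values under the null hypothesis and count how often they come out at least as extreme as the one we observed. The sketch below reuses the `tValueUnderNull` helper from the simulation sketch earlier (Observable cells share scope, so this is idiomatic in a notebook; elsewhere, copy the function in).

```js
// Monte Carlo estimate of a two-tailed P-value. Approximate by construction;
// accuracy improves with more trials.
function simulatedPValue(tObserved, n, trials = 100000) {
  let extreme = 0;
  for (let i = 0; i < trials; i++) {
    // "More extreme" for a two-tailed test means larger in absolute value.
    if (Math.abs(tValueUnderNull(n)) >= Math.abs(tObserved)) extreme++;
  }
  return extreme / trials;
}

// For t ≈ -1.78 with n = 10, this lands around 0.11, which is > 0.05,
// agreeing with the critical-value approach: fail to reject the null.
console.log(simulatedPValue(-1.78, 10));
```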
T-test demo
Now that we're t-test experts, let's give it a go :)
Use the inputs below to define a population, a sample size and a hypothesised mean to compare the sample to. Following the sample visualisation, there are further inputs to customise the t-test to be applied. You can then see the familiar t-distribution curve and the outcome of the t-test.
A t-test on a larger sample is more likely to achieve statistical significance.
Take care when interpreting t-tests. If a t-test doesn't achieve statistical significance, it could be because:
- The difference (signal) isn't large enough.
- The variation (noise) within the samples is too great.
- The sample is too small.
It may be possible to reduce the noise or increase the sample size to get better statistical significance. On the other hand, an extremely large sample can produce statistically significant results even when the difference is very small and has no practical consequence. In other words, a t-test does not tell us whether the difference in means is relevant. That's up to us to determine based on our knowledge of the problem domain.
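Both effects fall straight out of the t-value formula: for a fixed signal and noise, t grows with the square root of the sample size. A quick illustration (the numbers are made up):

```js
// Same mean difference (signal) and spread, different sample sizes.
const diff = 0.5; // difference between sample mean and hypothesised mean
const s = 5;      // sample standard deviation
for (const n of [10, 100, 1000, 10000]) {
  const t = diff / (s / Math.sqrt(n));
  console.log(n, t.toFixed(2)); // 0.32, 1.00, 3.16, 10.00 — t grows with √n
}
```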
T-tests can result in Type I and Type II errors.
Type I: Rejecting the null hypothesis when it is in fact true. A false positive.
Type II: Not rejecting the null hypothesis when in fact the alternative hypothesis is true. A false negative.

$H_0$: Subject is not pregnant.
$H_1$: Subject is pregnant.
The t-test isn't foolproof. There are lots of ways it could go wrong. Bad luck, for example! You could also come to wrong conclusions if you are applying the t-test to data that violates the underlying assumptions of a t-test:
- Dependent variable is measured on a continuous or ordinal scale.
- Sample data is collected from a representative, randomly selected portion of the population.
- Sample data follows a bell-shaped distribution curve.
- Reasonably large sample size is used.
- Homogeneity of variance: the standard deviations of samples are approximately equal.
- No significant outliers.
Type II errors, in particular, are easily made if the sample size is too small. To minimise the probability of Type II errors, we can find out the sample size required using power analysis.
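As a taste of what power analysis involves, here's a back-of-the-envelope sample-size sketch using the common normal approximation $n \approx \left((z_{\alpha/2} + z_\beta)\,\sigma/\delta\right)^2$. The effect size below is a made-up example; a proper power analysis would use the t-distribution itself and typically gives a slightly larger n.

```js
// Rough sample size for a one-sample test: alpha = 0.05 (two-tailed),
// power = 0.80, using the standard normal quantiles for those levels.
const zAlpha = 1.96;    // z for two-tailed alpha = 0.05
const zBeta = 0.84;     // z for power = 0.80
const effectSize = 0.5; // delta / sigma: smallest effect we care to detect

const nRequired = Math.ceil(((zAlpha + zBeta) / effectSize) ** 2);
console.log(nRequired); // 32 observations under this approximation
```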
There are three main types of t-tests.
Now that we understand how the one-sample t-test works, it'll be easier to appreciate the other types.
- The one-sample t-test is used to compare the mean of a population with a theoretical value.
  e.g. compare the average marks of a sample of students with a hypothesised value of 60%.
- The unpaired two-sample t-test is used to compare the means of two independent samples.
  e.g. compare the average marks between samples of English and Indian students, to understand if there is a significant difference in academic ability between the two populations.
- The paired t-test is used to compare the means between two related groups of samples.
  e.g. compare the average marks of a sample of students before and after a training program, to understand its effectiveness.
The approach/formulas may be slightly different, but the principles are the same. You can read more about the three types here - Understanding t-tests: 1-sample, 2-sample, and paired t-tests.
Conclusion
Phew! That took forever, huh? We've covered a lot of content, but hopefully, you can now appreciate that t-tests are pretty simple after all. Most things are simple if we take the time to understand them properly. If you're still struggling, go through it one more time, and maybe check out some of the references I used (linked in the section below).
I'm also more than happy to answer any questions/comments you may have. In particular, if you have any tips for how I could make this content more concise/clear, I'd be very grateful!
Most of all, I hope you enjoyed learning t-tests with me. :)
References
- StatsCast: What is a t-test?. YouTube video.
- Student's t-test. YouTube video.
- t-test. Investopedia.
- What Is a t-test? And Why Is It Like Telling a Kid to Clean Up that Mess in the Kitchen?. The Minitab Blog.
- What Are T Values and P Values in Statistics?. The Minitab Blog.
- Understanding t-Tests: 1-sample, 2-sample, and Paired t-Tests. The Minitab Blog.
- Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics. The Minitab Blog.
- McDonald, J.H. Student's t–test for two samples, 2014. Handbook of Biological Statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland.
- Hypothesis testing. PennState.