Hypothesis Testing
Board Coverage
| Board | Paper | Notes |
|---|---|---|
| AQA | Paper 1, 2 | Binomial tests in P1; normal tests in P2 |
| Edexcel | P1, P2 | Similar |
| OCR (A) | Paper 1, 2 | Includes critical regions |
| CIE (9709) | P1, P6 | Basic hypothesis testing in P6 |
:::info Hypothesis testing requires clear, structured answers. Always state your hypotheses, test statistic, critical value/region, comparison, and conclusion in context. :::
1. Hypotheses
1.1 Null and alternative hypotheses
Definition.
- The null hypothesis $H_0$ is the default assumption (“no effect” or “no change”).
- The alternative hypothesis $H_1$ is what we are trying to find evidence for.
1.2 One-tailed and two-tailed tests
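- One-tailed: $H_1: \theta \gt \theta_0$ (right-tailed) or $H_1: \theta \lt \theta_0$ (left-tailed), where $\theta$ is the parameter under test and $\theta_0$ is its value under $H_0$.
- Two-tailed: $H_1: \theta \neq \theta_0$.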
The choice depends on the research question. Use a one-tailed test only when you have a specific directional prediction before seeing the data.
:::caution Choosing a one-tailed test after seeing the data (because the results happen to go in one direction) is a form of $p$-hacking and is statistically invalid. The tail direction must be decided before the experiment. :::
2. Critical Values and Significance Levels
2.1 Significance level
Definition. The significance level $\alpha$ is the maximum probability of incorrectly rejecting $H_0$ when it is true. Common values: 1%, 5%, 10%.
2.2 Critical value
The critical value is the boundary between the acceptance and rejection regions.
2.3 Critical region
The critical region (rejection region) is the set of values of the test statistic that lead to rejection of $H_0$.
2.4 Actual significance level
For discrete distributions, the actual significance level may differ from the nominal level because we cannot achieve a tail probability of exactly $\alpha$.
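Example. For $X \sim B(15, 0.5)$, a right-tailed test at $\alpha = 0.05$:
$H_0: p = 0.5$, $H_1: p \gt 0.5$. We find the smallest $c$ such that $P(X \geq c) \leq 0.05$.
$P(X \geq 11) = 0.0592 \gt 0.05$, $P(X \geq 12) = 0.0176 \lt 0.05$.
Critical region: $X \geq 12$. Actual significance level: 1.76%.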
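The search for the critical value can be sketched in a few lines of Python. This is only an illustration using the standard library: the helper names `binom_upper_tail` and `right_tail_critical_value` are hypothetical, and the parameters $n = 15$, $p = 0.5$, $\alpha = 0.05$ are assumptions, chosen because they reproduce the 1.76% figure quoted above.

```python
from math import comb

def binom_upper_tail(n, p, c):
    """Return P(X >= c) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def right_tail_critical_value(n, p, alpha):
    """Smallest c with P(X >= c) <= alpha, together with that tail probability."""
    for c in range(n + 2):          # c = n + 1 gives an empty tail, so the loop always returns
        tail = binom_upper_tail(n, p, c)
        if tail <= alpha:
            return c, tail

# Assumed parameters (not stated explicitly in the text): n = 15, p = 0.5, alpha = 0.05
c, actual = right_tail_critical_value(15, 0.5, 0.05)
print(f"Critical region: X >= {c}; actual significance level = {actual:.4f}")
```

The second returned value is the actual significance level, which is never larger than the nominal $\alpha$.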
3. Type I and Type II Errors
3.1 Definitions
Definition.
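- Type I error: rejecting $H_0$ when $H_0$ is true (false positive). Its probability is the significance level $\alpha$.
- Type II error: failing to reject $H_0$ when $H_0$ is false (false negative). Its probability is denoted $\beta$.
- The power of a test is $1 - \beta$, the probability of correctly rejecting a false $H_0$.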
3.2 Relationship
Decreasing $\alpha$ (making the test stricter) generally increases $\beta$ (more false negatives). For a fixed sample size, there is always a trade-off between Type I and Type II errors.
Intuition. Think of a courtroom: a Type I error is convicting an innocent person; a Type II error is acquitting a guilty person. Making the standard of proof higher (beyond reasonable doubt) reduces Type I errors but increases Type II errors. You cannot eliminate both simultaneously.
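The trade-off can be made concrete for a right-tailed normal test of a mean with known variance. The sketch below is illustrative only: the function name `type_ii_error` is hypothetical, and the numbers ($\mu_0 = 50$, true mean 52, $\sigma = 4$, $n = 16$) are assumed for the demonstration.

```python
from statistics import NormalDist

def type_ii_error(mu0, mu_true, sigma, n, alpha):
    """beta for a right-tailed z-test of H0: mu = mu0 when the true mean is mu_true."""
    se = sigma / n ** 0.5
    # H0 is rejected when the sample mean exceeds this critical value
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # beta = P(sample mean <= critical value) under the true distribution
    return NormalDist(mu_true, se).cdf(critical)

# Hypothetical scenario: stricter alpha gives larger beta
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(50, 52, 4, 16, alpha)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

Running the loop shows $\beta$ growing as $\alpha$ shrinks, which is the trade-off described above.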
4. Hypothesis Testing Procedure
4.1 Standard method
1. Define the random variable and its distribution under $H_0$.
2. State $H_0$ and $H_1$.
3. State the significance level $\alpha$.
4. Calculate the critical region (or critical value).
5. Determine the test statistic from the data.
6. Compare the test statistic to the critical value.
7. Conclude in context.
4.2 Using $p$-values
Alternatively:
1–3. Same as above.
4. Calculate the $p$-value: the probability of obtaining a result at least as extreme as the observed value, assuming $H_0$ is true.
5. If $p$-value $\leq \alpha$: reject $H_0$. Otherwise, do not reject $H_0$.
6. Conclude in context.
5. Binomial Hypothesis Tests
5.1 Single proportion test
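Example. A coin is tossed 20 times and lands heads 15 times. Test at the 5% significance level whether the coin is biased towards heads.
Let $X$ be the number of heads: $X \sim B(20, p)$.
$H_0: p = 0.5$, $H_1: p \gt 0.5$. One-tailed, $\alpha = 0.05$.
Under $H_0$: $X \sim B(20, 0.5)$.
Find the smallest $c$ such that $P(X \geq c) \leq 0.05$.
$P(X \geq 14) = 0.0577 \gt 0.05$. $P(X \geq 15) = 0.0207 \lt 0.05$.
Critical region: $X \geq 15$. Since $x = 15$ is in the critical region, we reject $H_0$.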
There is sufficient evidence at the 5% level that the coin is biased towards heads.
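The same conclusion can be reached with a $p$-value. The following sketch (standard library only; `binom_upper_tail` is a hypothetical helper name) computes $P(X \geq 15)$ under $H_0$ for the coin example and applies the decision rule.

```python
from math import comb

def binom_upper_tail(n, p, c):
    """Return P(X >= c) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

n, p0, observed, alpha = 20, 0.5, 15, 0.05
# Right-tailed test: "at least as extreme" means 15 or more heads
p_value = binom_upper_tail(n, p0, observed)
print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Do not reject H0")
```

The critical-region and $p$-value approaches always agree, because the observed value lies in the critical region exactly when its tail probability is at most $\alpha$.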
6. Normal Hypothesis Tests
6.1 Test for a mean (known variance)
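Example. A machine fills bags with mean weight 500 g. A sample of 30 bags gives a sample mean $\bar{x}$ (in g). Test at the 5% level whether the mean weight has decreased, given the known population standard deviation $\sigma$ (in g).
$H_0: \mu = 500$, $H_1: \mu \lt 500$. $\alpha = 0.05$.
Under $H_0$: $\bar{X} \sim N(500, \sigma^2/30)$.
Test statistic: $z = \dfrac{\bar{x} - 500}{\sigma/\sqrt{30}}$.
Critical value: $z = -1.645$.
Since $z \lt -1.645$, we reject $H_0$.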
There is sufficient evidence that the mean weight has decreased.
6.2 Large sample test for a proportion
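For large $n$, under $H_0: p = p_0$: $\hat{p} \sim N\!\left(p_0, \dfrac{p_0(1 - p_0)}{n}\right)$ approximately.
Test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$.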
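As a minimal sketch, the large-sample test statistic for a proportion can be computed as below. The function name `proportion_z` is hypothetical, and the figures in the usage line (170 successes in 200 trials against a null value of 0.9) are borrowed from Problem 2 purely for illustration.

```python
from math import sqrt

def proportion_z(successes, n, p0):
    """z statistic for H0: p = p0; the standard error uses the null value p0."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z = proportion_z(170, 200, 0.9)
print(f"z = {z:.3f}")
```

The value is then compared with the appropriate standard normal critical value, for example $-1.645$ for a left-tailed test at the 5% level.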
7. Interpreting Results
:::caution “Failing to reject $H_0$” is not the same as “proving $H_0$ is true.” It means the data do not provide sufficient evidence against $H_0$. The test may lack power (sample too small, effect too weak). :::
8. One-Tailed vs Two-Tailed Tests in Depth
8.1 Choosing between one-tailed and two-tailed
Use a one-tailed test when:
- The research question has a specific directional prediction established before data collection.
- Only one direction of deviation is practically meaningful.
- The consequence of missing an effect in the unexpected direction is negligible.
Use a two-tailed test when:
- You are interested in any difference from the null value, regardless of direction.
- You want a more conservative test, in which significance is harder to reach.
- There is no strong prior reason to expect the effect in one specific direction.
Example. Testing whether a new teaching method changes exam scores:
- One-tailed ($H_1: \mu \gt \mu_0$): justified only if prior research strongly suggests the method improves scores, and you would not act on a decrease.
- Two-tailed ($H_1: \mu \neq \mu_0$): appropriate if the method is new and could either help or harm, and either outcome matters.
8.2 Critical region comparison
For a test at significance level $\alpha$, the allocation of the significance level differs:
- One-tailed: The entire $\alpha$ goes into one tail. The critical value is at the $1 - \alpha$ quantile (right-tailed) or $\alpha$ quantile (left-tailed).
- Two-tailed: $\alpha/2$ goes into each tail. The critical values are at the $\alpha/2$ and $1 - \alpha/2$ quantiles.
This means the two-tailed test has a higher bar for each individual tail.
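Example. Standard normal test at $\alpha = 0.05$:
- One-tailed ($H_1: \mu \gt \mu_0$): reject $H_0$ if $z \gt 1.645$.
- Two-tailed ($H_1: \mu \neq \mu_0$): reject $H_0$ if $z \lt -1.96$ or $z \gt 1.96$.
An observed $z$ between 1.645 and 1.96 is significant for the one-tailed test ($z \gt 1.645$) but not for the two-tailed test ($|z| \lt 1.96$).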
:::info A two-tailed test at level $\alpha$ requires a more extreme test statistic than a one-tailed test at the same $\alpha$, because the significance “budget” is split between two tails. A two-tailed test at $\alpha = 0.05$ corresponds roughly to two one-tailed tests each at $\alpha = 0.025$. :::
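The critical values for any significance level can be read off the standard normal quantile function. A small sketch using Python's `statistics.NormalDist`, with $\alpha = 0.05$ assumed for the demonstration:

```python
from statistics import NormalDist

alpha = 0.05                      # assumed significance level
std_normal = NormalDist()         # mean 0, standard deviation 1

one_tailed = std_normal.inv_cdf(1 - alpha)       # all of alpha in the upper tail
two_tailed = std_normal.inv_cdf(1 - alpha / 2)   # alpha/2 in each tail

print(f"one-tailed critical value:  {one_tailed:.3f}")
print(f"two-tailed critical values: +/-{two_tailed:.3f}")
```

Changing `alpha` shows how both cut-offs move outward as the test becomes stricter.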
8.3 Effect on power
For the same $\alpha$, a one-tailed test has greater power than a two-tailed test against an alternative in the predicted direction, because the critical value is closer to the null value. However, a one-tailed test has essentially no power (less than $\alpha$) to detect an effect in the opposite direction.
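The power comparison can be checked numerically for a standardised test statistic. This is a sketch under stated assumptions: the function names are hypothetical, `shift` is the true mean of the $z$ statistic (the effect measured in standard errors), and the values $\pm 2$ are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal

def power_one_tailed(shift, alpha):
    """Power of a right-tailed z-test when the z statistic has mean `shift`."""
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - shift)

def power_two_tailed(shift, alpha):
    """Power of a two-tailed z-test when the z statistic has mean `shift`."""
    c = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(c - shift)) + Z.cdf(-c - shift)

for shift in (2.0, -2.0):   # hypothetical effects: predicted and opposite direction
    print(f"shift = {shift:+.1f}: "
          f"one-tailed {power_one_tailed(shift, 0.05):.4f}, "
          f"two-tailed {power_two_tailed(shift, 0.05):.4f}")
```

For the positive shift the one-tailed test is more powerful; for the negative shift its power falls below $\alpha$, while the two-tailed test is unaffected by the sign.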
9. Binomial Tests with Normal Approximation
9.1 When to use the normal approximation
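When $n$ is sufficiently large, the binomial distribution can be approximated by a normal distribution. The standard conditions are that $np_0$ and $n(1 - p_0)$ are both large enough (a common rule of thumb is at least 5).
Under these conditions:
$$X \sim N\big(np_0,\; np_0(1 - p_0)\big) \text{ approximately}$$
Equivalently, for the sample proportion $\hat{p} = X/n$:
$$\hat{p} \sim N\!\left(p_0,\; \frac{p_0(1 - p_0)}{n}\right) \text{ approximately}$$
:::caution Warning: when testing, compute the mean and variance using the hypothesised value from $H_0$ ($p_0$), not the observed sample proportion $\hat{p}$. :::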
9.2 Continuity correction
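Since the binomial distribution is discrete and the normal distribution is continuous, a continuity correction improves the accuracy of the approximation. Writing $Y$ for the approximating normal variable:
- For $P(X \leq k)$, use $P(Y \leq k + 0.5)$.
- For $P(X \geq k)$, use $P(Y \geq k - 0.5)$.
- For $P(X = k)$, use $P(k - 0.5 \leq Y \leq k + 0.5)$ in the normal.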
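To see what the correction buys, the sketch below compares an exact binomial tail with the normal approximation, with and without the half-unit adjustment. The values $n = 120$, $p = 0.4$, $k = 58$ match the survey in the worked example that follows; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 120, 0.4, 58

# Exact upper tail P(X >= k) for X ~ B(n, p)
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Normal approximation with matching mean and standard deviation
approx = NormalDist(n * p, sqrt(n * p * (1 - p)))
without_cc = 1 - approx.cdf(k)        # P(Y >= k), no correction
with_cc = 1 - approx.cdf(k - 0.5)     # P(Y >= k - 0.5), continuity-corrected

print(f"exact P(X >= {k})        = {exact:.4f}")
print(f"normal, no correction    = {without_cc:.4f}")
print(f"normal, with correction  = {with_cc:.4f}")
```

In this case the corrected value lies noticeably closer to the exact tail probability than the uncorrected one.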
9.3 Worked example
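Example. Historically, 40% of students at a school take the bus. In a survey of 120 students, 58 take the bus. Test at the 5% level whether the proportion has changed.
$X \sim B(120, p)$. $H_0: p = 0.4$, $H_1: p \neq 0.4$. Two-tailed, $\alpha = 0.05$.
Check conditions using $p_0 = 0.4$: $np_0 = 48$ and $n(1 - p_0) = 72$. Conditions satisfied.
Under $H_0$: $X \sim N(48, 28.8)$ approximately, so $\sigma = \sqrt{28.8} \approx 5.367$.
Using continuity correction:
$$z = \frac{57.5 - 48}{5.367} \approx 1.77$$
Two-tailed critical values: $\pm 1.96$. Since $|1.77| \lt 1.96$, do not reject $H_0$.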
There is insufficient evidence at the 5% level that the proportion of bus users has changed.
10. Confidence Intervals
10.1 Definition
A confidence interval gives a range of plausible values for a population parameter, together with a specified level of confidence.
Definition. A $100(1 - \alpha)\%$ confidence interval for a parameter $\theta$ is an interval constructed from sample data such that, in repeated sampling, $100(1 - \alpha)\%$ of such intervals would contain the true value of $\theta$.
:::caution A 95% confidence interval does not mean there is a 95% probability that $\theta$ lies in the interval. The parameter is fixed; it either is or is not in the interval. The 95% refers to the long-run proportion of intervals (across many repeated samples) that capture $\theta$. :::
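The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below is only a demonstration: it uses the large-sample interval for a proportion (covered in Section 10.2), a hypothetical true proportion of 0.4, samples of size 120, and a fixed random seed; the function name `wald_interval` is an assumption of this example.

```python
import random

def wald_interval(successes, n, z=1.96):
    """Large-sample (Wald) 95% confidence interval for a proportion."""
    p_hat = successes / n
    margin = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - margin, p_hat + margin

random.seed(1)                        # fixed seed so the run is repeatable
true_p, n, repeats = 0.4, 120, 2000   # hypothetical population and design
hits = 0
for _ in range(repeats):
    # Draw one sample of n Bernoulli trials and build its interval
    successes = sum(random.random() < true_p for _ in range(n))
    low, high = wald_interval(successes, n)
    hits += low <= true_p <= high     # count intervals that capture the true p

coverage = hits / repeats
print(f"Proportion of intervals containing the true p: {coverage:.3f}")
```

Each individual interval either contains 0.4 or it does not; it is the long-run proportion of captures that should sit near 95%.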
10.2 95% confidence interval for a population proportion
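For large $n$, where $n\hat{p}$ and $n(1 - \hat{p})$ are both sufficiently large, the sample proportion $\hat{p}$ is approximately normal. The $100(1 - \alpha)\%$ confidence interval for $p$ is:
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
For a 95% confidence interval, $z_{\alpha/2} = 1.96$:
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
The margin of error is $1.96\sqrt{\hat{p}(1 - \hat{p})/n}$, which decreases as $n$ increases.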
10.3 Connection to hypothesis testing
There is a direct and important link between confidence intervals and two-tailed hypothesis tests:
- A $100(1 - \alpha)\%$ confidence interval contains exactly those values of $\theta_0$ that would not be rejected by a two-tailed test of $H_0: \theta = \theta_0$ at level $\alpha$.
- If $\theta_0$ falls outside the confidence interval, then $H_0$ is rejected at level $\alpha$.
- If $\theta_0$ falls inside the confidence interval, then $H_0$ is not rejected at level $\alpha$.
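Example. Using the bus survey data: $\hat{p} = 58/120 \approx 0.4833$, $n = 120$.
$$0.4833 \pm 1.96\sqrt{\frac{0.4833 \times 0.5167}{120}} = 0.4833 \pm 0.0894 = (0.394, 0.573)$$
Since $p_0 = 0.4$ lies inside $(0.394, 0.573)$, we do not reject $H_0: p = 0.4$ at the 5% level. This is consistent with the hypothesis test result in Section 9.3.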
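The interval check for the bus survey can be reproduced with a short sketch. The function name `proportion_ci_95` is hypothetical; the data (58 of 120 students, historical proportion 0.4) come from the example in Section 9.3.

```python
from math import sqrt

def proportion_ci_95(successes, n):
    """Large-sample 95% confidence interval for a proportion."""
    p_hat = successes / n
    margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_ci_95(58, 120)   # bus survey data
p0 = 0.4                                 # proportion under H0
print(f"95% CI: ({low:.3f}, {high:.3f})")
print("Do not reject H0" if low <= p0 <= high else "Reject H0")
```

Note that the interval uses $\hat{p}$ in its standard error while the test uses $p_0$, so for proportions the two methods can disagree slightly when $p_0$ is very close to an endpoint.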
11. Interpreting p-Values
11.1 Formal definition
Definition. The $p$-value is the probability of obtaining a test statistic at least as extreme as the observed value, assuming $H_0$ is true.
For a two-tailed test, “at least as extreme” means at least as far from the null value in either direction, so for a symmetric null distribution the one-tailed probability is doubled.
11.2 Decision rule
- If $p \leq \alpha$: reject $H_0$. The result is statistically significant.
- If $p \gt \alpha$: do not reject $H_0$. The result is not statistically significant.
11.3 Strength of evidence
The smaller the $p$-value, the stronger the evidence against $H_0$:
| $p$-value range | Strength of evidence against $H_0$ |
|---|---|
| $p \gt 0.10$ | Little to no evidence |
| $0.05 \lt p \leq 0.10$ | Weak evidence |
| $0.01 \lt p \leq 0.05$ | Moderate evidence |
| $0.001 \lt p \leq 0.01$ | Strong evidence |
| $p \leq 0.001$ | Very strong evidence |
11.4 Common misinterpretations
11.5 Worked example
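Example. A factory produces components with mean length 50 mm. A sample of 40 components gives $\bar{x} = 50.8$ mm. Given $\sigma = 3$ mm, find the $p$-value for testing $H_0: \mu = 50$ vs $H_1: \mu \neq 50$.
Under $H_0$: $\bar{X} \sim N(50, 3^2/40) = N(50, 0.225)$.
$$z = \frac{50.8 - 50}{\sqrt{0.225}} \approx 1.687, \qquad p = 2P(Z \gt 1.687) \approx 2(0.0458) \approx 0.092$$
Since $p = 0.092 \gt 0.05$, we do not reject $H_0$ at the 5% level.
Interpretation: If the true mean were 50 mm, there would be approximately a 9.2% chance of observing a sample mean at least as far from 50 mm as 50.8 mm. This is not unusual enough to provide convincing evidence against $H_0$.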
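A $p$-value of this kind can be computed with a short helper. The sketch below is illustrative: the function name `z_test_p_value` and its `tail` argument are hypothetical, and $\sigma = 3$ mm is an assumption, taken as the value consistent with the 9.2% quoted in the interpretation above.

```python
from statistics import NormalDist

def z_test_p_value(x_bar, mu0, sigma, n, tail="two"):
    """p-value for a z-test of H0: mu = mu0 with known sigma."""
    z = (x_bar - mu0) / (sigma / n ** 0.5)
    upper = 1 - NormalDist().cdf(z)       # P(Z >= z)
    if tail == "right":
        return upper
    if tail == "left":
        return 1 - upper
    return 2 * min(upper, 1 - upper)      # two-tailed: double the smaller tail

# sigma = 3 is an assumed value, not one confirmed by the text
p = z_test_p_value(50.8, 50, 3, 40)
print(f"two-tailed p-value = {p:.3f}")
```

Passing `tail="right"` or `tail="left"` gives the corresponding one-tailed probability without doubling.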
Problem Set
Problem 1
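A die is rolled 60 times and a 6 appears 16 times. Test at the 5% level whether the die is biased.
Solution 1
$X \sim B(60, p)$. $H_0: p = 1/6$, $H_1: p \neq 1/6$. Two-tailed, $\alpha = 0.05$.
Under $H_0$: $E(X) = 60 \times \frac{1}{6} = 10$. $\mathrm{Var}(X) = 60 \times \frac{1}{6} \times \frac{5}{6} = 8.33$.
Using normal approximation (without continuity correction): $z = \dfrac{16 - 10}{\sqrt{8.33}} \approx 2.08$.
Two-tailed: critical values $\pm 1.96$. $2.08 \gt 1.96$, so reject $H_0$.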
There is evidence at the 5% level that the die is biased.
If you get this wrong, revise: Binomial Hypothesis Tests — Section 5.
Problem 2
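A manufacturer claims that 90% of their products pass quality control. In a sample of 200, 170 pass. Test the claim at the 5% significance level.
Solution 2
$X \sim B(200, p)$. $H_0: p = 0.9$, $H_1: p \lt 0.9$. Left-tailed, $\alpha = 0.05$. $\hat{p} = 170/200 = 0.85$.
Under $H_0$: $\hat{p} \sim N(0.9, 0.9 \times 0.1/200) = N(0.9, 0.00045)$ approximately.
$z = \dfrac{0.85 - 0.9}{\sqrt{0.00045}} \approx -2.36$.
Critical value: $-1.645$. Since $-2.36 \lt -1.645$, reject $H_0$.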
There is evidence that the proportion passing quality control is less than 90%.
If you get this wrong, revise: Normal Hypothesis Tests — Section 6.
Problem 3
Explain the difference between a Type I error and a Type II error in the context of medical testing.
Solution 3
**Type I error:** The test says a healthy person is sick (false positive). This leads to unnecessary treatment and anxiety.
**Type II error:** The test says a sick person is healthy (false negative). This means the person goes untreated, potentially with serious consequences.
If you get this wrong, revise: Type I and Type II Errors — Section 3.
Problem 4
Find the critical region for a test of $H_0: p = 0.3$ vs $H_1: p \gt 0.3$ using $X \sim B(10, p)$ at the 5% level.
Solution 4
Under $H_0$: $X \sim B(10, 0.3)$. $P(X \geq 5) = 0.1503 \gt 0.05$. $P(X \geq 6) = 0.0473 \lt 0.05$.
Critical region: $X \geq 6$. Actual significance level: 4.73%.
If you get this wrong, revise: Critical Region — Section 2.3.
Problem 5
The mean lifetime of a bulb is claimed to be 1000 hours. A sample of 50 bulbs gives $\bar{x} = 985$ hours with $s = 40$ hours. Test at the 1% level whether the mean lifetime is less than 1000 hours.
Solution 5
$H_0: \mu = 1000$, $H_1: \mu \lt 1000$. $\alpha = 0.01$. Since $n = 50$ is large, $\bar{X} \sim N(1000, 40^2/50) = N(1000, 32)$ approximately.
$z = \dfrac{985 - 1000}{\sqrt{32}} \approx -2.65$.
Critical value at 1%: $-2.326$. Since $-2.65 \lt -2.326$, reject $H_0$.
There is evidence at the 1% level that the mean lifetime is less than 1000 hours.
If you get this wrong, revise: Normal Hypothesis Tests — Section 6.
Problem 6
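For $X \sim B(20, 0.5)$, find the critical region for a two-tailed test at the 10% significance level.
Solution 6
Under $H_0$: $X \sim B(20, 0.5)$. For each tail, we need $P(X \leq a) \leq 0.05$ and $P(X \geq b) \leq 0.05$.
Lower: $P(X \leq 5) = 0.0207$, $P(X \leq 6) = 0.0577$. So $a = 5$. Upper: $P(X \geq 15) = 0.0207$, $P(X \geq 14) = 0.0577$. So $b = 15$.
Critical region: $X \leq 5$ or $X \geq 15$. Actual significance level: $0.0207 + 0.0207 = 0.0414$ (4.14%).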
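A two-tailed critical region for a discrete test can also be found by accumulating probability from each end. The sketch below is one possible approach, not a prescribed method; the function names are hypothetical and the call uses the values from this problem.

```python
from math import comb

def binom_pmf(n, p, k):
    """Return P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_tailed_critical_region(n, p, alpha):
    """Largest a and smallest b with P(X <= a) <= alpha/2 and P(X >= b) <= alpha/2."""
    pmf = [binom_pmf(n, p, k) for k in range(n + 1)]
    a, lower = -1, 0.0                 # a = -1 means the lower region is empty
    while a + 1 <= n and lower + pmf[a + 1] <= alpha / 2:
        a += 1
        lower += pmf[a]
    b, upper = n + 1, 0.0              # b = n + 1 means the upper region is empty
    while b - 1 >= 0 and upper + pmf[b - 1] <= alpha / 2:
        b -= 1
        upper += pmf[b]
    return a, b, lower + upper         # cut-offs and actual significance level

a, b, actual = two_tailed_critical_region(20, 0.5, 0.10)
print(f"Critical region: X <= {a} or X >= {b}; actual level = {actual:.4f}")
```

Each tail is capped at $\alpha/2$ separately, which is why the actual level can fall well below the nominal 10%.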
If you get this wrong, revise: Critical Values and Significance Levels — Section 2.
Problem 7
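A teacher claims that the average score on a test is 70%. In a class of 25, the mean score is 66% with standard deviation 12%. Test the claim at the 5% level.
Solution 7
$H_0: \mu = 70$, $H_1: \mu \neq 70$. Two-tailed, $\alpha = 0.05$. $\bar{X} \sim N(70, 12^2/25) = N(70, 5.76)$ approximately.
$z = \dfrac{66 - 70}{\sqrt{5.76}} = \dfrac{-4}{2.4} \approx -1.67$.
Two-tailed critical values: $\pm 1.96$. $|-1.67| \lt 1.96$, so do not reject $H_0$.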
There is insufficient evidence at the 5% level that the mean score differs from 70%.
If you get this wrong, revise: Normal Hypothesis Tests — Section 6.
Problem 8
A drug is effective for 60% of patients. After a new treatment, 18 out of 25 patients are cured. Test whether the new treatment is more effective at the 5% level.
Solution 8
$X \sim B(25, p)$. $H_0: p = 0.6$, $H_1: p \gt 0.6$. Right-tailed, $\alpha = 0.05$. Under $H_0$: $X \sim B(25, 0.6)$.
$P(X \geq 19) = 0.0736 \gt 0.05$. $P(X \geq 20) = 0.0294 \lt 0.05$.
Critical region: $X \geq 20$. Since $18 \lt 20$, do not reject $H_0$.
There is insufficient evidence at the 5% level that the new treatment is more effective.
If you get this wrong, revise: Binomial Hypothesis Tests — Section 5.
Problem 9
Explain why failing to reject $H_0$ does not mean $H_0$ is true.
Solution 9
Failing to reject $H_0$ means the data is consistent with $H_0$ but does not prove it. The test may lack sufficient power to detect a real effect. For example, if a drug has a small but real benefit, a small sample may not detect it, leading us to fail to reject $H_0$ even though the drug is effective. The absence of evidence is not evidence of absence.
If you get this wrong, revise: Interpreting Results — Section 7.
Problem 10
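For a test of $H_0: \mu = 50$ vs $H_1: \mu \gt 50$ at the 5% level with $\sigma = 4$ and $n = 16$, find the probability of a Type II error if the true mean is $\mu = 52$.
Solution 10
Under $H_0$: $\bar{X} \sim N(50, 16/16) = N(50, 1)$. Critical value: $50 + 1.645 \times 1 = 51.645$.
Type II error = failing to reject $H_0$ when $\mu = 52$.
$\bar{X} \sim N(52, 1)$ under the true distribution.
$\beta = P(\bar{X} \lt 51.645 \mid \mu = 52) = P(Z \lt -0.355) \approx 0.361$.
So $\beta \approx 0.361$ and the power is $1 - \beta \approx 0.639$.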
If you get this wrong, revise: Type I and Type II Errors — Section 3.
Problem 11
A researcher tests whether a new drug changes recovery time. She uses a two-tailed test of $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$ at $\alpha = 0.05$ and obtains $z = 1.85$. (a) What is her conclusion? (b) If she had instead used a right-tailed test $H_1: \mu \gt \mu_0$ at the same level, would her conclusion change? Explain.
Solution 11
(a) Two-tailed test: critical values $\pm 1.96$. $|1.85| = 1.85 \lt 1.96$, so **do not reject** $H_0$. There is insufficient evidence that recovery time has changed.
(b) One-tailed test: critical value $1.645$. Since $1.85 \gt 1.645$, we **reject** $H_0$. There is sufficient evidence that recovery time has increased.
The conclusion changes because a one-tailed test allocates the entire 5% significance level to one tail, making the critical value less extreme. This illustrates why the choice between one-tailed and two-tailed must be made before seeing the data.
If you get this wrong, revise: One-Tailed vs Two-Tailed Tests in Depth — Section 8.
Problem 12
A survey of 200 households in a town finds that 45 regularly recycle. The national recycling rate is 20%. Test at the 5% level whether the recycling rate in this town differs from the national rate, using a normal approximation with continuity correction.
Solution 12
$X \sim B(200, p)$. $H_0: p = 0.20$, $H_1: p \neq 0.20$. Two-tailed, $\alpha = 0.05$. Check conditions using $p_0 = 0.20$: $np_0 = 40$ and $n(1 - p_0) = 160$. Conditions satisfied.
Under $H_0$: $X \sim N(40, 32)$ approximately.
Using continuity correction:
$$z = \frac{44.5 - 40}{\sqrt{32}} \approx 0.80$$
Two-tailed critical values: $\pm 1.96$. $|0.80| \lt 1.96$, so do not reject $H_0$.
There is insufficient evidence at the 5% level that the recycling rate differs from 20%.
If you get this wrong, revise: Binomial Tests with Normal Approximation — Section 9.
Problem 13
In a random sample of 150 voters, 87 support a new policy. (a) Construct a 95% confidence interval for the true proportion of support. (b) Since the interval does not contain 0.5, a politician claims "a majority of voters support the policy." Is this claim justified?
Solution 13
(a) $\hat{p} = 87/150 = 0.58$. Check: $n\hat{p} = 87$ and $n(1 - \hat{p}) = 63$.
$$0.58 \pm 1.96\sqrt{\frac{0.58 \times 0.42}{150}} = 0.58 \pm 0.079 = (0.501, 0.659)$$
(b) The 95% CI is $(0.501, 0.659)$. Since the entire interval lies above 0.5, we can reject $H_0: p = 0.5$ at the 5% level. However, the lower bound is only 0.501, so the evidence for a majority is borderline. The claim is technically supported by the test, but the narrow margin should be communicated carefully.
If you get this wrong, revise: Confidence Intervals — Section 10.
Problem 14
A 95% confidence interval for a population mean is $(48.2, 53.8)$. State whether $H_0$ would be rejected or not rejected at the 5% level for each of the following null values: (a) $\mu_0 = 50$, (b) $\mu_0 = 47$, (c) $\mu_0 = 54$. Justify using the connection between confidence intervals and hypothesis tests.
Solution 14
A 95% confidence interval contains exactly those values of $\mu_0$ that would **not** be rejected by a two-tailed test at the 5% level.
(a) $\mu_0 = 50$: $50 \in (48.2, 53.8)$, so do not reject $H_0$.
(b) $\mu_0 = 47$: $47 \notin (48.2, 53.8)$, so reject $H_0$.
(c) $\mu_0 = 54$: $54 \notin (48.2, 53.8)$, so reject $H_0$.
If you get this wrong, revise: Confidence Intervals — Section 10.
Problem 15
A sample of 35 students has mean score 62.4 with known population standard deviation $\sigma = 8$. (a) Find the $p$-value for testing $H_0: \mu = 60$ vs $H_1: \mu \gt 60$. (b) State your conclusion at the 5% significance level and interpret the $p$-value.
Solution 15
(a) Under $H_0$: $\bar{X} \sim N(60, 8^2/35) = N(60, 1.829)$.
$$z = \frac{62.4 - 60}{\sqrt{1.829}} \approx 1.775, \qquad p = P(Z \gt 1.775) \approx 0.038$$
(b) Since $p = 0.038 \lt 0.05$, reject $H_0$ at the 5% level. There is sufficient evidence that the true mean score exceeds 60. The $p$-value of 0.038 means that if the true mean were 60, there would be a 3.8% chance of observing a sample mean of 62.4 or higher. This provides moderate evidence against $H_0$.
If you get this wrong, revise: Interpreting p-Values — Section 11.
Problem 16
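For a test of $H_0: \mu = 100$ vs $H_1: \mu \gt 100$ with $\sigma = 15$, $n = 25$, and $\alpha = 0.05$: (a) Find the critical value in terms of $\bar{x}$. (b) Find the probability of a Type II error and the power of the test if the true mean is $\mu = 108$. (c) How would the power change if $\alpha$ were increased to 0.10?
Solution 16
(a) Under $H_0$: $\bar{X} \sim N(100, 15^2/25) = N(100, 9)$, so $\sigma_{\bar{X}} = 3$. Critical value: $100 + 1.645 \times 3 = 104.935$. Reject $H_0$ if $\bar{x} \gt 104.935$.
(b) Type II error when $\mu = 108$: $\beta = P(\bar{X} \lt 104.935 \mid \mu = 108)$.
$\bar{X} \sim N(108, 9)$ under the true distribution.
$$\beta = P\!\left(Z \lt \frac{104.935 - 108}{3}\right) = P(Z \lt -1.022) \approx 0.153$$
So $\beta \approx 0.153$ and power $= 1 - \beta \approx 0.847$.
(c) If $\alpha = 0.10$, the critical value becomes $100 + 1.2816 \times 3 = 103.845$.
Power $= P(\bar{X} \gt 103.845 \mid \mu = 108) = P(Z \gt -1.385) \approx 0.917$. Increasing $\alpha$ from 0.05 to 0.10 increases the power (from 0.847 to 0.917) but also increases the probability of a Type I error. This illustrates the trade-off between Type I and Type II errors.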
If you get this wrong, revise: Type I and Type II Errors — Section 3.
Common Pitfalls
- Misreading the question, particularly with ‘hence’ vs ‘hence or otherwise’: the former requires using previous work.
- Forgetting to check that solutions satisfy the original equation (especially with squaring both sides or dividing by variables).
- Dropping negative signs during algebraic manipulation; substitute back to verify your answer.
- Losing marks by not showing sufficient working; always write out each step, especially in proof questions.