# Two-Proportions Z-Test in R

# What is the two-proportions z-test?

The **two-proportions z-test** is used to compare two observed proportions. This article describes the basics of the two-proportions z-test and provides practical examples using **R software**.

For example, we have two groups of individuals:

- Group A with lung cancer: n = 500
- Group B, healthy individuals: n = 500

The number of smokers in each group is as follows:

- Group A with lung cancer: n = 500, 490 smokers, \(p_A = 490/500 = 98\%\)
- Group B, healthy individuals: n = 500, 400 smokers, \(p_B = 400/500 = 80\%\)

In this setting:

- The overall proportion of smokers is \(p = \frac{490 + 400}{500 + 500} = 89\%\)
- The overall proportion of non-smokers is \(q = 1 - p = 11\%\)
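These quantities can be reproduced in a few lines of R (the variable names are just for illustration):

```r
# Counts of smokers and group sizes from the example above
x_A <- 490; n_A <- 500   # group A (lung cancer)
x_B <- 400; n_B <- 500   # group B (healthy)

p_A <- x_A / n_A                   # 0.98, proportion of smokers in A
p_B <- x_B / n_B                   # 0.80, proportion of smokers in B
p   <- (x_A + x_B) / (n_A + n_B)   # 0.89, overall proportion of smokers
q   <- 1 - p                       # 0.11, overall proportion of non-smokers
```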

We want to know whether the proportions of smokers are the same in the two groups of individuals.

# Research questions and statistical hypotheses

Typical research questions are:

- whether the observed proportion of smokers in group A (\(p_A\)) *is equal* to the observed proportion of smokers in group B (\(p_B\))?
- whether the observed proportion of smokers in group A (\(p_A\)) *is less than* the observed proportion of smokers in group B (\(p_B\))?
- whether the observed proportion of smokers in group A (\(p_A\)) *is greater than* the observed proportion of smokers in group B (\(p_B\))?

In statistics, we can define the corresponding *null hypothesis* (\(H_0\)) as follows:

- \(H_0: p_A = p_B\)
- \(H_0: p_A \leq p_B\)
- \(H_0: p_A \geq p_B\)

The corresponding *alternative hypotheses* (\(H_a\)) are as follows:

- \(H_a: p_A \ne p_B\) (different)
- \(H_a: p_A > p_B\) (greater)
- \(H_a: p_A < p_B\) (less)

Note that:

- Hypotheses 1) are called **two-tailed tests**
- Hypotheses 2) and 3) are called **one-tailed tests**

# Formula of the test statistic

## Case of large sample sizes

The test statistic (also known as the **z-statistic**) can be calculated as follows:

\[ z = \frac{p_A-p_B}{\sqrt{pq/n_A+pq/n_B}} \]

where,

- \(p_A\) is the proportion observed in group A with size \(n_A\)
- \(p_B\) is the proportion observed in group B with size \(n_B\)
- \(p\) and \(q\) are the overall proportions

- if \(|z| < 1.96\), then the difference **is not significant** at 5%
- if \(|z| \geq 1.96\), then the difference **is significant** at 5%
- The significance level (p-value) corresponding to the z-statistic can be read in the z-table. We'll see how to compute it in R.

Note that the formula of the z-statistic is valid only when the sample sizes are large enough: \(n_Ap\), \(n_Aq\), \(n_Bp\) and \(n_Bq\) should all be \(\geq 5\).
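As a quick check, the z-statistic for the smoking example can be computed by hand in R (this hand computation is a sketch, not code from the article):

```r
# Hand-computing the z-statistic for the smoking example
p_A <- 490 / 500; n_A <- 500
p_B <- 400 / 500; n_B <- 500
p <- (490 + 400) / (n_A + n_B)   # overall proportion of smokers
q <- 1 - p                       # overall proportion of non-smokers

z <- (p_A - p_B) / sqrt(p * q / n_A + p * q / n_B)
z                    # about 9.1, well above the 1.96 cutoff
2 * pnorm(-abs(z))   # two-sided p-value from the standard normal
```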

## Case of small sample sizes

The **Fisher Exact probability test** is an excellent non-parametric technique for comparing proportions, when the two independent samples are small in size.
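Fisher's exact test is available in base R via `fisher.test()`. A minimal sketch, using hypothetical small counts (9/10 smokers in group A, 4/10 in group B; these numbers are not from the article's data):

```r
# Hypothetical small 2 x 2 table: rows = groups, columns = smoker / non-smoker
tbl <- matrix(c(9, 1,
                4, 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              smoker = c("yes", "no")))
res <- fisher.test(tbl)
res$p.value   # exact two-sided p-value
```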

# Compute two-proportions z-test in R

## R functions: prop.test()

The R function **prop.test()** can be used as follows:

```
prop.test(x, n, p = NULL, alternative = "two.sided",
          correct = TRUE)
```

- **x**: a vector of counts of successes
- **n**: a vector of counts of trials
- **alternative**: a character string specifying the alternative hypothesis
- **correct**: a logical indicating whether Yates' continuity correction should be applied where possible

Note that, by default, the function **prop.test()** uses the Yates continuity correction, which is really important if either the expected successes or failures is < 5. If you don't want the correction, use the additional argument *correct = FALSE* in the prop.test() function; the default value is TRUE. This option must be set to FALSE to make the test mathematically equivalent to the uncorrected z-test of a proportion.
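To see this equivalence on the smoking example: with *correct = FALSE*, the reported X-squared statistic is the square of the z-statistic from the formula above (a quick check, not part of the original article's code):

```r
# Without the continuity correction, X-squared equals z^2
res <- prop.test(x = c(490, 400), n = c(500, 500), correct = FALSE)
unname(res$statistic)         # X-squared, about 82.74
sqrt(unname(res$statistic))   # about 9.10, i.e. |z| from the formula
```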

## Compute two-proportions z-test

We want to know whether the proportions of smokers are the same in the two groups of individuals.

```
res <- prop.test(x = c(490, 400), n = c(500, 500))
# Printing the results
res
```

```
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80
```

The function returns:

- the value of Pearson's chi-squared test statistic
- a p-value
- a 95% confidence interval
- an estimated probability of success (the proportion of smokers in each of the two groups)

Note that:

- if you want to test whether the observed proportion of smokers in group A (\(p_A\)) *is less than* the observed proportion of smokers in group B (\(p_B\)), type this:

```
prop.test(x = c(490, 400), n = c(500, 500),
          alternative = "less")
```

- or, if you want to test whether the observed proportion of smokers in group A (\(p_A\)) *is greater than* the observed proportion of smokers in group B (\(p_B\)), type this:

```
prop.test(x = c(490, 400), n = c(500, 500),
          alternative = "greater")
```

## Interpretation of the result

The **p-value** of the test is \(2.363 \times 10^{-19}\), which is less than the significance level alpha = 0.05. We can conclude that the proportion of smokers is significantly different in the two groups, with a **p-value** of \(2.363 \times 10^{-19}\).

Note that, for a 2 x 2 table, the standard chi-square test in **chisq.test()** is exactly equivalent to **prop.test()**, but it works with data in matrix form.
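To illustrate this equivalence on the smoking data (the matrix layout here is an assumption: rows are groups, columns are smoker / non-smoker counts):

```r
# Same data as a 2 x 2 matrix
smokers <- matrix(c(490,  10,
                    400, 100),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("A", "B"), c("smoker", "non.smoker")))
chisq.test(smokers)$p.value                           # same p-value...
prop.test(x = c(490, 400), n = c(500, 500))$p.value   # ...as prop.test()
```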

## Access to the values returned by prop.test() function

The result of **prop.test()** function is a list containing the following components:

- **statistic**: the value of the test statistic (Pearson's chi-squared)
- **parameter**: the degrees of freedom of the approximate chi-squared distribution
- **p.value**: the **p-value** of the test
- **conf.int**: a confidence interval for the difference in proportions
- **estimate**: the estimated proportions in the two groups

The format of the **R** code to use for getting these values is as follows:

```
# printing the p-value
res$p.value
```

`[1] 2.363439e-19`

```
# printing the estimated proportions
res$estimate
```

```
prop 1 prop 2
0.98 0.80
```

```
# printing the confidence interval
res$conf.int
```

```
[1] 0.1408536 0.2191464
attr(,"conf.level")
[1] 0.95
```

# Infos

This analysis has been performed using **R software** (ver. 3.2.4).
