<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Tue, 14 Apr 2026 11:08:22 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Easy Guides]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/wiki/19" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles of the category: R Basic Statistics]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[Normality Test in R]]></title>
			<link>https://www.sthda.com/english/wiki/normality-test-in-r</link>
			<guid>https://www.sthda.com/english/wiki/normality-test-in-r</guid>
			<description><![CDATA[<!-- START HTML -->

  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">
<div id="TOC">
<ul>
<li><a href="#install-required-r-packages">Install required R packages</a></li>
<li><a href="#load-required-r-packages">Load required R packages</a></li>
<li><a href="#import-your-data-into-r">Import your data into R</a></li>
<li><a href="#check-your-data">Check your data</a></li>
<li><a href="#assess-the-normality-of-the-data-in-r">Assess the normality of the data in R</a><ul>
<li><a href="#case-of-large-sample-sizes">Case of large sample sizes</a></li>
<li><a href="#visual-methods">Visual methods</a></li>
<li><a href="#normality-test">Normality test</a></li>
</ul></li>
<li><a href="#infos">Infos</a></li>
</ul>
</div>
<p><br/></p>
<p>Many statistical tests, including correlation, regression, t-test, and analysis of variance (ANOVA), assume certain characteristics about the data. They require the data to follow a <strong>normal distribution</strong> (also called a <strong>Gaussian distribution</strong>). These tests are called <strong>parametric tests</strong>, because their validity depends on the distribution of the data.</p>
<p><span class="warning">Normality and the other assumptions made by these tests should be taken seriously in order to draw reliable interpretations and conclusions from the research.</span></p>
<p>Before using a parametric test, we should perform some <strong>preliminary tests</strong> to make sure that the test assumptions are met. In situations where the assumptions are violated, <strong>non-parametric</strong> tests are recommended.</p>
<p><span class="success">Here, we’ll describe how to check the normality of the data by visual inspection and by significance tests.</span></p>


<div id="install-required-r-packages" class="section level1">
<h1>Install required R packages</h1>
<ol style="list-style-type: decimal">
<li><strong>dplyr</strong> for data manipulation</li>
</ol>
<pre class="r"><code>install.packages("dplyr")</code></pre>
<ol start="2" style="list-style-type: decimal">
<li><a href="https://www.sthda.com/english/english/wiki/ggpubr-r-package-ggplot2-based-publication-ready-plots"><strong>ggpubr</strong></a> for easy ggplot2-based data visualization</li>
</ol>
<ul>
<li>Install the latest version from GitHub as follows:</li>
</ul>
<pre class="r"><code># Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")</code></pre>
<ul>
<li>Or, install from CRAN as follows:</li>
</ul>
<pre class="r"><code>install.packages("ggpubr")</code></pre>
</div>
<div id="load-required-r-packages" class="section level1">
<h1>Load required R packages</h1>
<pre class="r"><code>library("dplyr")
library("ggpubr")</code></pre>
</div>
<div id="import-your-data-into-r" class="section level1">
<h1>Import your data into R</h1>
<ol style="list-style-type: decimal">
<li><p><strong>Prepare your data</strong> as specified here: <a href="https://www.sthda.com/english/english/wiki/best-practices-for-preparing-your-data-set-for-r">Best practices for preparing your data set for R</a></p></li>
<li><p><strong>Save your data</strong> in an external tab-delimited .txt file or a .csv file</p></li>
<li><p><strong>Import your data into R</strong> as follows:</p></li>
</ol>
<pre class="r"><code># If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())</code></pre>
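<p>For scripted, reproducible analyses, it can be easier to pass an explicit file path instead of choosing the file interactively. The sketch below is only an illustration: it writes the built-in ToothGrowth data to a temporary .csv file (the file and the names <em>csv_path</em> and <em>imported</em> are made up for this example) and reads it back with <strong>read.csv</strong>():</p>
<pre class="r"><code># Hypothetical file, created only so that the example runs as-is
csv_path <- tempfile(fileext = ".csv")
write.csv(ToothGrowth, csv_path, row.names = FALSE)

# Import using an explicit path instead of file.choose()
imported <- read.csv(csv_path)
head(imported, 3)</code></pre>
<p>In a real project, <em>csv_path</em> would simply be the path to your own file.</p>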
<p>Here, we’ll use the built-in R data set named <a href="https://www.sthda.com/english/english/wiki/r-built-in-data-sets#toothgrowth">ToothGrowth</a>.</p>
<pre class="r"><code># Store the data in the variable my_data
my_data <- ToothGrowth</code></pre>
</div>
<div id="check-your-data" class="section level1">
<h1>Check your data</h1>
<p>We start by displaying a random sample of 10 rows using the function <strong>sample_n</strong>() [in the <strong>dplyr</strong> package].</p>
<p>Show 10 random rows:</p>
<pre class="r"><code>set.seed(1234)
dplyr::sample_n(my_data, 10)</code></pre>
<pre><code>    len supp dose
7  11.2   VC  0.5
37  8.2   OJ  0.5
36 10.0   OJ  0.5
58 27.3   OJ  2.0
49 14.5   OJ  1.0
57 26.4   OJ  2.0
1   4.2   VC  0.5
13 15.2   VC  1.0
35 14.5   OJ  0.5
27 26.7   VC  2.0</code></pre>
</div>
<div id="assess-the-normality-of-the-data-in-r" class="section level1">
<h1>Assess the normality of the data in R</h1>
<p><span class="question">We want to test if the variable <em>len</em> (tooth length) is normally distributed.</span></p>
<div id="case-of-large-sample-sizes" class="section level2">
<h2>Case of large sample sizes</h2>
<p>If the sample size is large enough (a common rule of thumb is n > 30), moderate departures from normality usually have little effect on parametric tests, so these tests can often still be used.</p>
<p><span class="success"><strong>The central limit theorem</strong> tells us that, whatever the distribution of the data (provided its variance is finite), the sampling distribution of the mean tends to be normal if the sample is large enough (n > 30).</span></p>
<p>However, to be consistent, normality can be checked by visual inspection [<strong>normal plots (histogram)</strong>, <strong>Q-Q plot</strong> (quantile-quantile plot)] or by <strong>significance tests</strong>.</p>
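<p>The central limit theorem can be illustrated with a small simulation. This is only a sketch under assumed settings (an exponential distribution, samples of size 40 and 5000 replicates, all chosen for illustration): the individual values are strongly skewed, yet the sample means are centred on the population mean with a spread close to the theoretical standard error.</p>
<pre class="r"><code>set.seed(123)
# 5000 samples of size n = 40 from a right-skewed (exponential) distribution
n <- 40
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

# The means are centred on the population mean (1), with a
# standard deviation close to the theoretical value 1 / sqrt(n)
mean(sample_means)
sd(sample_means)
1 / sqrt(n)

# The histogram of the sample means is roughly bell shaped
hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean")</code></pre>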
</div>
<div id="visual-methods" class="section level2">
<h2>Visual methods</h2>
<p><strong>Density plot</strong> and <strong>Q-Q plot</strong> can be used to check normality visually.</p>
<ol style="list-style-type: decimal">
<li><strong>Density plot</strong>: the <strong>density</strong> plot provides a visual judgment about whether the distribution is bell shaped.</li>
</ol>
<pre class="r"><code>library("ggpubr")
ggdensity(my_data$len, 
          main = "Density plot of tooth length",
          xlab = "Tooth length")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/normality-test-density-1.png" width="336" style="margin-bottom:10px;" /></p>
<ol start="2" style="list-style-type: decimal">
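<li><strong>Q-Q plot</strong>: the <strong>Q-Q plot</strong> (or quantile-quantile plot) plots the quantiles of a given sample against the corresponding quantiles of the normal distribution. A reference line is also plotted; the closer the points are to this line, the closer the sample is to a normal distribution.</li>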
</ol>
<pre class="r"><code>library(ggpubr)
ggqqplot(my_data$len)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/normality-test-qq-plot-1.png" width="384" style="margin-bottom:10px;" /></p>
<p>It’s also possible to use the function <strong>qqPlot</strong>() [in <strong>car</strong> package]:</p>
<pre class="r"><code>library("car")
qqPlot(my_data$len)</code></pre>
<p><span class="success">As all the points fall approximately along this reference line, we can assume normality.</span></p>
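<p>If neither <strong>ggpubr</strong> nor <strong>car</strong> is available, a similar check can be sketched with base R. The code below is an illustration, not a replacement for the plots above: <strong>qqnorm</strong>() returns the plotted coordinates, so the agreement between sample and theoretical quantiles can also be summarised by their correlation. Note that <strong>qqline</strong>() draws a line through the first and third quartiles.</p>
<pre class="r"><code>my_data <- ToothGrowth

# Coordinates of the Q-Q plot, without drawing it
qq <- qqnorm(my_data$len, plot.it = FALSE)

# Correlation between theoretical (x) and sample (y) quantiles;
# values close to 1 suggest approximate normality
cor(qq$x, qq$y)

# Base R version of the plot, with a line through the quartiles
qqnorm(my_data$len)
qqline(my_data$len)</code></pre>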
</div>
<div id="normality-test" class="section level2">
<h2>Normality test</h2>
<p>Visual inspection, described in the previous section, is subjective and can be unreliable. It’s possible to use a <strong>significance test</strong> comparing the sample distribution to a normal one in order to ascertain whether or not the data show a serious deviation from normality.</p>
<p>There are several <strong>normality tests</strong>, such as the <strong>Kolmogorov-Smirnov (K-S) normality test</strong> and the <strong>Shapiro-Wilk test</strong>.</p>
<p><span class="warning"> The null hypothesis of these tests is that “the sample distribution is normal”. If the test is <strong>significant</strong>, the distribution is non-normal.</span></p>
<p>The <strong>Shapiro-Wilk method</strong> is widely recommended for normality testing and provides better power than K-S. It is based on the correlation between the data and the corresponding normal scores.</p>
<p><span class="notice">Note that normality tests are sensitive to sample size. Small samples most often pass normality tests, because the tests have little power to detect a deviation, whereas with very large samples even a minor deviation can be significant. Therefore, it’s important to combine visual inspection and significance tests in order to make the right decision. </span></p>
<p>The R function <strong>shapiro.test</strong>() can be used to perform the Shapiro-Wilk test of normality for one variable (univariate):</p>
<pre class="r"><code>shapiro.test(my_data$len)</code></pre>
<pre><code>
    Shapiro-Wilk normality test
data:  my_data$len
W = 0.96743, p-value = 0.1091</code></pre>
<p><span class="success">From the output, the p-value is greater than 0.05, implying that the distribution of the data is not significantly different from a normal distribution. In other words, we can assume normality.</span></p>
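<p>When the decision has to be made inside a script, the components of the test result can be accessed directly. The short sketch below stores the result in a variable (the name <em>res</em> and the 5% threshold are choices made for this example) and extracts the W statistic and the p-value:</p>
<pre class="r"><code>my_data <- ToothGrowth
res <- shapiro.test(my_data$len)

res$statistic   # W statistic
res$p.value     # p-value

# TRUE means no significant deviation from normality at the 5% level
res$p.value > 0.05</code></pre>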
</div>
</div>
<div id="infos" class="section level1">
<h1>Infos</h1>
<p><span class="warning"> This analysis has been performed using <strong>R software</strong> (ver. 3.2.4). </span></p>
</div>
<script>jQuery(document).ready(function () {
    jQuery('#rdoc h1').addClass('wiki_paragraph1');
    jQuery('#rdoc h2').addClass('wiki_paragraph2');
    jQuery('#rdoc h3').addClass('wiki_paragraph3');
    jQuery('#rdoc h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->

<!-- END HTML -->]]></description>
			<pubDate>Fri, 22 May 2020 01:00:44 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Statistical Tests and Assumptions]]></title>
			<link>https://www.sthda.com/english/wiki/statistical-tests-and-assumptions</link>
			<guid>https://www.sthda.com/english/wiki/statistical-tests-and-assumptions</guid>
			<description><![CDATA[<!-- START HTML -->

  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">
<div id="TOC">
<ul>
<li><a href="#research-questions-and-corresponding-statistical-tests">Research questions and corresponding statistical tests</a></li>
<li><a href="#statistical-test-requirements-assumptions">Statistical test requirements (assumptions)</a><ul>
<li><a href="#how-to-assess-the-normality-of-the-data">How to assess the normality of the data?</a></li>
<li><a href="#how-to-assess-the-equality-of-variances">How to assess the equality of variances?</a></li>
</ul></li>
<li><a href="#infos">Infos</a></li>
</ul>
</div>
<p><br/></p>
<br/>
<div class="block">
Here we’ll describe research questions and the corresponding statistical tests, as well as the test assumptions.
</div>
<br/>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/images/statistical-test-assumptions.png" alt="Statistical tests and assumptions" /><p class="caption">Statistical tests and assumptions</p>
</div>
<div id="research-questions-and-corresponding-statistical-tests" class="section level1">
<h1>Research questions and corresponding statistical tests</h1>
<p>The most popular research questions include:</p>
<br/>
<div class="question">
<ol style="list-style-type: decimal">
<li>whether <strong>two variables</strong> (n = 2) are <strong>correlated</strong> (i.e., associated)</li>
<li>whether <strong>multiple variables</strong> (n > 2) are <strong>correlated</strong></li>
<li>whether <strong>two groups</strong> (n = 2) of samples <strong>differ</strong> from each other</li>
<li>whether <strong>multiple groups</strong> (n > 2) of samples <strong>differ</strong> from each other</li>
<li>whether the <strong>variability</strong> of two samples differs</li>
</ol>
</div>
<p><br/></p>
<p>Each of these questions can be answered using the following statistical tests:</p>
<br/>
<div class="success">
<ol style="list-style-type: decimal">
<li><strong>Correlation test</strong> between two variables</li>
<li><strong>Correlation matrix</strong> between multiple variables</li>
<li><strong>Comparing the means of two groups</strong>:
<ul>
<li><strong>Student’s t-test</strong> (parametric)</li>
<li><strong>Wilcoxon rank test</strong> (non-parametric)</li>
</ul></li>
<li><strong>Comparing the means of more than two groups</strong>
<ul>
<li><strong>ANOVA test</strong> (analysis of variance, parametric): extension of t-test to compare more than two groups.</li>
<li><strong>Kruskal-Wallis rank sum test</strong> (non-parametric): extension of Wilcoxon rank test to compare more than two groups</li>
</ul></li>
<li><strong>Comparing the variances</strong>:
<ul>
<li>Comparing the variances of two groups: <strong>F-test</strong> (parametric)</li>
<li>Comparison of the variances of more than two groups: <strong>Bartlett’s test</strong> (parametric), <strong>Levene’s test</strong> (parametric) and <strong>Fligner-Killeen test</strong> (non-parametric)</li>
</ul></li>
</ol>
</div>
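<p>As a rough orientation, the tests listed in items 1 to 4 map onto base R functions as sketched below. The data sets (ToothGrowth and iris) and the variable choices are assumptions made only for illustration. Note that <strong>t.test</strong>() performs Welch’s test by default; <em>var.equal = TRUE</em> requests the classical Student’s t-test.</p>
<pre class="r"><code>my_data <- ToothGrowth

# 1. Correlation test between two variables
cor.test(my_data$len, my_data$dose)

# 2. Correlation matrix between multiple numeric variables
cor(iris[, 1:4])

# 3. Comparing two groups
t.test(len ~ supp, data = my_data, var.equal = TRUE)    # Student's t-test
wilcox.test(len ~ supp, data = my_data, exact = FALSE)  # Wilcoxon rank test

# 4. Comparing more than two groups
summary(aov(len ~ factor(dose), data = my_data))        # ANOVA
kruskal.test(len ~ factor(dose), data = my_data)        # Kruskal-Wallis</code></pre>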
<p><br/></p>
</div>
<div id="statistical-test-requirements-assumptions" class="section level1">
<h1>Statistical test requirements (assumptions)</h1>
<p>Many statistical procedures, including correlation, regression, t-test, and analysis of variance, assume certain characteristics about the data. Generally they assume that:</p>
<ul>
<li>the data are <strong>normally distributed</strong></li>
<li>and the <strong>variances</strong> of the groups to be compared are <strong>homogeneous</strong> (equal).</li>
</ul>
<p><span class="warning">These assumptions should be taken seriously to draw reliable interpretation and conclusions of the research.</span></p>
<p><span class="success">These tests - correlation, t-test and ANOVA - are called <strong>parametric tests</strong>, because their validity depends on the distribution of the data.</span></p>
<p>Before using a parametric test, we should perform some <strong>preliminary tests</strong> to make sure that the test assumptions are met. In situations where the assumptions are violated, <strong>non-parametric</strong> tests are recommended.</p>
<div id="how-to-assess-the-normality-of-the-data" class="section level2">
<h2>How to assess the normality of the data?</h2>
<ol style="list-style-type: decimal">
<li><p>With <strong>large enough sample sizes</strong> (n > 30), a violation of the normality assumption should not cause major problems (central limit theorem). This implies that parametric tests can often still be used even when the data are not exactly normal.</p></li>
<li><p>However, to be consistent, we can use the <strong>Shapiro-Wilk significance test</strong>, which compares the sample distribution to a normal one in order to ascertain whether or not the data show a serious deviation from normality.</p></li>
</ol>
</div>
<div id="how-to-assess-the-equality-of-variances" class="section level2">
<h2>How to assess the equality of variances?</h2>
<p>The standard <strong>Student’s t-test</strong> (comparing two independent samples) and the ANOVA test (comparing multiple samples) also assume that the samples to be compared have equal variances.</p>
<p>If the samples being compared follow a normal distribution, then it’s possible to use:</p>
<ul>
<li><strong>F-test</strong> to compare the variances of two samples
</li>
<li><strong>Bartlett’s Test</strong> or <strong>Levene’s Test</strong> to compare the variances of multiple samples.</li>
</ul>
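<p>A minimal sketch of these variance tests in base R is given below, using the ToothGrowth data set as an assumed example. Levene’s test is not part of base R (it is available, for instance, as <strong>leveneTest</strong>() in the <strong>car</strong> package), so it is not shown here; the Fligner-Killeen test is included as a non-parametric alternative.</p>
<pre class="r"><code>my_data <- ToothGrowth

# F-test: compare the variances of two groups
var.test(len ~ supp, data = my_data)

# Bartlett's test: compare the variances of more than two groups
bartlett.test(len ~ factor(dose), data = my_data)

# Fligner-Killeen test: non-parametric alternative
fligner.test(len ~ factor(dose), data = my_data)</code></pre>
<p>In each case, a p-value above the chosen significance level means there is no evidence against equal variances.</p>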
</div>
</div>
<div id="infos" class="section level1">
<h1>Infos</h1>
<p><span class="warning"> This analysis has been performed using <strong>R software</strong> (ver. 3.2.4). </span></p>
</div>
<script>jQuery(document).ready(function () {
    jQuery('#rdoc h1').addClass('wiki_paragraph1');
    jQuery('#rdoc h2').addClass('wiki_paragraph2');
    jQuery('#rdoc h3').addClass('wiki_paragraph3');
    jQuery('#rdoc h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->

<!-- END HTML -->]]></description>
			<pubDate>Wed, 28 Sep 2016 21:09:13 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Descriptive Statistics and Graphics]]></title>
			<link>https://www.sthda.com/english/wiki/descriptive-statistics-and-graphics</link>
			<guid>https://www.sthda.com/english/wiki/descriptive-statistics-and-graphics</guid>
			<description><![CDATA[<!-- START HTML -->
  
            
  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#import-your-data-into-r">Import your data into R</a></li>
<li><a href="#check-your-data">Check your data</a></li>
<li><a href="#r-functions-for-computing-descriptive-statistics">R functions for computing descriptive statistics</a></li>
<li><a href="#descriptive-statistics-for-a-single-group">Descriptive statistics for a single group</a><ul>
<li><a href="#measure-of-central-tendency-mean-median-mode">Measure of central tendency: mean, median, mode</a></li>
<li><a href="#measure-of-variablity">Measures of variability</a><ul>
<li><a href="#range-minimum-maximum">Range: minimum &amp; maximum</a></li>
<li><a href="#interquartile-range">Interquartile range</a></li>
<li><a href="#variance-and-standard-deviation">Variance and standard deviation</a></li>
<li><a href="#median-absolute-deviation">Median absolute deviation</a></li>
<li><a href="#which-measure-to-use">Which measure to use?</a></li>
</ul></li>
<li><a href="#computing-an-overall-summary-of-a-variable-and-an-entire-data-frame">Computing an overall summary of a variable and an entire data frame</a><ul>
<li><a href="#summary-function">summary() function</a></li>
<li><a href="#sapply-function">sapply() function</a></li>
<li><a href="#stat.desc-function">stat.desc() function</a></li>
</ul></li>
<li><a href="#case-of-missing-values">Case of missing values</a></li>
<li><a href="#graphical-display-of-distributions">Graphical display of distributions</a><ul>
<li><a href="#installation-and-loading-ggpubr">Installation and loading ggpubr</a></li>
<li><a href="#box-plots">Box plots</a></li>
<li><a href="#histogram">Histogram</a></li>
<li><a href="#empirical-cumulative-distribution-function-ecdf">Empirical cumulative distribution function (ECDF)</a></li>
<li><a href="#q-q-plots">Q-Q plots</a></li>
</ul></li>
</ul></li>
<li><a href="#descriptive-statistics-by-groups">Descriptive statistics by groups</a></li>
<li><a href="#frequency-tables">Frequency tables</a><ul>
<li><a href="#create-some-data">Create some data</a></li>
<li><a href="#simple-frequency-distribution-one-categorical-variable">Simple frequency distribution: one categorical variable</a></li>
<li><a href="#two-way-contingency-table-two-categorical-variables">Two-way contingency table: Two categorical variables</a></li>
<li><a href="#multiway-tables-more-than-two-categorical-variables">Multiway tables: More than two categorical variables</a></li>
<li><a href="#compute-table-margins-and-relative-frequency">Compute table margins and relative frequency</a></li>
</ul></li>
<li><a href="#infos">Infos</a></li>
</ul>
</div>

<p><br/></p>
<br/>
<div class="block">
<strong>Descriptive statistics</strong> consist of describing the data simply, using some <strong>summary statistics</strong> and graphics. Here, we’ll describe how to compute summary statistics using <strong>R</strong> software.
</div>
<p><br/></p>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/images/descriptive-statistics.png" alt="Descriptive statistics" /><p class="caption">Descriptive statistics</p>
</div>
<div id="import-your-data-into-r" class="section level1">
<h1>Import your data into R</h1>
<ol style="list-style-type: decimal">
<li><p><strong>Prepare your data</strong> as specified here: <a href="https://www.sthda.com/english/english/wiki/best-practices-for-preparing-your-data-set-for-r">Best practices for preparing your data set for R</a></p></li>
<li><p><strong>Save your data</strong> in an external tab-delimited .txt file or a .csv file</p></li>
<li><p><strong>Import your data into R</strong> as follows:</p></li>
</ol>
<pre class="r"><code># If .txt tab file, use this
my_data <- read.delim(file.choose())

# Or, if .csv file, use this
my_data <- read.csv(file.choose())</code></pre>
<p>Here, we’ll use the built-in R data set named <em>iris</em>.</p>
<pre class="r"><code># Store the data in the variable my_data
my_data <- iris</code></pre>
</div>
<div id="check-your-data" class="section level1">
<h1>Check your data</h1>
<p>You can inspect your data using the functions <strong>head</strong>() and <strong>tail</strong>(), which display the first and the last part of the data, respectively.</p>
<pre class="r"><code># Print the first 6 rows
head(my_data, 6)</code></pre>
<pre><code>  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa</code></pre>
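<p>For completeness, the last rows and the overall structure of the data can be inspected in the same way. The snippet below is a small illustrative addition using only base R functions:</p>
<pre class="r"><code>my_data <- iris

# Print the last 6 rows
tail(my_data, 6)

# Number of rows and columns
dim(my_data)

# Type of each column
str(my_data)</code></pre>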
</div>
<div id="r-functions-for-computing-descriptive-statistics" class="section level1">
<h1>R functions for computing descriptive statistics</h1>
<p>Some R functions for computing descriptive statistics:</p>
<table>
<thead>
<tr class="header">
<th align="left">Description</th>
<th align="left">R function</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left"><strong>Mean</strong></td>
<td align="left"><strong>mean</strong>()</td>
</tr>
<tr class="even">
<td align="left"><strong>Standard deviation</strong></td>
<td align="left"><strong>sd</strong>()</td>
</tr>
<tr class="odd">
<td align="left"><strong>Variance</strong></td>
<td align="left"><strong>var</strong>()</td>
</tr>
<tr class="even">
<td align="left"><strong>Minimum</strong></td>
<td align="left"><strong>min</strong>()</td>
</tr>
<tr class="odd">
<td align="left"><strong>Maximum</strong></td>
<td align="left"><strong>max</strong>()</td>
</tr>
<tr class="even">
<td align="left"><strong>Median</strong></td>
<td align="left"><strong>median</strong>()</td>
</tr>
<tr class="odd">
<td align="left"><strong>Range of values</strong> (minimum and maximum)</td>
<td align="left"><strong>range</strong>()</td>
</tr>
<tr class="even">
<td align="left"><strong>Sample quantiles</strong></td>
<td align="left"><strong>quantile</strong>()</td>
</tr>
<tr class="odd">
<td align="left"><strong>Generic function</strong></td>
<td align="left"><strong>summary</strong>()</td>
</tr>
<tr class="even">
<td align="left"><strong>Interquartile range</strong></td>
<td align="left"><strong>IQR</strong>()</td>
</tr>
</tbody>
</table>
<p><span class="notice">The function <strong>mfv</strong>(), for most frequent value, [in <strong>modeest</strong> package] can be used to find the statistical mode of a numeric vector. </span></p>
</div>
<div id="descriptive-statistics-for-a-single-group" class="section level1">
<h1>Descriptive statistics for a single group</h1>
<div id="measure-of-central-tendency-mean-median-mode" class="section level2">
<h2>Measure of central tendency: mean, median, mode</h2>
<p>Roughly speaking, the central tendency measures the “average” or the “middle” of your data. The most commonly used measures include:</p>
<ul>
<li>the mean: the average value. It’s sensitive to outliers.</li>
<li>the median: the middle value. It’s a robust alternative to mean.</li>
<li>and the mode: the most frequent value</li>
</ul>
<p>In R,</p>
<ul>
<li>The functions <strong>mean</strong>() and <strong>median</strong>() can be used to compute the mean and the median, respectively;</li>
<li>The function <strong>mfv</strong>() [in the <strong>modeest</strong> R package] can be used to compute the mode of a variable.</li>
</ul>
<p>The R code below computes the mean, median and the mode of the variable <em>Sepal.Length</em> [in <em>my_data</em> data set]:</p>
<pre class="r"><code># Compute the mean value
mean(my_data$Sepal.Length)</code></pre>
<pre><code>[1] 5.843333</code></pre>
<pre class="r"><code># Compute the median value
median(my_data$Sepal.Length)</code></pre>
<pre><code>[1] 5.8</code></pre>
<pre class="r"><code># Compute the mode
# install.packages("modeest")
library(modeest)
mfv(my_data$Sepal.Length)</code></pre>
<pre><code>[1] 5</code></pre>
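<p>If the <strong>modeest</strong> package is not installed, the mode can also be obtained with base R. The following is a minimal sketch based on <strong>table</strong>() (the names <em>counts</em> and <em>mode_value</em> are chosen for this example); when several values are tied for the highest count, all of them are returned.</p>
<pre class="r"><code>x <- iris$Sepal.Length

# Count how many times each distinct value occurs
counts <- table(x)

# Keep the value(s) with the highest count
mode_value <- as.numeric(names(counts)[counts == max(counts)])
mode_value</code></pre>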
</div>
<div id="measure-of-variablity" class="section level2">
<h2>Measures of variability</h2>
<p>Measures of variability describe how “spread out” the data are.</p>
<div id="range-minimum-maximum" class="section level3">
<h3>Range: minimum &amp; maximum</h3>
<ul>
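<li><strong>Range</strong> corresponds to the largest value minus the smallest value. It gives you the full spread of the data. Note that the R function <strong>range</strong>() returns the minimum and the maximum rather than their difference.</li>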
</ul>
<pre class="r"><code># Compute the minimum value
min(my_data$Sepal.Length)</code></pre>
<pre><code>[1] 4.3</code></pre>
<pre class="r"><code># Compute the maximum value
max(my_data$Sepal.Length)</code></pre>
<pre><code>[1] 7.9</code></pre>
<pre class="r"><code># Range
range(my_data$Sepal.Length)</code></pre>
<pre><code>[1] 4.3 7.9</code></pre>
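<p>To obtain the range as a single number (largest minus smallest value), the output of <strong>range</strong>() can be passed to <strong>diff</strong>(). A short sketch:</p>
<pre class="r"><code>x <- iris$Sepal.Length

# Width of the range: maximum minus minimum
diff(range(x))

# Equivalent, written out explicitly
max(x) - min(x)</code></pre>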
</div>
<div id="interquartile-range" class="section level3">
<h3>Interquartile range</h3>
<p>Recall that quartiles divide the data into 4 parts. The <strong>interquartile range</strong> (IQR), which corresponds to the difference between the third and the first quartiles, is sometimes used as a robust alternative to the standard deviation.</p>
<ul>
<li>R function:</li>
</ul>
<pre class="r"><code>quantile(x, probs = seq(0, 1, 0.25))</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>x</strong>: numeric vector whose sample quantiles are wanted.</li>
<li><strong>probs</strong>: numeric vector of probabilities with values in [0,1].</li>
</ul>
</div>
<p><br/></p>
<ul>
<li>Example:</li>
</ul>
<pre class="r"><code>quantile(my_data$Sepal.Length)</code></pre>
<pre><code>  0%  25%  50%  75% 100% 
 4.3  5.1  5.8  6.4  7.9 </code></pre>
<p><span class="success">By default, the function returns the minimum, the maximum and the three <strong>quartiles</strong> (the 0.25, 0.50 and 0.75 quantiles).</span></p>
<p>To compute deciles (0.1, 0.2, 0.3, …, 0.9), use this:</p>
<pre class="r"><code>quantile(my_data$Sepal.Length, seq(0, 1, 0.1))</code></pre>
<p>To compute the interquartile range, type this:</p>
<pre class="r"><code>IQR(my_data$Sepal.Length)</code></pre>
<pre><code>[1] 1.3</code></pre>
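<p>To make the link with the quartiles explicit, the same value can be recomputed from the output of <strong>quantile</strong>(). This is just an illustrative check; the name <em>q</em> is chosen for the example.</p>
<pre class="r"><code>x <- iris$Sepal.Length

# First and third quartiles
q <- quantile(x, probs = c(0.25, 0.75))

# Third quartile minus first quartile
unname(q[2] - q[1])

# Same value as returned by IQR()
IQR(x)</code></pre>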
</div>
<div id="variance-and-standard-deviation" class="section level3">
<h3>Variance and standard deviation</h3>
<p>The variance represents the average squared deviation from the mean. The standard deviation is the square root of the variance. It measures the average deviation of the values, in the data, from the mean value.</p>
<pre class="r"><code># Compute the variance
var(my_data$Sepal.Length)
# Compute the standard deviation =
# square root of the variance
sd(my_data$Sepal.Length)</code></pre>
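<p>The definitions can be checked by hand. In the sketch below (variable names chosen for illustration), note that <strong>var</strong>() computes the sample variance, so the sum of squared deviations is divided by n - 1 rather than by n:</p>
<pre class="r"><code>x <- iris$Sepal.Length
n <- length(x)

# Sample variance: sum of squared deviations divided by (n - 1)
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
manual_sd <- sqrt(manual_var)

# Both agree with the built-in functions
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))</code></pre>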
</div>
<div id="median-absolute-deviation" class="section level3">
<h3>Median absolute deviation</h3>
<p>The median absolute deviation (MAD) measures the deviation of the values, in the data, from the median value.</p>
<pre class="r"><code># Compute the median
median(my_data$Sepal.Length)
# Compute the median absolute deviation
mad(my_data$Sepal.Length)</code></pre>
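<p>By default, <strong>mad</strong>() multiplies the raw median absolute deviation by the constant 1.4826, which makes it comparable to the standard deviation for normally distributed data. The following sketch (with the illustrative name <em>raw_mad</em>) spells this out:</p>
<pre class="r"><code>x <- iris$Sepal.Length

# Raw MAD: median of the absolute deviations from the median
raw_mad <- median(abs(x - median(x)))

# mad() applies the scaling constant 1.4826 by default
all.equal(1.4826 * raw_mad, mad(x))

# Unscaled version
mad(x, constant = 1)</code></pre>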
</div>
<div id="which-measure-to-use" class="section level3">
<h3>Which measure to use?</h3>
<ul>
<li><strong>Range</strong>. It’s not often used because it’s very sensitive to outliers.</li>
<li><strong>Interquartile range</strong>. It’s pretty robust to outliers. It’s used a lot in combination with the median.</li>
<li><strong>Variance</strong>. It’s hard to interpret directly because it is expressed in squared units rather than in the units of the data. It’s mostly used as a mathematical tool.</li>
<li><strong>Standard deviation</strong>. This is the square root of the variance. It’s expressed in the same units as the data. The standard deviation is often used in the situation where the mean is the measure of central tendency.</li>
<li><strong>Median absolute deviation</strong>. It’s a robust way to estimate the standard deviation, for data with outliers. It’s not used very often.</li>
</ul>
<p><span class="success">In summary, the IQR and the standard deviation are the two most common measures used to report the variability of the data.</span></p>
</div>
</div>
<div id="computing-an-overall-summary-of-a-variable-and-an-entire-data-frame" class="section level2">
<h2>Computing an overall summary of a variable and an entire data frame</h2>
<div id="summary-function" class="section level3">
<h3>summary() function</h3>
<p><span class="success">The function <strong>summary</strong>() can be used to display several statistic summaries of either one variable or an entire data frame.</span></p>
<ul>
<li><strong>Summary of a single variable</strong>. Six values are returned: the minimum, the first quartile, the median, the mean, the third quartile and the maximum, in a single function call:</li>
</ul>
<pre class="r"><code>summary(my_data$Sepal.Length)</code></pre>
<pre><code>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900 </code></pre>
<ul>
<li><strong>Summary of a data frame</strong>. In this case, the function <strong>summary</strong>() is automatically applied to each column. The format of the result depends on the type of the data contained in the column. For example:
<ul>
<li>If the column is a numeric variable, mean, median, min, max and quartiles are returned.</li>
<li>If the column is a factor variable, the number of observations in each group is returned.</li>
</ul></li>
</ul>
<pre class="r"><code>summary(my_data, digits = 1)</code></pre>
<pre><code>  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width        Species  
 Min.   :4     Min.   :2    Min.   :1     Min.   :0.1   setosa    :50  
 1st Qu.:5     1st Qu.:3    1st Qu.:2     1st Qu.:0.3   versicolor:50  
 Median :6     Median :3    Median :4     Median :1.3   virginica :50  
 Mean   :6     Mean   :3    Mean   :4     Mean   :1.2                  
 3rd Qu.:6     3rd Qu.:3    3rd Qu.:5     3rd Qu.:1.8                  
 Max.   :8     Max.   :4    Max.   :7     Max.   :2.5                  </code></pre>
</div>
<div id="sapply-function" class="section level3">
<h3>sapply() function</h3>
<p><span class="success"> It’s also possible to use the function <strong>sapply</strong>() to apply a particular function over a list or vector. For instance, we can use it to compute, for each column in a data frame, the mean, sd, var, min, quantile, …</span></p>
<pre class="r"><code># Compute the mean of each column
sapply(my_data[, -5], mean)</code></pre>
<pre><code>Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.843333     3.057333     3.758000     1.199333 </code></pre>
<pre class="r"><code># Compute quartiles
sapply(my_data[, -5], quantile)</code></pre>
<pre><code>     Sepal.Length Sepal.Width Petal.Length Petal.Width
0%            4.3         2.0         1.00         0.1
25%           5.1         2.8         1.60         0.3
50%           5.8         3.0         4.35         1.3
75%           6.4         3.3         5.10         1.8
100%          7.9         4.4         6.90         2.5</code></pre>
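<p>Several statistics can be computed in one call by passing a small helper function to <strong>sapply</strong>(). The sketch below is one possible way to do it; the choice of statistics and the name <em>stats</em> are only illustrative.</p>
<pre class="r"><code>my_data <- iris

# One row per statistic, one column per numeric variable
stats <- sapply(my_data[, -5], function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
})
round(stats, 2)</code></pre>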
</div>
<div id="stat.desc-function" class="section level3">
<h3>stat.desc() function</h3>
<p>The function <strong>stat.desc</strong>() [in the <strong>pastecs</strong> package] provides other useful statistics, including:</p>
<ul>
<li>the median</li>
<li>the mean</li>
<li>the standard error on the mean (SE.mean)</li>
<li>the confidence interval of the mean (CI.mean) at the p level (default is 0.95)</li>
<li>the variance (var)</li>
<li>the standard deviation (std.dev)</li>
<li>and the variation coefficient (coef.var) defined as the standard deviation divided by the mean</li>
</ul>
<ul>
<li>Install the <strong>pastecs</strong> package</li>
</ul>
<pre class="r"><code>install.packages("pastecs")</code></pre>
<ul>
<li>Use the function <strong>stat.desc</strong>() to compute descriptive statistics</li>
</ul>
<pre class="r"><code># Compute descriptive statistics
library(pastecs)
res <- stat.desc(my_data[, -5])
round(res, 2)</code></pre>
<pre><code>             Sepal.Length Sepal.Width Petal.Length Petal.Width
nbr.val            150.00      150.00       150.00      150.00
nbr.null             0.00        0.00         0.00        0.00
nbr.na               0.00        0.00         0.00        0.00
min                  4.30        2.00         1.00        0.10
max                  7.90        4.40         6.90        2.50
range                3.60        2.40         5.90        2.40
sum                876.50      458.60       563.70      179.90
median               5.80        3.00         4.35        1.30
mean                 5.84        3.06         3.76        1.20
SE.mean              0.07        0.04         0.14        0.06
CI.mean.0.95         0.13        0.07         0.28        0.12
var                  0.69        0.19         3.12        0.58
std.dev              0.83        0.44         1.77        0.76
coef.var             0.14        0.14         0.47        0.64</code></pre>
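<p>Some of these statistics can be reproduced by hand in base R, which may help when reading the table. The sketch below assumes that <code>my_data</code> is the built-in <code>iris</code> data set and that CI.mean is the half-width of the confidence interval based on Student's t distribution; under those assumptions the rounded values agree with the Sepal.Length column above.</p>
<pre class="r"><code>x <- iris$Sepal.Length             # assumption: my_data is iris
n <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean
ci <- qt(0.975, df = n - 1) * se   # half-width of the 95% confidence interval
cv <- sd(x) / mean(x)              # variation coefficient
round(c(SE.mean = se, CI.mean.0.95 = ci, coef.var = cv), 2)</code></pre>
<p>The 95% confidence interval of the mean is then <code>mean(x) - ci</code> to <code>mean(x) + ci</code>.</p>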
</div>
</div>
<div id="case-of-missing-values" class="section level2">
<h2>Case of missing values</h2>
<p><span class="warning">Note that when the data contain missing values, some R functions return an error or NA even if only a single value is missing.</span></p>
<p>For example, the <strong>mean()</strong> function returns NA if even a single value is missing from a vector. This can be avoided with the argument <strong>na.rm = TRUE</strong>, which tells the function to remove any NAs before the calculation. An example using the <strong>mean</strong> function is as follows:</p>
<pre class="r"><code>mean(my_data$Sepal.Length, na.rm = TRUE)</code></pre>
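<p>Since the data used here appear to contain no missing values, the effect is easier to see on a small vector. The vector <code>x</code> below is hypothetical and only serves as an illustration.</p>
<pre class="r"><code>x <- c(4.2, NA, 5.1, 6.3)   # hypothetical vector with one missing value
mean(x)                     # NA
mean(x, na.rm = TRUE)       # 5.2
sum(is.na(x))               # number of missing values: 1</code></pre>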
</div>
<div id="graphical-display-of-distributions" class="section level2">
<h2>Graphical display of distributions</h2>
<p>The R package <a href="https://www.sthda.com/english/english/wiki/ggpubr-r-package-ggplot2-based-publication-ready-plots"><strong>ggpubr</strong></a> will be used to create graphs.</p>
<div id="installation-and-loading-ggpubr" class="section level3">
<h3>Installation and loading ggpubr</h3>
<ul>
<li>Install the latest version from GitHub as follows:</li>
</ul>
<pre class="r"><code># Install
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")</code></pre>
<ul>
<li>Or, install from CRAN as follows:</li>
</ul>
<pre class="r"><code>install.packages("ggpubr")</code></pre>
<ul>
<li>Load ggpubr as follows:</li>
</ul>
<pre class="r"><code>library(ggpubr)</code></pre>
</div>
<div id="box-plots" class="section level3">
<h3>Box plots</h3>
<pre class="r"><code>ggboxplot(my_data, y = "Sepal.Length", width = 0.5)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/descriptive-statistics-box-plots-1.png" width="259.2" style="margin-bottom:10px;" /></p>
</div>
<div id="histogram" class="section level3">
<h3>Histogram</h3>
<br/>
<div class="block">
Histograms show the number of observations that fall within specified divisions (i.e., bins).
</div>
<p><br/></p>
<p>Histogram plot of Sepal.Length with mean line (dashed line).</p>
<pre class="r"><code>gghistogram(my_data, x = "Sepal.Length", bins = 9, 
             add = "mean")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/descriptive-statistics-histogram-1.png" width="259.2" style="margin-bottom:10px;" /></p>
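<p>The counts behind a histogram can also be obtained without plotting. The following is a minimal sketch in base R, assuming that <code>my_data</code> is the built-in <code>iris</code> data set; the nine equal-width bins are built here by hand and need not coincide exactly with the bin edges drawn by gghistogram().</p>
<pre class="r"><code>x <- iris$Sepal.Length   # assumption: my_data is iris
# Ten break points define nine equal-width bins over the range of x
breaks <- seq(min(x), max(x), length.out = 10)
# Assign each observation to a bin, then count the observations per bin
counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))
counts</code></pre>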
</div>
<div id="empirical-cumulative-distribution-function-ecdf" class="section level3">
<h3>Empirical cumulative distribution function (ECDF)</h3>
<br/>
<div class="block">
ECDF is the fraction of data smaller than or equal to x.
</div>
<p><br/></p>
<pre class="r"><code>ggecdf(my_data, x = "Sepal.Length")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/descriptive-statistics-ecdf-1.png" width="259.2" style="margin-bottom:10px;" /></p>
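<p>The definition above can be checked numerically with the base R function <strong>ecdf</strong>(), which returns the ECDF as a function. This sketch assumes that <code>my_data</code> is the built-in <code>iris</code> data set; the cut-off 5.8 is an arbitrary value chosen for illustration.</p>
<pre class="r"><code>x <- iris$Sepal.Length   # assumption: my_data is iris
Fn <- ecdf(x)            # step function returned by ecdf()
Fn(5.8)                  # fraction of values smaller than or equal to 5.8
mean(x <= 5.8)           # the same fraction computed directly</code></pre>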
</div>
<div id="q-q-plots" class="section level3">
<h3>Q-Q plots</h3>
<br/>
<div class="block">
Q-Q plots are used to check whether the data are normally distributed.
</div>
<p><br/></p>
<pre class="r"><code>ggqqplot(my_data, x = "Sepal.Length")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/descriptive-statistics-qqplots-1.png" width="259.2" style="margin-bottom:10px;" /></p>
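<p>The points of a Q-Q plot pair each sample value with the corresponding theoretical quantile of the normal distribution. As a rough sketch, assuming that <code>my_data</code> is the built-in <code>iris</code> data set, the base R function <strong>qqnorm</strong>() returns these pairs; their correlation is used here only as an informal indication of how close the points lie to a straight line, not as a formal test.</p>
<pre class="r"><code>x <- iris$Sepal.Length             # assumption: my_data is iris
qq <- qqnorm(x, plot.it = FALSE)   # qq$x: theoretical quantiles, qq$y: sample values
cor(qq$x, qq$y)                    # close to 1 when the points follow a straight line</code></pre>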
</div>
</div>
</div>
<div id="descriptive-statistics-by-groups" class="section level1">
<h1>Descriptive statistics by groups</h1>
<p>To compute summary statistics by groups, the functions <strong>group_by</strong>() and <strong>summarise</strong>() [in <strong>dplyr</strong> package] can be used.</p>
<ul>
<li>We want to group the data by <em>Species</em> and then:
<ul>
<li>compute the number of elements in each group. R function: <strong>n</strong>()</li>
<li>compute the mean. R function: <strong>mean</strong>()</li>
<li>and the standard deviation. R function: <strong>sd</strong>()</li>
</ul></li>
</ul>
<p><span class="notice">The operator <strong>%>%</strong> is used to chain operations.</span></p>
<ul>
<li>Install <strong>dplyr</strong> as follows:</li>
</ul>
<pre class="r"><code>install.packages("dplyr")</code></pre>
<ul>
<li>Descriptive statistics by groups:</li>
</ul>
<pre class="r"><code>library(dplyr)
group_by(my_data, Species) %>% 
summarise(
  count = n(), 
  mean = mean(Sepal.Length, na.rm = TRUE),
  sd = sd(Sepal.Length, na.rm = TRUE)
  )</code></pre>
<pre><code>Source: local data frame [3 x 4]

     Species count  mean        sd
      (fctr) (int) (dbl)     (dbl)
1     setosa    50 5.006 0.3524897
2 versicolor    50 5.936 0.5161711
3  virginica    50 6.588 0.6358796</code></pre>
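<p>The same grouped statistics can be obtained without dplyr. The sketch below uses the base R function <strong>aggregate</strong>() and assumes that <code>my_data</code> is the built-in <code>iris</code> data set; the name <code>res</code> is chosen for this example only.</p>
<pre class="r"><code># Assumption: my_data is the built-in iris data set
res <- aggregate(Sepal.Length ~ Species, data = iris,
                 FUN = function(x) c(count = length(x), mean = mean(x), sd = sd(x)))
res</code></pre>
<p>Here the three statistics are stored in a matrix column of the result, so the group means can be extracted with <code>res$Sepal.Length[, "mean"]</code>.</p>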
<ul>
<li>Graphics for grouped data:</li>
</ul>
<pre class="r"><code>library("ggpubr")
# Box plot colored by groups: Species
ggboxplot(my_data, x = "Species", y = "Sepal.Length",
          color = "Species",
          palette = c("#00AFBB", "#E7B800", "#FC4E07"))</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/descriptive-statistics-grouped-data-1.png" width="384" style="margin-bottom:10px;" /></p>
<pre class="r"><code># Stripchart colored by groups: Species
ggstripchart(my_data, x = "Species", y = "Sepal.Length",
          color = "Species",
          palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          add = "mean_sd")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/descriptive-statistics-grouped-data-2.png" width="384" style="margin-bottom:10px;" /></p>
<p><span class="warning">Note that when the number of observations per group is small, a <strong>strip chart</strong> is recommended over a box plot, because it shows every individual observation.</span></p>
</div>
<div id="frequency-tables" class="section level1">
<h1>Frequency tables</h1>
<p>A frequency table (or contingency table) is used to describe categorical variables. It contains the counts at each combination of factor levels.</p>
<p>R function to generate tables: <strong>table</strong>()</p>
<div id="create-some-data" class="section level2">
<h2>Create some data</h2>
<p>Distribution of hair and eye color by sex of 592 students:</p>
<pre class="r"><code># Hair/eye color data
df <- as.data.frame(HairEyeColor)
hair_eye_col <- df[rep(row.names(df), df$Freq), 1:3]
rownames(hair_eye_col) <- 1:nrow(hair_eye_col)
head(hair_eye_col)</code></pre>
<pre><code>   Hair   Eye  Sex
1 Black Brown Male
2 Black Brown Male
3 Black Brown Male
4 Black Brown Male
5 Black Brown Male
6 Black Brown Male</code></pre>
<pre class="r"><code># hair/eye variables
Hair <- hair_eye_col$Hair
Eye <- hair_eye_col$Eye</code></pre>
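<p>The call to rep() above expands the table of counts into one row per student by repeating each row as many times as its Freq value. The same idea is sketched below on a toy table; the data frame <code>freq</code> and its counts are hypothetical.</p>
<pre class="r"><code># Toy frequency table (hypothetical counts)
freq <- data.frame(colour = c("red", "blue"), Freq = c(2, 3))
# Repeat each row index Freq times and keep only the category column
long <- freq[rep(seq_len(nrow(freq)), freq$Freq), "colour", drop = FALSE]
rownames(long) <- NULL
long
# Counting the expanded rows recovers the original frequencies
table(long$colour)</code></pre>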
</div>
<div id="simple-frequency-distribution-one-categorical-variable" class="section level2">
<h2>Simple frequency distribution: one categorical variable</h2>
<ul>
<li>Table of counts</li>
</ul>
<pre class="r"><code># Frequency distribution of hair color
table(Hair)</code></pre>
<pre><code>Hair
Black Brown   Red Blond 
  108   286    71   127 </code></pre>
<pre class="r"><code># Frequency distribution of eye color
table(Eye)</code></pre>
<pre><code>Eye
Brown  Blue Hazel Green 
  220   215    93    64 </code></pre>
<ul>
<li>Graphics: to create the graphics, we start by converting the table to a data frame.</li>
</ul>
<pre class="r"><code># Compute the table and convert it to a data frame
df <- as.data.frame(table(Hair))
df</code></pre>
<pre><code>   Hair Freq
1 Black  108
2 Brown  286
3   Red   71
4 Blond  127</code></pre>
<pre class="r"><code># Visualize using bar plot
library(ggpubr)
ggbarplot(df, x = "Hair", y = "Freq")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/descriptive-statistics-unnamed-chunk-25-1.png" width="259.2" style="margin-bottom:10px;" /></p>
</div>
<div id="two-way-contingency-table-two-categorical-variables" class="section level2">
<h2>Two-way contingency table: Two categorical variables</h2>
<pre class="r"><code>tbl2 <- table(Hair , Eye)
tbl2</code></pre>
<pre><code>       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16</code></pre>
<p><span class="warning">It’s also possible to use the function <strong>xtabs</strong>(), which will create cross tabulation of data frames with a formula interface.</span></p>
<pre class="r"><code>xtabs(~ Hair + Eye, data = hair_eye_col)</code></pre>
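<p>As a quick check, the two approaches can be compared directly. The sketch below rebuilds <code>hair_eye_col</code> as in the section "Create some data" so that it runs on its own; the names <code>t1</code> and <code>t2</code> are chosen for this example only.</p>
<pre class="r"><code># Rebuild the data as in "Create some data"
df <- as.data.frame(HairEyeColor)
hair_eye_col <- df[rep(row.names(df), df$Freq), 1:3]
# Same two-way table, built with table() and with xtabs()
t1 <- table(hair_eye_col$Hair, hair_eye_col$Eye)
t2 <- xtabs(~ Hair + Eye, data = hair_eye_col)
all(t1 == t2)   # TRUE when both tables contain the same counts</code></pre>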
<ul>
<li>Graphics: to create the graphics, we start by converting the table to a data frame.</li>
</ul>
<pre class="r"><code>df <- as.data.frame(tbl2)
head(df)</code></pre>
<pre><code>   Hair   Eye Freq
1 Black Brown   68
2 Brown Brown  119
3   Red Brown   26
4 Blond Brown    7
5 Black  Blue   20
6 Brown  Blue   84</code></pre>
<pre class="r"><code># Visualize using bar plot
library(ggpubr)
ggbarplot(df, x = "Hair", y = "Freq",
          color = "Eye", 
          palette = c("brown", "blue", "gold", "green"))</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/descriptive-statistics-barplot-1.png" width="384" style="margin-bottom:10px;" /></p>
<pre class="r"><code># position dodge
ggbarplot(df, x = "Hair", y = "Freq",
          color = "Eye", position = position_dodge(),
          palette = c("brown", "blue", "gold", "green"))</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/statistics/descriptive-statistics-barplot2-1.png" width="480" style="margin-bottom:10px;" /></p>
</div>
<div id="multiway-tables-more-than-two-categorical-variables" class="section level2">
<h2>Multiway tables: More than two categorical variables</h2>
<ul>
<li>Hair and Eye color distributions by sex using <strong>xtabs</strong>():</li>
</ul>
<pre class="r"><code>xtabs(~Hair + Eye + Sex, data = hair_eye_col)</code></pre>
<pre><code>, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8</code></pre>
<ul>
<li>You can also use the function <strong>ftable</strong>() [for flat contingency tables]. With more than two variables, it returns a more compact output than xtabs():</li>
</ul>
<pre class="r"><code>ftable(Sex + Hair ~ Eye, data = hair_eye_col)</code></pre>
<pre><code>      Sex   Male                 Female                
      Hair Black Brown Red Blond  Black Brown Red Blond
Eye                                                    
Brown         32    53  10     3     36    66  16     4
Blue          11    50  10    30      9    34   7    64
Hazel         10    25   7     5      5    29   7     5
Green          3    15   7     8      2    14   7     8</code></pre>
</div>
<div id="compute-table-margins-and-relative-frequency" class="section level2">
<h2>Compute table margins and relative frequency</h2>
<p><span class="success"><strong>Table margins</strong> correspond to the sums of counts along rows or columns of the table. <strong>Relative frequencies</strong> express table entries as proportions of table margins (i.e., row or column totals).</span></p>
<p>The functions <strong>margin.table</strong>() and <strong>prop.table</strong>() can be used to compute table margins and relative frequencies, respectively.</p>
<ol style="list-style-type: decimal">
<li><strong>Format of the functions</strong>:</li>
</ol>
<pre class="r"><code>margin.table(x, margin = NULL)

prop.table(x, margin = NULL)</code></pre>
<ul>
<li><strong>x</strong>: table</li>
<li><strong>margin</strong>: index number (1 for rows and 2 for columns)</li>
</ul>
<ol start="2" style="list-style-type: decimal">
<li><strong>Compute table margins</strong>:</li>
</ol>
<pre class="r"><code>Hair <- hair_eye_col$Hair
Eye <- hair_eye_col$Eye
# Hair/Eye color table
he.tbl <- table(Hair, Eye)
he.tbl</code></pre>
<pre><code>       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16</code></pre>
<pre class="r"><code># Margin of rows
margin.table(he.tbl, 1)</code></pre>
<pre><code>Hair
Black Brown   Red Blond 
  108   286    71   127 </code></pre>
<pre class="r"><code># Margin of columns
margin.table(he.tbl, 2)</code></pre>
<pre><code>Eye
Brown  Blue Hazel Green 
  220   215    93    64 </code></pre>
<ol start="3" style="list-style-type: decimal">
<li><strong>Compute relative frequencies</strong>:</li>
</ol>
<pre class="r"><code># Frequencies relative to row total
prop.table(he.tbl, 1)</code></pre>
<pre><code>       Eye
Hair         Brown       Blue      Hazel      Green
  Black 0.62962963 0.18518519 0.13888889 0.04629630
  Brown 0.41608392 0.29370629 0.18881119 0.10139860
  Red   0.36619718 0.23943662 0.19718310 0.19718310
  Blond 0.05511811 0.74015748 0.07874016 0.12598425</code></pre>
<pre class="r"><code># Table of percentages
round(prop.table(he.tbl, 1), 2)*100</code></pre>
<pre><code>       Eye
Hair    Brown Blue Hazel Green
  Black    63   19    14     5
  Brown    42   29    19    10
  Red      37   24    20    20
  Blond     6   74     8    13</code></pre>
<p>To express the frequencies relative to the grand total, use this:</p>
<pre class="r"><code>he.tbl/sum(he.tbl)</code></pre>
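<p>Calling <strong>prop.table</strong>() without a margin should give the same result, since the default <code>margin = NULL</code> divides by the grand total. The sketch below rebuilds the Hair/Eye table from the built-in HairEyeColor data so that it runs on its own; the name <code>p</code> is chosen for this example only.</p>
<pre class="r"><code># Hair x Eye counts, summed over Sex
he.tbl <- margin.table(HairEyeColor, c(1, 2))
# Proportions of the grand total
p <- prop.table(he.tbl)
all.equal(p, he.tbl / sum(he.tbl))
# Add row and column totals to the table of proportions
round(addmargins(p), 2)</code></pre>
<p>Frequencies relative to the column totals are obtained in the same way with <code>prop.table(he.tbl, 2)</code>.</p>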
</div>
</div>
<div id="infos" class="section level1">
<h1>Infos</h1>
<p><span class="warning"> This analysis has been performed using <strong>R software</strong> (ver. 3.2.4). </span></p>
</div>

<script>jQuery(document).ready(function () {
    jQuery('#rdoc h1').addClass('wiki_paragraph1');
    jQuery('#rdoc h2').addClass('wiki_paragraph2');
    jQuery('#rdoc h3').addClass('wiki_paragraph3');
    jQuery('#rdoc h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->

<!-- END HTML -->]]></description>
			<pubDate>Sun, 25 Sep 2016 19:00:06 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[R Basic Statistics]]></title>
			<link>https://www.sthda.com/english/wiki/r-basic-statistics</link>
			<guid>https://www.sthda.com/english/wiki/r-basic-statistics</guid>
			<description><![CDATA[This chapter describes how to <strong>perform statistical tests with R</strong>. Please scroll down to the bottom of this page to see the available articles.]]></description>
			<pubDate>Tue, 11 Nov 2014 09:18:38 +0100</pubDate>
			
		</item>
		
	</channel>
</rss>
