<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Wed, 13 May 2026 07:48:40 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Easy Guides]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/wiki/10" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles of the category: Genomics]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[fastqcr: An R Package Facilitating Quality Controls of Sequencing Data for Large Numbers of Samples]]></title>
			<link>https://www.sthda.com/english/wiki/fastqcr-an-r-package-facilitating-quality-controls-of-sequencing-data-for-large-numbers-of-samples</link>
			<guid>https://www.sthda.com/english/wiki/fastqcr-an-r-package-facilitating-quality-controls-of-sequencing-data-for-large-numbers-of-samples</guid>
			<description><![CDATA[<!-- START HTML -->

  <div id="rdoc">


<p><br/></p>
<div id="introduction" class="section level2">
<h2>Introduction</h2>
<p><strong>High throughput sequencing</strong> data can contain hundreds of millions of sequences (also known as reads).</p>
<p>The raw sequencing reads may contain PCR primers, adaptors, low quality bases, duplicates and other contaminants coming from the experimental protocols. As these may affect the results of downstream analysis, it’s essential to perform some <strong>quality control</strong> (QC) checks to ensure that the raw data looks good and there are no problems in your data.</p>
<p>The <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/"><strong>FastQC</strong></a> tool, written by Simon Andrews at the Babraham Institute, is the most widely used tool to perform <strong>quality control</strong> for high throughput sequence data. To learn more about the FastQC tool, see this <a href="https://www.youtube.com/watch?v=bz93ReOv87Y">Video Tutorial</a>.</p>
<p>It produces, for each sample, an HTML report and a ‘zip’ file, which contains the files fastqc_data.txt and summary.txt.</p>
<p>If you have hundreds of samples, you’re not going to open up each HTML page. You need some way of looking at these data in aggregate.</p>
<p>Therefore, we developed the <strong>fastqcr</strong> R package, which contains helper functions to easily and automatically parse, aggregate and analyze FastQC reports for large numbers of samples.</p>
<p>Additionally, the <strong>fastqcr</strong> package provides a convenient solution for building a multi-QC report and a one-sample FastQC report with the result interpretations. The online documentation is available at: <a href="https://www.sthda.com/english/rpkgs/fastqcr/" class="uri">https://www.sthda.com/english/rpkgs/fastqcr/</a>.</p>
<p>Examples of QC reports, generated automatically by the <strong>fastqcr</strong> R package, include:</p>
<ul>
<li><a href="https://www.sthda.com/english/rpkgs/fastqcr/qc-reports/fastqcr-multi-qc-report.html">Multi-QC report for multiple samples</a></li>
<li><a href="https://www.sthda.com/english/rpkgs/fastqcr/qc-reports/sample-qc-report-interpretation.html">One sample QC report (+ interpretation)</a></li>
<li><a href="https://www.sthda.com/english/rpkgs/fastqcr/qc-reports/sample-qc-report-without-interpretation.html">One sample QC report (no interpretation)</a></li>
</ul>
<p><img src="https://www.sthda.com/english/sthda/RDoc/images/fastqcr.png" alt="Main functions in the fastqcr package" /></p>
<div class="block">
<p>
In this article, we’ll demonstrate how to perform a quality control of sequencing data. We start by describing how to install and use the <strong>FastQC</strong> tool. Then, we’ll describe the <strong>fastqcr</strong> R package to easily aggregate and analyze FastQC reports for large numbers of samples.
</p>
</div>
<br/>
<p><strong>Contents</strong>:</p>

<div id="TOC">
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#installation-and-loading-fastqcr">Installation and loading fastqcr</a></li>
<li><a href="#quick-start">Quick Start</a></li>
<li><a href="#main-functions">Main Functions</a></li>
<li><a href="#installing-fastqc-from-r">Installing FastQC from R</a></li>
<li><a href="#running-fastqc-from-r">Running FastQC from R</a></li>
<li><a href="#fastqc-reports">FastQC Reports</a></li>
<li><a href="#aggregating-reports">Aggregating Reports</a></li>
<li><a href="#summarizing-reports">Summarizing Reports</a></li>
<li><a href="#inspecting-problems">Inspecting Problems</a></li>
<li><a href="#building-an-html-report">Building an HTML Report</a></li>
<li><a href="#importing-and-plotting-a-fastqc-qc-report">Importing and Plotting a FastQC QC Report</a></li>
<li><a href="#interpreting-fastqc-reports">Interpreting FastQC Reports</a></li>
<li><a href="#useful-links">Useful Links</a></li>
<li><a href="#infos">Infos</a></li>
</ul>
</div>
<br/>

</div>
<div id="installation-and-loading-fastqcr" class="section level2">
<h2>Installation and loading fastqcr</h2>
<ul>
<li>fastqcr can be installed from <a href="https://cran.r-project.org/package=fastqcr">CRAN</a> as follows:</li>
</ul>
<pre class="r"><code>install.packages("fastqcr")</code></pre>
<ul>
<li>Or, install the latest version from <a href="https://github.com/kassambara/fastqcr">GitHub</a>:</li>
</ul>
<pre class="r"><code>if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/fastqcr")</code></pre>
<ul>
<li>Load fastqcr:</li>
</ul>
<pre class="r"><code>library("fastqcr")</code></pre>
</div>
<div id="quick-start" class="section level2">
<h2>Quick Start</h2>
<pre class="r"><code>library(fastqcr)

# Aggregating Multiple FastQC Reports into a Data Frame 
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

# Demo QC directory containing zipped FASTQC reports
qc.dir <- system.file("fastqc_results", package = "fastqcr")
qc <- qc_aggregate(qc.dir)
qc

# Inspecting QC Problems
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

# See which modules failed in the most samples
qc_fails(qc, "module")
# Or, see which samples failed the most
qc_fails(qc, "sample")

# Building Multi QC Reports
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
qc_report(qc.dir, result.file = "multi-qc-report" )

# Building One-Sample QC Reports (+ Interpretation)
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr")
qc_report(qc.file, result.file = "one-sample-report",
          interpret = TRUE)</code></pre>
</div>
<div id="main-functions" class="section level2">
<h2>Main Functions</h2>
<p><strong>1) Installing and Running FastQC</strong></p>
<ul>
<li><p><strong>fastqc_install</strong>(): Install the latest version of the FastQC tool on Unix systems (Mac OS X and Linux)</p></li>
<li><p><strong>fastqc</strong>(): Run the FastQC tool from R.</p></li>
</ul>
<p><strong>2) Aggregating and Summarizing Multiple FastQC Reports</strong></p>
<ul>
<li><p><strong>qc <- qc_aggregate</strong>(): Aggregate multiple FastQC reports into a data frame.</p></li>
<li><p><strong>summary</strong>(qc): Generates a summary of qc_aggregate.</p></li>
<li><p><strong>qc_stats</strong>(qc): General statistics of FastQC reports.</p></li>
</ul>
<p><strong>3) Inspecting Problems</strong></p>
<ul>
<li><p><strong>qc_fails</strong>(qc): Displays samples or modules that failed.</p></li>
<li><p><strong>qc_warns</strong>(qc): Displays samples or modules that warned.</p></li>
<li><p><strong>qc_problems</strong>(qc): Union of <strong>qc_fails</strong>() and <strong>qc_warns</strong>(). Displays samples or modules that failed or warned.</p></li>
</ul>
<p><strong>4) Importing and Plotting FastQC Reports</strong></p>
<ul>
<li><p><strong>qc_read</strong>(): Read FastQC data into R.</p></li>
<li><p><strong>qc_plot</strong>(qc): Plot FastQC data</p></li>
</ul>
<p><strong>5) Building One-Sample and Multi-QC Reports</strong></p>
<ul>
<li><strong>qc_report</strong>(): Create an HTML file containing FastQC reports of one or multiple files. Inputs can be either a directory containing multiple FastQC reports or a single sample FastQC report.</li>
</ul>
<p><strong>6) Others</strong></p>
<ul>
<li><strong>qc_unzip</strong>(): Unzip all zipped files in the qc.dir directory. <br/></li>
</ul>
</div>
<div id="installing-fastqc-from-r" class="section level2">
<h2>Installing FastQC from R</h2>
<p>You can automatically install the FastQC tool from R as follows:</p>
<pre class="r"><code>fastqc_install()</code></pre>
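<p>Before installing, it can be useful to check whether a FastQC executable is already visible to R. The following is a minimal sketch using only base R; it assumes the executable is named <em>fastqc</em> and only inspects the system PATH, so an installation placed elsewhere will not be detected this way.</p>
<pre class="r"><code># Look up the 'fastqc' executable on the system PATH (base R only).
# Sys.which() returns an empty string when the program is not found.
fastqc_path <- Sys.which("fastqc")

if (nzchar(fastqc_path)) {
  message("FastQC found at: ", fastqc_path)
} else {
  message("FastQC not found on PATH; it may need to be installed")
}</code></pre>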
</div>
<div id="running-fastqc-from-r" class="section level2">
<h2>Running FastQC from R</h2>
<p>The <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/2%20Basic%20Operations/2.1%20Opening%20a%20sequence%20file.html">file formats</a> supported by FastQC include:</p>
<ul>
<li>FASTQ</li>
<li>gzip compressed FASTQ</li>
</ul>
<p>Suppose that your working directory is organized as follows:</p>
<ul>
<li>home
<ul>
<li>Documents
<ul>
<li>FASTQ</li>
</ul></li>
</ul></li>
</ul>
<p>where FASTQ is the directory containing your FASTQ files, for which you want to perform the quality control check.</p>
<p>To run FastQC from R, type this:</p>
<pre class="r"><code>fastqc(fq.dir = "~/Documents/FASTQ", # FASTQ files directory
       qc.dir = "~/Documents/FASTQC", # Results directory
       threads = 4                    # Number of threads
       )</code></pre>
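<p>As a quick sanity check before launching a long run, you may want to list the FASTQ files present in the input directory. The sketch below uses base R only; the directory is the hypothetical one from the layout above, and the file extensions in the pattern are an assumption about how your files are named.</p>
<pre class="r"><code># Hypothetical input directory, matching the layout described above
fq.dir <- "~/Documents/FASTQ"

# Match .fastq / .fq files, optionally gzip-compressed
fq.files <- list.files(fq.dir, pattern = "\\.(fastq|fq)(\\.gz)?$",
                       full.names = TRUE)

length(fq.files)     # number of candidate input files
basename(fq.files)   # their names</code></pre>
<p>If the directory does not exist or contains no matching files, <em>fq.files</em> is simply an empty character vector.</p>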
</div>
<div id="fastqc-reports" class="section level2">
<h2>FastQC Reports</h2>
<p>For each sample, FastQC performs a series of tests called <em>analysis modules</em>.</p>
<p>These modules include:</p>
<ul>
<li>Basic Statistics</li>
<li>Per base sequence quality</li>
<li>Per tile sequence quality</li>
<li>Per sequence quality scores</li>
<li>Per base sequence content</li>
<li>Per sequence GC content</li>
<li>Per base N content</li>
<li>Sequence Length Distribution</li>
<li>Sequence Duplication Levels</li>
<li>Overrepresented sequences</li>
<li>Adapter Content</li>
<li>Kmer Content</li>
</ul>
<p>The interpretation of these modules is provided in the official documentation of the <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/">FastQC tool</a>.</p>
</div>
<div id="aggregating-reports" class="section level2">
<h2>Aggregating Reports</h2>
<p>Here, we provide an R function <strong>qc_aggregate()</strong> to walk the FastQC result directory, find all the FastQC zipped output folders, read the <strong>fastqc_data.txt</strong> and the <strong>summary.txt</strong> files, and aggregate the information into a data frame.</p>
<p>The fastqc_data.txt file contains the raw data and statistics while the summary.txt file summarizes which tests have been passed.</p>
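<p>To give an idea of what the summary.txt file looks like, here is a minimal sketch that parses a few lines of it with base R. The three lines of text and the file name S1.fastq.gz are made up for illustration; a real file lists every module. This is not how fastqcr reads the file internally, only a way to see the tab-separated layout (status, module, file name).</p>
<pre class="r"><code># Illustrative summary.txt content (tab-separated: status, module, file name).
# These lines are invented for the example; real files list every module.
summary_txt <- paste(
  "PASS\tBasic Statistics\tS1.fastq.gz",
  "FAIL\tPer base sequence content\tS1.fastq.gz",
  "WARN\tPer sequence GC content\tS1.fastq.gz",
  sep = "\n"
)

# read.delim() splits on tabs by default
s <- read.delim(text = summary_txt, header = FALSE,
                col.names = c("status", "module", "file"),
                stringsAsFactors = FALSE)
s

# Count how many modules fall in each status
table(s$status)</code></pre>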
<p>In the example below, we’ll use a demo FastQC output directory available in the fastqcr package.</p>
<pre class="r"><code>library(fastqcr)
# Demo QC dir
qc.dir <- system.file("fastqc_results", package = "fastqcr")
qc.dir</code></pre>
<pre><code>## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/fastqcr/fastqc_results"</code></pre>
<pre class="r"><code># List of files in the directory
list.files(qc.dir)</code></pre>
<pre><code>## [1] "S1_fastqc.zip" "S2_fastqc.zip" "S3_fastqc.zip" "S4_fastqc.zip" "S5_fastqc.zip"</code></pre>
<p>The demo QC directory contains five zipped folders corresponding to the FastQC output for 5 samples.</p>
<p>Aggregating FastQC reports:</p>
<pre class="r"><code>qc <- qc_aggregate(qc.dir)
qc</code></pre>
<p>The aggregated report looks like this:</p>
<table>
<thead>
<tr class="header">
<th align="left">sample</th>
<th align="left">module</th>
<th align="left">status</th>
<th align="left">tot.seq</th>
<th align="left">seq.length</th>
<th align="right">pct.gc</th>
<th align="right">pct.dup</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">S4</td>
<td align="left">Per tile sequence quality</td>
<td align="left">PASS</td>
<td align="left">67255341</td>
<td align="left">35-76</td>
<td align="right">49</td>
<td align="right">19.89</td>
</tr>
<tr class="even">
<td align="left">S3</td>
<td align="left">Per base sequence quality</td>
<td align="left">PASS</td>
<td align="left">67255341</td>
<td align="left">35-76</td>
<td align="right">49</td>
<td align="right">22.14</td>
</tr>
<tr class="odd">
<td align="left">S3</td>
<td align="left">Per base N content</td>
<td align="left">PASS</td>
<td align="left">67255341</td>
<td align="left">35-76</td>
<td align="right">49</td>
<td align="right">22.14</td>
</tr>
<tr class="even">
<td align="left">S5</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
<td align="left">65011962</td>
<td align="left">35-76</td>
<td align="right">48</td>
<td align="right">18.15</td>
</tr>
<tr class="odd">
<td align="left">S2</td>
<td align="left">Sequence Duplication Levels</td>
<td align="left">PASS</td>
<td align="left">50299587</td>
<td align="left">35-76</td>
<td align="right">48</td>
<td align="right">15.70</td>
</tr>
<tr class="even">
<td align="left">S1</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
<td align="left">50299587</td>
<td align="left">35-76</td>
<td align="right">48</td>
<td align="right">17.24</td>
</tr>
<tr class="odd">
<td align="left">S1</td>
<td align="left">Overrepresented sequences</td>
<td align="left">PASS</td>
<td align="left">50299587</td>
<td align="left">35-76</td>
<td align="right">48</td>
<td align="right">17.24</td>
</tr>
<tr class="even">
<td align="left">S3</td>
<td align="left">Basic Statistics</td>
<td align="left">PASS</td>
<td align="left">67255341</td>
<td align="left">35-76</td>
<td align="right">49</td>
<td align="right">22.14</td>
</tr>
<tr class="odd">
<td align="left">S1</td>
<td align="left">Basic Statistics</td>
<td align="left">PASS</td>
<td align="left">50299587</td>
<td align="left">35-76</td>
<td align="right">48</td>
<td align="right">17.24</td>
</tr>
<tr class="even">
<td align="left">S4</td>
<td align="left">Overrepresented sequences</td>
<td align="left">PASS</td>
<td align="left">67255341</td>
<td align="left">35-76</td>
<td align="right">49</td>
<td align="right">19.89</td>
</tr>
</tbody>
</table>
<p>Column names:</p>
<ul>
<li><strong>sample</strong>: sample names</li>
<li><strong>module</strong>: fastqc modules</li>
<li><strong>status</strong>: fastqc module status for each sample</li>
<li><strong>tot.seq</strong>: total sequences (i.e., the number of reads)</li>
<li><strong>seq.length</strong>: sequence length</li>
<li><strong>pct.gc</strong>: percentage of GC content</li>
<li><strong>pct.dup</strong>: percentage of duplicate reads</li>
</ul>
<div class="block">
<p>
The table shows, for each sample, the names of the tested FastQC modules, the status of each test, as well as some general statistics including the number of reads, the length of reads, the percentage of GC content and the percentage of duplicate reads.
</p>
</div>
<p>Once you have the aggregated data you can use the <strong>dplyr</strong> package to easily inspect modules that failed or warned in samples. For example, the following R code shows samples with warnings and/or failures:</p>
<pre class="r"><code>library(dplyr)
qc %>%
  select(sample, module, status) %>%    
  filter(status %in% c("WARN", "FAIL")) %>%
  arrange(sample)</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">sample</th>
<th align="left">module</th>
<th align="left">status</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">S1</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">S1</td>
<td align="left">Per sequence GC content</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">S1</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
<tr class="even">
<td align="left">S2</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="odd">
<td align="left">S2</td>
<td align="left">Per sequence GC content</td>
<td align="left">WARN</td>
</tr>
<tr class="even">
<td align="left">S2</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">S3</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">S3</td>
<td align="left">Per sequence GC content</td>
<td align="left">FAIL</td>
</tr>
<tr class="odd">
<td align="left">S3</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
<tr class="even">
<td align="left">S4</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="odd">
<td align="left">S4</td>
<td align="left">Per sequence GC content</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">S4</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">S5</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">S5</td>
<td align="left">Per sequence GC content</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">S5</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
</tbody>
</table>
</div>
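<p>A filtered table like the one above can also be condensed into counts per sample. The following sketch uses base R on a small toy data frame whose rows are copied from the table above (samples S1 and S3 only), so it runs without the fastqcr package.</p>
<pre class="r"><code># Toy data copied from the table above (samples S1 and S3 only)
problems <- data.frame(
  sample = c("S1", "S1", "S1", "S3", "S3", "S3"),
  module = rep(c("Per base sequence content",
                 "Per sequence GC content",
                 "Sequence Length Distribution"), times = 2),
  status = c("FAIL", "WARN", "WARN", "FAIL", "FAIL", "WARN"),
  stringsAsFactors = FALSE
)

# Number of FAIL and WARN modules per sample
status_counts <- table(problems$sample, problems$status)
status_counts</code></pre>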
<div class="success">
<p>
In the next section, we’ll describe some easy-to-use functions, available in the <strong>fastqcr</strong> package, for analyzing the aggregated data.
</p>
</div>
</div>
<div id="summarizing-reports" class="section level2">
<h2>Summarizing Reports</h2>
<p>We start by presenting a summary and general statistics of the aggregated data.</p>
<div id="qc-summary" class="section level3">
<h3>QC Summary</h3>
<ul>
<li>R function: <strong>summary</strong>()</li>
<li>Input data: aggregated data from <strong>qc_aggregate</strong>()</li>
</ul>
<pre class="r"><code># Summary of qc
summary(qc)</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">module</th>
<th align="right">nb_samples</th>
<th align="right">nb_fail</th>
<th align="right">nb_pass</th>
<th align="right">nb_warn</th>
<th align="left">failed</th>
<th align="left">warned</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Adapter Content</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="left">NA</td>
<td align="left">NA</td>
</tr>
<tr class="even">
<td align="left">Basic Statistics</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="left">NA</td>
<td align="left">NA</td>
</tr>
<tr class="odd">
<td align="left">Kmer Content</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="left">NA</td>
<td align="left">NA</td>
</tr>
<tr class="even">
<td align="left">Overrepresented sequences</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="left">NA</td>
<td align="left">NA</td>
</tr>
<tr class="odd">
<td align="left">Per base N content</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="left">NA</td>
<td align="left">NA</td>
</tr>
<tr class="even">
<td align="left">Per base sequence content</td>
<td align="right">5</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="left">S1, S2, S3, S4, S5</td>
<td align="left">NA</td>
</tr>
<tr class="odd">
<td align="left">Per base sequence quality</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="left">NA</td>
<td align="left">NA</td>
</tr>
<tr class="even">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="right">2</td>
<td align="right">0</td>
<td align="right">3</td>
<td align="left">S3, S4</td>
<td align="left">S1, S2, S5</td>
</tr>
<tr class="odd">
<td align="left">Per sequence quality scores</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="left">NA</td>
<td align="left">NA</td>
</tr>
<tr class="even">
<td align="left">Per tile sequence quality</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="left">NA</td>
<td align="left">NA</td>
</tr>
<tr class="odd">
<td align="left">Sequence Duplication Levels</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="left">NA</td>
<td align="left">NA</td>
</tr>
<tr class="even">
<td align="left">Sequence Length Distribution</td>
<td align="right">5</td>
<td align="right">0</td>
<td align="right">0</td>
<td align="right">5</td>
<td align="left">NA</td>
<td align="left">S1, S2, S3, S4, S5</td>
</tr>
</tbody>
</table>
</div>
<p>Column names:</p>
<ul>
<li><em>module</em>: fastqc modules</li>
<li><em>nb_samples</em>: the number of samples tested</li>
<li><em>nb_pass, nb_fail, nb_warn</em>: the number of samples that passed, failed and warned, respectively.</li>
<li><em>failed, warned</em>: the names of the samples that failed and warned, respectively.</li>
</ul>
<div class="block">
<p>
The table shows, for each FastQC module, the number and the name of samples that failed or warned.
</p>
</div>
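<p>To make the meaning of these columns concrete, here is a sketch of how similar per-module counts could be derived with <strong>dplyr</strong>. It is not the implementation used by the package; the toy data frame <em>qc_toy</em> is a hypothetical subset with two modules, whose statuses are copied from the demo results above.</p>
<pre class="r"><code>library(dplyr)

# Toy aggregated data: two modules, five samples (statuses from the demo data)
qc_toy <- data.frame(
  sample = rep(paste0("S", 1:5), times = 2),
  module = rep(c("Basic Statistics", "Per sequence GC content"), each = 5),
  status = c(rep("PASS", 5), "WARN", "WARN", "FAIL", "FAIL", "WARN"),
  stringsAsFactors = FALSE
)

# One row per module, with counts and collapsed sample names
toy_summary <- qc_toy %>%
  group_by(module) %>%
  summarise(
    nb_samples = n_distinct(sample),
    nb_fail = sum(status == "FAIL"),
    nb_pass = sum(status == "PASS"),
    nb_warn = sum(status == "WARN"),
    failed = paste(sample[status == "FAIL"], collapse = ", "),
    warned = paste(sample[status == "WARN"], collapse = ", ")
  )
toy_summary</code></pre>
<p>In this sketch, modules without failures or warnings get an empty string in the <em>failed</em> and <em>warned</em> columns, whereas the table above shows NA.</p>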
</div>
<div id="general-statistics" class="section level3">
<h3>General statistics</h3>
<ul>
<li>R function: <strong>qc_stats</strong>()</li>
<li>Input data: aggregated data from <strong>qc_aggregate</strong>()</li>
</ul>
<pre class="r"><code>qc_stats(qc)</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">sample</th>
<th align="right">pct.dup</th>
<th align="right">pct.gc</th>
<th align="left">tot.seq</th>
<th align="left">seq.length</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">S1</td>
<td align="right">17.24</td>
<td align="right">48</td>
<td align="left">50299587</td>
<td align="left">35-76</td>
</tr>
<tr class="even">
<td align="left">S2</td>
<td align="right">15.70</td>
<td align="right">48</td>
<td align="left">50299587</td>
<td align="left">35-76</td>
</tr>
<tr class="odd">
<td align="left">S3</td>
<td align="right">22.14</td>
<td align="right">49</td>
<td align="left">67255341</td>
<td align="left">35-76</td>
</tr>
<tr class="even">
<td align="left">S4</td>
<td align="right">19.89</td>
<td align="right">49</td>
<td align="left">67255341</td>
<td align="left">35-76</td>
</tr>
<tr class="odd">
<td align="left">S5</td>
<td align="right">18.15</td>
<td align="right">48</td>
<td align="left">65011962</td>
<td align="left">35-76</td>
</tr>
</tbody>
</table>
</div>
<p>Column names:</p>
<ul>
<li><em>pct.dup</em>: the percentage of duplicate reads,</li>
<li><em>pct.gc</em>: the percentage of GC content,</li>
<li><em>tot.seq</em>: total sequences or the number of reads and</li>
<li><em>seq.length</em>: sequence length or the length of reads.</li>
</ul>
<div class="block">
<p>
The table shows, for each sample, some general statistics such as the total number of reads, the length of reads, the percentage of GC content and the percentage of duplicate reads.
</p>
</div>
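<p>These statistics can be combined for quick back-of-the-envelope calculations. The sketch below rebuilds part of the table above by hand and derives an approximate number of duplicate reads per sample. It assumes that <em>tot.seq</em> is stored as text and needs converting; the column name <em>approx.dup.reads</em> is introduced here for illustration. Since the duplication percentage is itself an estimate, the result should be read as a rough order of magnitude.</p>
<pre class="r"><code># Values copied from the qc_stats() table above
stats <- data.frame(
  sample  = c("S1", "S2", "S3", "S4", "S5"),
  pct.dup = c(17.24, 15.70, 22.14, 19.89, 18.15),
  tot.seq = c("50299587", "50299587", "67255341", "67255341", "65011962"),
  stringsAsFactors = FALSE
)

# tot.seq is held as text here, so convert it before doing arithmetic
stats$tot.seq <- as.numeric(stats$tot.seq)

# Approximate number of duplicate reads = total reads x duplication rate
stats$approx.dup.reads <- round(stats$tot.seq * stats$pct.dup / 100)

# Samples ordered from the highest to the lowest duplication rate
stats[order(-stats$pct.dup), ]</code></pre>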
</div>
</div>
<div id="inspecting-problems" class="section level2">
<h2>Inspecting Problems</h2>
<p>Once you’ve got this aggregated data, it’s easy to figure out what (if anything) is wrong with your data.</p>
<p><strong>1) R functions</strong>. You can inspect problems either per module or per sample using the following R functions:</p>
<ul>
<li><strong>qc_fails</strong>(qc): Displays samples or modules that failed.</li>
<li><strong>qc_warns</strong>(qc): Displays samples or modules that warned.</li>
<li><strong>qc_problems</strong>(qc): Union of <strong>qc_fails</strong>() and <strong>qc_warns</strong>(). Displays samples or modules that failed or warned.</li>
</ul>
<p><strong>2) Input data</strong>: aggregated data from <strong>qc_aggregate</strong>()</p>
<p><strong>3) Output data</strong>: Returns samples or FastQC modules with failures or warnings. By default, these functions return a compact output format. If you want a stretched format, specify the argument <em>compact = FALSE</em>.</p>
<p>The format and the interpretation of the outputs depend on the additional argument <em>element</em>, whose value is one of c("sample", "module").</p>
<ul>
<li>If <strong>element = "sample"</strong> (default), results are samples with failed and/or warned modules. The results contain the following columns:
<ul>
<li>sample (sample names),</li>
<li>nb_problems (the number of modules with problems),</li>
<li>module (the name of modules with problems).</li>
</ul></li>
<li>If <strong>element = "module"</strong>, results are modules that failed and/or warned in the most samples. The results contain the following columns:
<ul>
<li>module (the name of module with problems),</li>
<li>nb_problems (the number of samples with problems),</li>
<li>sample (the name of samples with problems)</li>
</ul></li>
</ul>
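<p>The relationship between the stretched and the compact formats can be illustrated with a few lines of base R. This is only a sketch of the idea, not the code used by the package: the stretched data frame below is typed in by hand from the failures in the demo data, and the compact version is obtained by collapsing the sample names per module.</p>
<pre class="r"><code># Stretched form: one row per module/sample pair (failures in the demo data)
stretched <- data.frame(
  module = c(rep("Per base sequence content", 5),
             rep("Per sequence GC content", 2)),
  sample = c("S1", "S2", "S3", "S4", "S5", "S3", "S4"),
  stringsAsFactors = FALSE
)

# Collapse the sample names of each module into a single string
samples_by_module <- tapply(stretched$sample, stretched$module,
                            paste, collapse = ", ")

# Compact form: one row per module
compact <- data.frame(
  module = names(samples_by_module),
  nb_problems = as.vector(table(stretched$module)[names(samples_by_module)]),
  sample = as.vector(samples_by_module),
  stringsAsFactors = FALSE
)
compact</code></pre>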
<div id="per-module-problems" class="section level3">
<h3>Per Module Problems</h3>
<ul>
<li><strong>Modules that failed in the most samples</strong>:</li>
</ul>
<pre class="r"><code># See which module failed in the most samples
qc_fails(qc, "module")</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">module</th>
<th align="right">nb_problems</th>
<th align="left">sample</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Per base sequence content</td>
<td align="right">5</td>
<td align="left">S1, S2, S3, S4, S5</td>
</tr>
<tr class="even">
<td align="left">Per sequence GC content</td>
<td align="right">2</td>
<td align="left">S3, S4</td>
</tr>
</tbody>
</table>
</div>
<div class="success">
<p>
For each module, the number of problems (failures) and the names of the samples that failed are shown.
</p>
</div>
<ul>
<li><strong>Modules that warned in the most samples</strong>:</li>
</ul>
<pre class="r"><code># See which module warned in the most samples
qc_warns(qc, "module")</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">module</th>
<th align="right">nb_problems</th>
<th align="left">sample</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Sequence Length Distribution</td>
<td align="right">5</td>
<td align="left">S1, S2, S3, S4, S5</td>
</tr>
<tr class="even">
<td align="left">Per sequence GC content</td>
<td align="right">3</td>
<td align="left">S1, S2, S5</td>
</tr>
</tbody>
</table>
</div>
<ul>
<li><strong>Modules that failed or warned</strong>: Union of qc_fails() and qc_warns()</li>
</ul>
<pre class="r"><code># See which modules failed or warned.
qc_problems(qc, "module")</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">module</th>
<th align="right">nb_problems</th>
<th align="left">sample</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Per base sequence content</td>
<td align="right">5</td>
<td align="left">S1, S2, S3, S4, S5</td>
</tr>
<tr class="even">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S1, S2, S3, S4, S5</td>
</tr>
<tr class="odd">
<td align="left">Sequence Length Distribution</td>
<td align="right">5</td>
<td align="left">S1, S2, S3, S4, S5</td>
</tr>
</tbody>
</table>
</div>
<p>The output above is in a compact format. For a stretched format, type this:</p>
<pre class="r"><code>qc_problems(qc, "module", compact = FALSE)</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">module</th>
<th align="right">nb_problems</th>
<th align="left">sample</th>
<th align="left">status</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Per base sequence content</td>
<td align="right">5</td>
<td align="left">S1</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">Per base sequence content</td>
<td align="right">5</td>
<td align="left">S2</td>
<td align="left">FAIL</td>
</tr>
<tr class="odd">
<td align="left">Per base sequence content</td>
<td align="right">5</td>
<td align="left">S3</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">Per base sequence content</td>
<td align="right">5</td>
<td align="left">S4</td>
<td align="left">FAIL</td>
</tr>
<tr class="odd">
<td align="left">Per base sequence content</td>
<td align="right">5</td>
<td align="left">S5</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S3</td>
<td align="left">FAIL</td>
</tr>
<tr class="odd">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S4</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S1</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S2</td>
<td align="left">WARN</td>
</tr>
<tr class="even">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S5</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">Sequence Length Distribution</td>
<td align="right">5</td>
<td align="left">S1</td>
<td align="left">WARN</td>
</tr>
<tr class="even">
<td align="left">Sequence Length Distribution</td>
<td align="right">5</td>
<td align="left">S2</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">Sequence Length Distribution</td>
<td align="right">5</td>
<td align="left">S3</td>
<td align="left">WARN</td>
</tr>
<tr class="even">
<td align="left">Sequence Length Distribution</td>
<td align="right">5</td>
<td align="left">S4</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">Sequence Length Distribution</td>
<td align="right">5</td>
<td align="left">S5</td>
<td align="left">WARN</td>
</tr>
</tbody>
</table>
</div>
<div class="success">
<p>
In the stretched format, each row corresponds to a single sample. Additionally, the status of each module is specified.
</p>
</div>
<p>It’s also possible to display problems for one or more specified modules. For example,</p>
<pre class="r"><code>qc_problems(qc, "module",  name = "Per sequence GC content")</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">module</th>
<th align="right">nb_problems</th>
<th align="left">sample</th>
<th align="left">status</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S3</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S4</td>
<td align="left">FAIL</td>
</tr>
<tr class="odd">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S1</td>
<td align="left">WARN</td>
</tr>
<tr class="even">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S2</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">Per sequence GC content</td>
<td align="right">5</td>
<td align="left">S5</td>
<td align="left">WARN</td>
</tr>
</tbody>
</table>
</div>
<div class="warning">
<p>
Note that partial matching of name is allowed. For example, name = "Per sequence GC content" is equivalent to name = "GC content".
</p>
</div>
<pre class="r"><code>qc_problems(qc, "module",  name = "GC content")</code></pre>
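<p>To see why a short name can be enough, the sketch below shows one simple way partial matching can work, using a literal, case-insensitive substring search in base R. How fastqcr matches names internally is not shown here; the <em>modules</em> vector is just the three problematic module names from the demo data.</p>
<pre class="r"><code># Module names with problems in the demo data
modules <- c("Per base sequence content",
             "Per sequence GC content",
             "Sequence Length Distribution")

# Literal (non-regex), case-insensitive substring match
pattern <- "GC content"
matched <- modules[grepl(tolower(pattern), tolower(modules), fixed = TRUE)]
matched</code></pre>
<p>Here only "Per sequence GC content" contains the pattern, so it is the only module kept.</p>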
</div>
<div id="per-sample-problems" class="section level3">
<h3>Per Sample Problems</h3>
<ul>
<li><strong>Samples with one or more failed modules</strong></li>
</ul>
<pre class="r"><code># See which samples had one or more failed modules
qc_fails(qc, "sample")</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">sample</th>
<th align="right">nb_problems</th>
<th align="left">module</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">S3</td>
<td align="right">2</td>
<td align="left">Per base sequence content, Per sequence GC content</td>
</tr>
<tr class="even">
<td align="left">S4</td>
<td align="right">2</td>
<td align="left">Per base sequence content, Per sequence GC content</td>
</tr>
<tr class="odd">
<td align="left">S1</td>
<td align="right">1</td>
<td align="left">Per base sequence content</td>
</tr>
<tr class="even">
<td align="left">S2</td>
<td align="right">1</td>
<td align="left">Per base sequence content</td>
</tr>
<tr class="odd">
<td align="left">S5</td>
<td align="right">1</td>
<td align="left">Per base sequence content</td>
</tr>
</tbody>
</table>
</div>
<div class="success">
<p>
For each sample, the number of problems (failures) and the names of the failed modules are shown.
</p>
</div>
<ul>
<li><strong>Samples with failed or warned modules</strong>:</li>
</ul>
<pre class="r"><code># See which samples had one or more modules with a failure or a warning
qc_problems(qc, "sample", compact = FALSE)</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">sample</th>
<th align="right">nb_problems</th>
<th align="left">module</th>
<th align="left">status</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">S1</td>
<td align="right">3</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">S1</td>
<td align="right">3</td>
<td align="left">Per sequence GC content</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">S1</td>
<td align="right">3</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
<tr class="even">
<td align="left">S2</td>
<td align="right">3</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="odd">
<td align="left">S2</td>
<td align="right">3</td>
<td align="left">Per sequence GC content</td>
<td align="left">WARN</td>
</tr>
<tr class="even">
<td align="left">S2</td>
<td align="right">3</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">S3</td>
<td align="right">3</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">S3</td>
<td align="right">3</td>
<td align="left">Per sequence GC content</td>
<td align="left">FAIL</td>
</tr>
<tr class="odd">
<td align="left">S3</td>
<td align="right">3</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
<tr class="even">
<td align="left">S4</td>
<td align="right">3</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="odd">
<td align="left">S4</td>
<td align="right">3</td>
<td align="left">Per sequence GC content</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">S4</td>
<td align="right">3</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">S5</td>
<td align="right">3</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">S5</td>
<td align="right">3</td>
<td align="left">Per sequence GC content</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">S5</td>
<td align="right">3</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
</tbody>
</table>
</div>
<p>To specify the name of a sample of interest, type this:</p>
<pre class="r"><code>qc_problems(qc, "sample", name = "S1")</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">sample</th>
<th align="right">nb_problems</th>
<th align="left">module</th>
<th align="left">status</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">S1</td>
<td align="right">3</td>
<td align="left">Per base sequence content</td>
<td align="left">FAIL</td>
</tr>
<tr class="even">
<td align="left">S1</td>
<td align="right">3</td>
<td align="left">Per sequence GC content</td>
<td align="left">WARN</td>
</tr>
<tr class="odd">
<td align="left">S1</td>
<td align="right">3</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">WARN</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div id="building-an-html-report" class="section level2">
<h2>Building an HTML Report</h2>
<p>The function <strong>qc_report</strong>() can be used to build a report of FastQC outputs. It creates an HTML file containing FastQC reports of one or multiple samples.</p>
<p>Inputs can be either a directory containing multiple FastQC reports or a single sample FastQC report.</p>
<div id="create-a-multi-qc-report" class="section level3">
<h3>Create a Multi-QC Report</h3>
<p>We’ll build a multi-qc report for the following demo QC directory:</p>
<pre class="r"><code># Demo QC Directory
qc.dir <- system.file("fastqc_results", package = "fastqcr")
qc.dir</code></pre>
<pre><code>## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/fastqcr/fastqc_results"</code></pre>
<pre class="r"><code># Build a report
qc_report(qc.dir, result.file = "~/Desktop/multi-qc-result",
          experiment = "Exome sequencing of colon cancer cell lines")</code></pre>
<div class="success">
<p>
An example report is available at: <a href="https://www.sthda.com/english/rpkgs/fastqcr/qc-reports/fastqcr-multi-qc-report.html" target="_blank">fastqcr multi-qc report</a>
</p>
</div>
</div>
<div id="create-a-one-sample-report" class="section level3">
<h3>Create a One-Sample Report</h3>
<p>We’ll build a report for the following demo QC file:</p>
<pre class="r"><code>qc.file <- system.file("fastqc_results", "S1_fastqc.zip", package = "fastqcr")
qc.file</code></pre>
<pre><code>## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/fastqcr/fastqc_results/S1_fastqc.zip"</code></pre>
<ul>
<li><strong>One-Sample QC report with plot interpretations</strong>:</li>
</ul>
<pre class="r"><code>qc_report(qc.file, result.file = "one-sample-report-with-interpretation",
          interpret = TRUE)</code></pre>
<div class="success">
<p>
An example report is available at: <a href="https://www.sthda.com/english/rpkgs/fastqcr/qc-reports/sample-qc-report-interpretation.html" target="_blank">One sample QC report with interpretation</a>
</p>
</div>
<ul>
<li><strong>One-Sample QC report without plot interpretations</strong>:</li>
</ul>
<pre class="r"><code>qc_report(qc.file, result.file = "one-sample-report",
          interpret = FALSE)</code></pre>
<div class="success">
<p>
An example report is available at: <a href="https://www.sthda.com/english/rpkgs/fastqcr/qc-reports/sample-qc-report-without-interpretation.html" target="_blank">One sample QC report without interpretation</a>
</p>
</div>
</div>
</div>
<div id="importing-and-plotting-a-fastqc-qc-report" class="section level2">
<h2>Importing and Plotting a FastQC QC Report</h2>
<p>We’ll visualize the output for sample 1:</p>
<pre class="r"><code># Demo file
qc.file <- system.file("fastqc_results", "S1_fastqc.zip",  package = "fastqcr")
qc.file</code></pre>
<pre><code>## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library/fastqcr/fastqc_results/S1_fastqc.zip"</code></pre>
<p>We start by reading the output using the function <strong>qc_read</strong>(), which returns a list of tibbles containing the data for specified modules:</p>
<pre class="r"><code># Read all modules
qc <- qc_read(qc.file)
# Elements contained in the qc object
names(qc)</code></pre>
<pre><code>##  [1] "summary"                       "basic_statistics"              "per_base_sequence_quality"     "per_tile_sequence_quality"    
##  [5] "per_sequence_quality_scores"   "per_base_sequence_content"     "per_sequence_gc_content"       "per_base_n_content"           
##  [9] "sequence_length_distribution"  "sequence_duplication_levels"   "overrepresented_sequences"     "adapter_content"              
## [13] "kmer_content"                  "total_deduplicated_percentage"</code></pre>
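<p>The element names printed above appear to follow the FastQC module names, lower-cased and with spaces replaced by underscores. A minimal base R sketch of that pattern, assuming it holds for every module (<code>to_element_name()</code> is a hypothetical helper, not a fastqcr function):</p>
<pre class="r"><code># Module names as displayed by FastQC
display <- c("Per base sequence quality", "Per sequence GC content", "Kmer Content")

# Hypothetical helper: lower-case the name and replace spaces with underscores
to_element_name <- function(x) gsub(" ", "_", tolower(x), fixed = TRUE)

to_element_name(display)</code></pre>
<p>The resulting strings can be compared with the output of <code>names(qc)</code> to locate the tibble for a given module.</p>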
<p>The function <strong>qc_plot</strong>() is used to visualize the data of a specified module. Allowed values for the argument modules include one or a combination of:</p>
<ul>
<li>“Summary”,</li>
<li>“Basic Statistics”,</li>
<li>“Per base sequence quality”,</li>
<li>“Per sequence quality scores”,</li>
<li>“Per base sequence content”,</li>
<li>“Per sequence GC content”,</li>
<li>“Per base N content”,</li>
<li>“Sequence Length Distribution”,</li>
<li>“Sequence Duplication Levels”,</li>
<li>“Overrepresented sequences”,</li>
<li>“Adapter Content”</li>
</ul>
<pre class="r"><code>qc_plot(qc, "Per sequence GC content")

qc_plot(qc, "Per base sequence quality")

qc_plot(qc, "Per sequence quality scores")

qc_plot(qc, "Per base sequence content")

qc_plot(qc, "Sequence duplication levels")</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-fastqc-plot-1.png" alt="fastqcr" width="336" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-fastqc-plot-2.png" alt="fastqcr" width="336" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-fastqc-plot-3.png" alt="fastqcr" width="336" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-fastqc-plot-4.png" alt="fastqcr" width="336" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-fastqc-plot-5.png" alt="fastqcr" width="336" />
<p class="caption">
fastqcr
</p>
</div>
</div>
<div id="interpreting-fastqc-reports" class="section level2">
<h2>Interpreting FastQC Reports</h2>
<ul>
<li><strong>Summary</strong> shows a summary of the modules which were tested, and the status of the test results:
<ul>
<li>normal results (PASS),</li>
<li>slightly abnormal (WARN: warning)</li>
<li>or very unusual (FAIL: failure).</li>
</ul></li>
</ul>
<p>Some experiments may be expected to produce libraries that are biased in particular ways. You should therefore treat the summary evaluations as pointers to where you should concentrate your attention, and try to understand why your library may not look normal.</p>
<pre class="r"><code>qc_plot(qc, "summary")</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">status</th>
<th align="left">module</th>
<th align="left">sample</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">PASS</td>
<td align="left">Basic Statistics</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="even">
<td align="left">PASS</td>
<td align="left">Per base sequence quality</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="odd">
<td align="left">PASS</td>
<td align="left">Per tile sequence quality</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="even">
<td align="left">PASS</td>
<td align="left">Per sequence quality scores</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="odd">
<td align="left">FAIL</td>
<td align="left">Per base sequence content</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="even">
<td align="left">WARN</td>
<td align="left">Per sequence GC content</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="odd">
<td align="left">PASS</td>
<td align="left">Per base N content</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="even">
<td align="left">WARN</td>
<td align="left">Sequence Length Distribution</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="odd">
<td align="left">PASS</td>
<td align="left">Sequence Duplication Levels</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="even">
<td align="left">PASS</td>
<td align="left">Overrepresented sequences</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="odd">
<td align="left">PASS</td>
<td align="left">Adapter Content</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="even">
<td align="left">PASS</td>
<td align="left">Kmer Content</td>
<td align="left">S1.fastq</td>
</tr>
</tbody>
</table>
</div>
<ul>
<li><strong>Basic statistics</strong> shows basic data metrics such as:
<ul>
<li>Total sequences: the number of reads (total sequences),</li>
<li>Sequence length: the length of reads (minimum - maximum)</li>
<li>%GC: GC content</li>
</ul></li>
</ul>
<pre class="r"><code>qc_plot(qc, "Basic statistics")</code></pre>
<div class="kable-table">
<table>
<thead>
<tr class="header">
<th align="left">Measure</th>
<th align="left">Value</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Filename</td>
<td align="left">S1.fastq</td>
</tr>
<tr class="even">
<td align="left">File type</td>
<td align="left">Conventional base calls</td>
</tr>
<tr class="odd">
<td align="left">Encoding</td>
<td align="left">Sanger / Illumina 1.9</td>
</tr>
<tr class="even">
<td align="left">Total Sequences</td>
<td align="left">50299587</td>
</tr>
<tr class="odd">
<td align="left">Sequences flagged as poor quality</td>
<td align="left">0</td>
</tr>
<tr class="even">
<td align="left">Sequence length</td>
<td align="left">35-76</td>
</tr>
<tr class="odd">
<td align="left">%GC</td>
<td align="left">48</td>
</tr>
</tbody>
</table>
</div>
<ul>
<li><strong>Per base sequence quality</strong> plot depicts the quality scores across all bases at each position in the reads. The background color delimits 3 different zones: very good quality (green), reasonable quality (orange) and poor quality (red). A good sample will have qualities all above 28:</li>
</ul>
<pre class="r"><code>qc_plot(qc, "Per base sequence quality")</code></pre>
<div class="figure" style="text-align: center">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-per-base-sequence-quality-1.png" alt="fastqcr" width="384" />
<p class="caption">
fastqcr
</p>
</div>
<p>Problems:</p>
<div class="warning">
<ul>
<li>
<strong>warning</strong> if the median for any base is less than 25.
</li>
<li>
<strong>failure</strong> if the median for any base is less than 20.
</li>
</ul>
</div>
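<p>The two median thresholds above can be sketched in base R as follows. The quality values are made up for illustration, and <code>quality_status()</code> is a hypothetical helper rather than part of fastqcr or FastQC.</p>
<pre class="r"><code># Toy median quality score at each read position (made-up values)
median_quality <- c(34, 33, 31, 28, 24, 19)

# Hypothetical helper: apply the median thresholds listed above
quality_status <- function(q) {
  status <- rep("PASS", length(q))
  status[25 > q] <- "WARN"   # median below 25
  status[20 > q] <- "FAIL"   # median below 20
  status
}

data.frame(position = seq_along(median_quality),
           median   = median_quality,
           status   = quality_status(median_quality))</code></pre>
<p>Because the rule applies to “any base”, a single flagged position is enough to flag the whole module.</p>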
<p>Common reasons for problems:</p>
<div class="block">
<ul>
<li>
<p>
Degradation of (sequencing chemistry) quality over the duration of long runs. Remedy: Quality trimming.
</p>
</li>
<li>
<p>
Short loss of quality earlier in the run, which then recovers to produce later good quality sequence. Can be explained by a transient problem with the run (bubbles in the flowcell for example). In these cases trimming is not advisable as it will remove later good sequence, but you might want to consider masking bases during subsequent mapping or assembly.
</p>
</li>
<li>
<p>
Library with reads of varying length. Warning or error is generated because of very low coverage for a given base range. Before committing to any action, check how many sequences were responsible for triggering an error by looking at the sequence length distribution module results.
</p>
</li>
</ul>
</div>
<ul>
<li><strong>Per sequence quality scores</strong> plot shows the frequencies of quality scores in a sample. It allows you to see if a subset of your sequences have low quality values. If the reads are of good quality, the peak on the plot should be shifted to the right as far as possible (quality > 27).</li>
</ul>
<pre class="r"><code>qc_plot(qc, "Per sequence quality scores")</code></pre>
<div class="figure" style="text-align: center">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-per-sequence-quality-scores-1.png" alt="fastqcr" width="384" />
<p class="caption">
fastqcr
</p>
</div>
<p>Problems:</p>
<div class="warning">
<ul>
<li>
<strong>warning</strong> if the most frequently observed mean quality is below 27 - this equates to a 0.2% error rate.
</li>
<li>
<strong>failure</strong> if the most frequently observed mean quality is below 20 - this equates to a 1% error rate.
</li>
</ul>
</div>
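<p>The error rates quoted above follow from the Phred scale, on which a quality score Q corresponds to an error probability of 10^(-Q/10). A short base R check of that conversion (<code>phred_to_error()</code> is a helper name chosen for this example):</p>
<pre class="r"><code># Phred-scaled quality Q corresponds to an error probability p = 10^(-Q / 10)
phred_to_error <- function(q) 10^(-q / 10)

q <- c(20, 27, 30)
data.frame(quality = q,
           error_percent = round(100 * phred_to_error(q), 2))</code></pre>
<p>A quality of 20 gives exactly 1%, and a quality of 27 gives roughly 0.2%, in line with the thresholds listed above.</p>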
<p>Common reasons for problems:</p>
<div class="block">
<p>
General loss of quality within a run. Remedy: For long runs this may be alleviated through quality trimming.
</p>
</div>
<ul>
<li><strong>Per base sequence content</strong> shows the four nucleotides’ proportions for each position. In a random library you expect no nucleotide bias and the lines should be almost parallel with each other. In a good sequence composition, the difference between A and T, or G and C is < 10% in any position.</li>
</ul>
<pre class="r"><code>qc_plot(qc, "Per base sequence content")</code></pre>
<div class="figure" style="text-align: center">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-per-base-sequence-content-1.png" alt="fastqcr" width="384" />
<p class="caption">
fastqcr
</p>
</div>
<div class="notice">
<p>
It’s worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. For example, in RNA-Seq data, it is common to have bias at the beginning of the reads. This occurs during RNA-Seq library preparation, when “random” primers are annealed to the start of sequences. These primers are not truly random, which leads to variation at the beginning of the reads. We can remove these primers using an adapter-trimming tool.
</p>
</div>
<p>Problems:</p>
<div class="warning">
<ul>
<li>
<strong>warning</strong> if the difference between A and T, or G and C is greater than 10% in any position.

</li>
<li>
<strong>failure</strong> if the difference between A and T, or G and C is greater than 20% in any position.
</li>
</ul>
</div>
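<p>As a rough illustration of these thresholds, the base R sketch below computes the A-T and G-C differences at each position and labels the larger of the two. The percentages are made up, and the labelling is an assumption based only on the rules stated above.</p>
<pre class="r"><code># Toy percentages of each nucleotide at four read positions (made-up values)
content <- data.frame(A = c(38, 34, 26, 25),
                      T = c(14, 20, 25, 25),
                      G = c(30, 25, 24, 25),
                      C = c(18, 21, 25, 25))

# Largest of the A-T and G-C differences at each position
max_diff <- pmax(abs(content$A - content$T), abs(content$G - content$C))

# Label each position: greater than 10 is a warning, greater than 20 a failure
status <- as.character(cut(max_diff, breaks = c(-Inf, 10, 20, Inf),
                           labels = c("PASS", "WARN", "FAIL")))

cbind(content, max_diff, status)</code></pre>
<p>In this toy table the bias is confined to the first two positions, which is the pattern described above for RNA-Seq libraries.</p>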
<p>Common reasons for problems:</p>
<div class="block">
<ul>
<li>
<p>
Overrepresented sequences: adapter dimers or rRNA
</p>
</li>
<li>
<p>
Biased selection of random primers for RNA-seq. Nearly all RNA-Seq libraries will fail this module because of this bias, but this is not a problem which can be fixed by processing, and it doesn’t seem to adversely affect the ability to measure expression.
</p>
</li>
<li>
<p>
Biased composition libraries: Some libraries are inherently biased in their sequence composition. For example, a library treated with sodium bisulphite will have most of its cytosines converted to thymines, meaning that the base composition will be almost devoid of cytosines and will thus trigger an error, despite this being entirely normal for that type of library.
</p>
</li>
<li>
<p>
Library which has been aggressively adapter trimmed.
</p>
</li>
</ul>
</div>
<ul>
<li><strong>Per sequence GC content</strong> plot displays the GC distribution over all sequences. In a random library you expect a roughly normal GC content distribution. An unusually shaped or shifted distribution could indicate contamination or some systematic bias:</li>
</ul>
<pre class="r"><code>qc_plot(qc, "Per sequence GC content")</code></pre>
<div class="figure" style="text-align: center">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-per-sequence-GC-content-1.png" alt="fastqcr" width="384" />
<p class="caption">
fastqcr
</p>
</div>
<div class="success">
<p>
You can generate theoretical GC content curve files using an R package called <a href="https://github.com/mikelove/fastqcTheoreticalGC">fastqcTheoreticalGC</a> written by Mike Love.
</p>
</div>
<ul>
<li><strong>Per base N content</strong>. If a sequencer is unable to make a base call with sufficient confidence then it will normally substitute an N rather than a conventional base call. This module plots out the percentage of base calls at each position for which an N was called.</li>
</ul>
<pre class="r"><code>qc_plot(qc, "Per base N content")</code></pre>
<div class="figure" style="text-align: center">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-per-base-N-content-1.png" alt="fastqcr" width="384" />
<p class="caption">
fastqcr
</p>
</div>
<p>Problems:</p>
<div class="warning">
<ul>
<li>
<strong>warning</strong> if any position shows an N content of >5%.
</li>
<li>
<strong>failure</strong> if any position shows an N content of >20%.
</li>
</ul>
</div>
<p>Common reasons for problems:</p>
<div class="block">
<ul>
<li>
General loss of quality.
</li>
<li>
Very biased sequence composition in the library.
</li>
</ul>
</div>
<ul>
<li><strong>Sequence length distribution</strong> module reports if all sequences have the same length or not. For some sequencing platforms it is entirely normal to have different read lengths so warnings here can be ignored. In many cases this will produce a simple graph showing a peak only at one size. This module will raise an error if any of the sequences have zero length.</li>
</ul>
<pre class="r"><code>qc_plot(qc, "Sequence length distribution")</code></pre>
<div class="figure" style="text-align: center">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-sequence-length-distribution-1.png" alt="fastqcr" width="384" />
<p class="caption">
fastqcr
</p>
</div>
<ul>
<li><strong>Sequence duplication levels</strong>. This module counts the degree of duplication for every sequence in a library and creates a plot showing the relative number of sequences with different degrees of duplication. A high level of duplication is more likely to indicate some kind of enrichment bias (e.g., PCR over-amplification).</li>
</ul>
<pre class="r"><code>qc_plot(qc, "Sequence duplication levels")</code></pre>
<div class="figure" style="text-align: center">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-sequence-duplication-levels-1.png" alt="fastqcr" width="384" />
<p class="caption">
fastqcr
</p>
</div>
<p>Problems:</p>
<div class="warning">
<ul>
<li>
<strong>warning</strong> if non-unique sequences make up more than 20% of the total.
</li>
<li>
<strong>failure</strong> if non-unique sequences make up more than 50% of the total.
</li>
</ul>
</div>
<p>Common reasons for problems:</p>
<div class="block">
<ul>
<li>
<p>
Technical duplicates arising from PCR artifacts
</p>
</li>
<li>
<p>
Biological duplicates which are natural collisions where different copies of exactly the same sequence are randomly selected.
</p>
</li>
</ul>
<p>
In RNA-seq data, duplication levels can reach even 40%. Nevertheless, while analyzing transcriptome sequencing data, we should not remove these duplicates because we do not know whether they represent PCR duplicates or high gene expression of our samples.
</p>
</div>
<ul>
<li><strong>Overrepresented sequences</strong> section gives information about primer or adaptor contamination. Finding that a single sequence is very overrepresented in the set means either that it is highly biologically significant, or that the library is contaminated or not as diverse as you expected. This module lists all of the sequences which make up more than 0.1% of the total.</li>
</ul>
<pre class="r"><code>qc_plot(qc, "Overrepresented sequences")</code></pre>
<div class="figure" style="text-align: center">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-Overrepresented-sequences-1.png" alt="fastqcr" width="384" />
<p class="caption">
fastqcr
</p>
</div>
<p>Problems:</p>
<div class="warning">
<ul>
<li>
<strong>warning</strong> if any sequence is found to represent more than 0.1% of the total.
</li>
<li>
<strong>failure</strong> if any sequence is found to represent more than 1% of the total.
</li>
</ul>
</div>
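<p>The idea behind these thresholds can be sketched in base R by tabulating a toy set of reads and keeping the sequences above 0.1% of the total. The reads below are made up (the unique ones are placeholders, not real sequences), and this is not FastQC's actual procedure.</p>
<pre class="r"><code># Toy library of 2000 reads (made-up data)
reads <- c(rep("AGATCGGAAGAGC", 30),     # 1.5% of the reads
           rep("TTTTTTTTTTTTT", 5),      # 0.25% of the reads
           sprintf("READ%04d", 1:1965))  # placeholder reads, each seen once

# Percentage of the library accounted for by each distinct sequence
pct <- 100 * table(reads) / length(reads)

# Keep the sequences above the 0.1% threshold and label them
over <- sort(pct[pct > 0.1], decreasing = TRUE)
data.frame(sequence = names(over),
           percent  = as.numeric(over),
           status   = ifelse(as.numeric(over) > 1, "FAIL", "WARN"))</code></pre>
<p>Only the two repeated sequences are reported here; each placeholder read accounts for 0.05% of the library and stays below the threshold.</p>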
<p>Common reasons for problems:</p>
<div class="block">
<p>
Small RNA libraries, where sequences are not subjected to random fragmentation and the same sequence may naturally be present in a significant proportion of the library.
</p>
</div>
<ul>
<li><strong>Adapter content</strong> module checks the presence of read-through adapter sequences. It is useful to know if your library contains a significant amount of adapter in order to be able to assess whether you need to adapter trim or not.</li>
</ul>
<pre class="r"><code>qc_plot(qc, "Adapter content")</code></pre>
<div class="figure" style="text-align: center">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-adapter-content-1.png" alt="fastqcr" width="384" />
<p class="caption">
fastqcr
</p>
</div>
<p>Problems:</p>
<div class="warning">
<ul>
<li>
<strong>warning</strong> if any sequence is present in more than 5% of all reads.
</li>
<li>
<strong>failure</strong> if any sequence is present in more than 10% of all reads.
</li>
</ul>
</div>
<div class="block">
<p>
A warning or failure means that the sequences will need to be adapter trimmed before proceeding with any downstream analysis.
</p>
</div>
<ul>
<li><strong>K-mer content</strong></li>
</ul>
<pre class="r"><code>qc_plot(qc, "Kmer content")</code></pre>
<div class="figure" style="text-align: center">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/r-packages/fastqcr/fastqcr-kmer-content-1.png" alt="fastqcr" width="384" />
<p class="caption">
fastqcr
</p>
</div>
</div>
<div id="useful-links" class="section level2">
<h2>Useful Links</h2>
<ul>
<li>FastQC report for a <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html">good Illumina dataset</a></li>
<li>FastQC report for a <a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html">bad Illumina dataset</a></li>
<li><a href="http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/">Online documentation for each FastQC report</a></li>
</ul>
</div>
<div id="infos" class="section level2">
<h2>Infos</h2>
<p><span class="warning"> This analysis has been performed using <strong>R software</strong> (ver. 3.3.2). </span></p>
</div>

<script>jQuery(document).ready(function () {
    jQuery('#rdoc h1').addClass('wiki_paragraph1');
    jQuery('#rdoc h2').addClass('wiki_paragraph2');
    jQuery('#rdoc h3').addClass('wiki_paragraph3');
    jQuery('#rdoc h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->


<!-- END HTML -->]]></description>
			<pubDate>Tue, 11 Apr 2017 20:54:25 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Genomics]]></title>
			<link>https://www.sthda.com/english/wiki/genomics</link>
			<guid>https://www.sthda.com/english/wiki/genomics</guid>
			<description><![CDATA[Genomics data analysis: gene expression, miRNA expression, RNA and DNA sequencing, ChIP sequencing<br />
<br />
<!-- START HTML -->

            
           
            
  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<header><h1>
CHAPTER I : R basics and exploratory data analysis
</h1></header>

<ol style="list-style-type: decimal">
<li><a href="https://www.sthda.com/english/english/wiki/what-we-measure-and-why">What we measure and why</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/r-programming-skills">R programming skills</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/exploratory-data-analysis">Exploratory data analysis (EDA)</a></li>
</ol>


<header><h1>
CHAPTER II : Basic bioconductor infrastructure
</h1></header>

<ol style="list-style-type: decimal">
<li><a href="https://www.sthda.com/english/english/wiki/iranges">IRanges</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/granges-and-grangeslist">GRanges and GRangesList</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/expressionset-and-summarizedexperiment">ExpressionSet and SummarizedExperiment</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/installing-from-github-com">Installing from github.com</a></li>
</ol>


<header><h1>
CHAPTER III : Microarray data
</h1></header>

<ol style="list-style-type: decimal">
<li><a href="https://www.sthda.com/english/english/wiki/affymetrix-cel-files">Affymetrix CEL files</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/agilent-data">Agilent data</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/eda-for-microarray-data">EDA for microarray data</a></li>
</ol>


<header><h1>
CHAPTER IV : High-throughput Sequencing
</h1></header>

<ul>
<li><a href="https://www.sthda.com/english/english/wiki/mapping-algorithms-and-softwares">Mapping algorithms and softwares</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/tophat2-download-build-reference-genome-and-align-the-reads-to-the-reference-genome">Tophat2 : Download, build reference genome and align the reads to the reference genome</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/exploratory-data-analysis-for-next-generation-sequencing">Exploratory data analysis for next generation sequencing</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/read-counting-ngs">Read counting - NGS</a></li>
</ul>


<header><h1>
CHAPTER V : Visualizing next generation sequencing data
</h1></header>

<p>We will try four ways to look at NGS coverage: using the standalone <code>Java program IGV</code>, using simple <code>plot</code> commands, and using the <code>Gviz</code> and <code>ggbio</code> packages in Bioconductor.</p>
<ol style="list-style-type: decimal">
<li><a href="https://www.sthda.com/english/english/wiki/igv-integrative-genomics-viewer">IGV - Integrative Genomics Viewer</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/visualize-ngs-data-with-r-and-bioconductor">Visualize NGS data with R and Bioconductor</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/ggbio-visualize-genomic-data">ggbio - Visualize genomic data</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/gviz-visualize-genomic-data">Gviz - Visualize genomic data</a></li>
</ol>


<header><h1>
CHAPTER VII : RNA-sequencing
</h1></header>
  
<ul>
<li><a href="https://www.sthda.com/english/english/wiki/introduction-to-rna-sequencing">Introduction to RNA sequencing</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/rna-sequencing-data-analysis-alignment-and-reads-counting-using-cufflinks">RNA sequencing data analysis - alignment and reads counting using cufflinks</a>
</li>
<li><a href="https://www.sthda.com/english/english/wiki/rna-sequencing-data-analysis-counting-normalization-and-differential-expression">RNA sequencing data analysis - Counting, normalization and differential expression</a>
</li>
<li><a href="https://www.sthda.com/english/english/wiki/rna-seq-differential-expression-work-flow-using-deseq2">RNA-Seq differential expression work flow using DESeq2</a></li>
</ul>


<header><h1>
CHAPTER VIII : Genomic ToolKits
</h1></header>
<ul>
<li><a href="https://www.sthda.com/english/english/wiki/install-sra-toolkit">Install SRA toolkit</a></li>
<li><a href="https://www.sthda.com/english/english/wiki/sra-to-fastq-file">SRA to FASTQ file</a></li>
</ul>

<script>jQuery(document).ready(function () {jQuery('h1,h2,h3,h4').addClass('formatter-title');});//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->



<!-- END HTML -->]]></description>
			<pubDate>Tue, 14 Oct 2014 00:17:34 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[MIT licence]]></title>
			<link>https://www.sthda.com/english/wiki/mit-licence</link>
			<guid>https://www.sthda.com/english/wiki/mit-licence</guid>
			<description><![CDATA[Copyright (c) 2013 Rafael Irizarry and Michael Love<br />
<br />
Permission is hereby granted, free of charge, to any person obtaining a copy<br />
of this software and associated documentation files (the "Software"), to deal<br />
in the Software without restriction, including without limitation the rights<br />
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell<br />
copies of the Software, and to permit persons to whom the Software is<br />
furnished to do so, subject to the following conditions:<br />
<br />
The above copyright notice and this permission notice shall be included in all<br />
copies or substantial portions of the Software.<br />
<br />
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR<br />
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,<br />
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE<br />
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER<br />
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,<br />
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE<br />
SOFTWARE.]]></description>
			<pubDate>Wed, 24 Sep 2014 16:34:36 +0200</pubDate>
			
		</item>
		
	</channel>
</rss>
