<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Fri, 01 May 2026 01:54:25 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Easy Guides]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/wiki/14" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles of the category: High-throughput Sequencing]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[Tophat2 : Download, build reference genome and align the reads to the reference genome]]></title>
			<link>https://www.sthda.com/english/wiki/tophat2-download-build-reference-genome-and-align-the-reads-to-the-reference-genome</link>
			<guid>https://www.sthda.com/english/wiki/tophat2-download-build-reference-genome-and-align-the-reads-to-the-reference-genome</guid>
			<description><![CDATA[<!-- START HTML -->

 <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#objectives">Objectives</a></li>
<li><a href="#download-data">Download data</a></li>
<li><a href="#download-the-reference-genome">Download the reference genome</a><ul>
<li><a href="#download-reference-genome-from-ensembl">Download reference genome from Ensembl</a></li>
<li><a href="#download-reference-genome-from-ucsc">Download reference genome from UCSC</a></li>
</ul></li>
<li><a href="#get-gene-model-annotations">Get gene model annotations</a></li>
<li><a href="#build-the-reference-index">Build the reference index</a></li>
<li><a href="#workflow">Workflow</a><ul>
<li><a href="#assess-sequence-quality-control-with-shortread">Assess sequence quality control with ShortRead</a></li>
<li><a href="#prepare-your-sample-metadata-informations">Prepare your sample metadata informations</a></li>
<li><a href="#align-the-reads-to-reference-genome-using-tophat2">Align the reads to reference genome using tophat2</a></li>
<li><a href="#use-r-to-create-the-list-of-shell-commands">Use R to create the list of shell commands :</a></li>
<li><a href="#sort-and-index-the-bam-files-and-create-sam-files">Sort and index the BAM files and create SAM files</a></li>
</ul></li>
<li><a href="#inspect-alignments-with-igv">Inspect alignments with IGV</a></li>
</ul>
</div>

<hr />
<p><span class="warning"> This analysis was performed using R (ver. 3.1.0). </span></p>
<div id="objectives" class="section level2">
<h2>Objectives</h2>
<p>The aim of this article is to show you:</p>
<ul>
<li>How to download and build reference genome?
</li>
<li>How to align the reads using Tophat2?
</li>
<li>How to inspect BAM file using IGV?</li>
</ul>
</div>
<div id="download-data" class="section level2">
<h2>Download data</h2>
<p>The data from <a href="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE57478">Popovic et al., (GEO accession number: GSE57478)</a> where used in the following example. The SRA files are available here : <a href="http://www.ncbi.nlm.nih.gov/sra?term=SRP032510"><a href="http://www.ncbi.nlm.nih.gov/sra?term=SRP032510">http://www.ncbi.nlm.nih.gov/sra?term=SRP032510</a></a>.</p>
<p><span class="warning"> Download the SRA files and convert to FASTQ files. This is described <a href="https://www.sthda.com/english/english/wiki/from-sra-to-fastq-file">here</a>. </span></p>
</div>
<div id="download-the-reference-genome" class="section level2">
<h2>Download the reference genome</h2>
<ul>
<li>Create a directory called “genome/” to save your reference genome.</li>
</ul>
<p><span class="success">Note that bowtie2/tophat2 indices for many commonly used reference genomes can be downloaded directly from <a href="http://tophat.cbcb.umd.edu/igenomes.html">http://tophat.cbcb.umd.edu/igenomes.html</a>.</span></p>
<p>Reference genome index (from FASTA file) for bowtie2/tophat2, can be build by following the explanation down below.</p>
<p>User have to download the reference genome sequence for the organism under study in (compressed) FASTA format. This can be done from Ensembl and UCSC databases among many others.</p>
<p>Ensembl FTP server : <a href="http://www.ensembl.org/info/data/ftp/index.html">http://www.ensembl.org/info/data/ftp/index.html</a>
UCSC FTP server : <a href="ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/">ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/</a></p>
<p><span class="notice"> For Ensembl, choose the <strong>FASTA (DNA)</strong> link instead of <strong>FASTA (cDNA)</strong>, since alignments to the genome, not the transcriptome, are desired. </span></p>
<div id="download-reference-genome-from-ensembl" class="section level3">
<h3>Download reference genome from Ensembl</h3>
<p>In this article, homo sapiens reference genome from Ensembl database is used. For homo sapiens the file labeled <strong>toplevel</strong> combines all chromosomes. Download and uncompress the reference genome, using the following UNIX commands :</p>
<p><span class="bash"> wget <a href="ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz">ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz</a> gunzip Homo_sapiens.GRCh38.dna.toplevel.fa.gz </span></p>
</div>
<div id="download-reference-genome-from-ucsc" class="section level3">
<h3>Download reference genome from UCSC</h3>
<ul>
<li><a href="ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/">Click here to download</a> the data corresponding to your organism of interest</li>
<li>Click on “Homo sapiens” -> bigZips -> download the “chromFa.tar.gz”</li>
</ul>
<p>UNIX command :</p>
<p><span class="bash"> wget <a href="ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/bigZips/chromFa.tar.gz">ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/bigZips/chromFa.tar.gz</a> gunzip chromFa.tar.gz </span></p>
</div>
</div>
<div id="get-gene-model-annotations" class="section level2">
<h2>Get gene model annotations</h2>
<p>Download a GTF file with gene models for the organism of interest.</p>
<p>Homo sapiens gene model annotation can be downloaded as follow:</p>
<p><span class="bash"> wget <a href="ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz">ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz</a> gunzip Homo_sapiens.GRCh38.77.gtf.gz </span></p>
<p><span class="warning"> Always download the FASTA reference sequence and the GTF annotation data from the same resource provider. </span></p>
</div>
<div id="build-the-reference-index" class="section level2">
<h2>Build the reference index</h2>
<p>Before reads can be aligned, the reference FASTA files need to be preprocessed into an index that allows the aligner easy access.</p>
<p>To build a bowtie2-specific index from the FASTA file use the command:</p>
<p><span class="bash">bowtie2-build -f Homo_sapiens.GRCh38.dna.toplevel.fa. Homo_sapiens_GRCh38</span></p>
<p>A set of BT2 files will be produced, with names starting with Homo_sapiens_GRCh38 as specified.</p>
<p><span class="error"> This procedure needs to be run only once for each reference genome used. As mentioned, pre-built indices for many commonly used genomes are available from <a href="http://tophat.cbcb.umd.edu/igenomes.html">http://tophat.cbcb.umd.edu/igenomes.html</a>. </span></p>
</div>
<div id="workflow" class="section level2">
<h2>Workflow</h2>
<div id="assess-sequence-quality-control-with-shortread" class="section level3">
<h3>Assess sequence quality control with ShortRead</h3>
<ol style="list-style-type: decimal">
<li><p>From R, set your working directory (using `setwd’ command) to the directory where FASTQ file are situated.</p></li>
<li><p>Run the folllowing R code :</p></li>
</ol>
<pre class="r"><code>library("ShortRead")
QC = qa(dirPath=".", pattern=".fastq$", type="fastq")
report(QC, type="html", dest="fastqQAreport")</code></pre>
<p><span class="success"> The output is an html files located in the “fastqQAreport” directory.</span></p>
<ol start="3" style="list-style-type: decimal">
<li>Use a web browser to inspect the generated HTML file.</li>
</ol>
</div>
<div id="prepare-your-sample-metadata-informations" class="section level3">
<h3>Prepare your sample metadata informations</h3>
<pre class="r"><code># Initial SRA info table
sri<-read.csv("SraRunInfo.csv", stringsAsFactors=FALSE)
# Prepare sample information table
#Add important descriptive columns
samples=as.data.frame(
  list(
     name=c(&amp;#39;KMS11_GSK343_1&amp;#39;, &amp;#39;KMS11_GSK343_2&amp;#39;,&amp;#39;TKO_GSK343_1&amp;#39;, &amp;#39;TKO_GSK343_2&amp;#39;,
           &amp;#39;KMS11_GSK669_1&amp;#39;, &amp;#39;KMS11_GSK669_2&amp;#39;, &amp;#39;TKO_GSK669_1&amp;#39;, &amp;#39;TKO_GSK669_2&amp;#39;),
     fastq=paste(sri$Run, ".fastq", sep=""),
     avgLength=sri$avgLength,
     data_type=sri$LibraryStrategy,
     layout = sri$LibraryLayout,
     geoid=sri$SampleName,
     cell_line=c(1, 1, 2, 2, 1,1,2,2),
     treatment=c("ctr", "ctr", "ctr", "ctr", "ezh2i", "ezh2i", "ezh2i", "ezh2i")
    )
  )
rownames(samples)=samples$name
samples[sort(rownames(samples)), ]</code></pre>
<pre><code>##                          name            fastq avgLength data_type layout      geoid cell_line treatment
## KMS11_GSK343_1 KMS11_GSK343_1 SRR1282056.fastq        51   RNA-Seq SINGLE GSM1383539         1       ctr
## KMS11_GSK343_2 KMS11_GSK343_2 SRR1282057.fastq        51   RNA-Seq SINGLE GSM1383540         1       ctr
## KMS11_GSK669_1 KMS11_GSK669_1 SRR1282060.fastq        51   RNA-Seq SINGLE GSM1383543         1     ezh2i
## KMS11_GSK669_2 KMS11_GSK669_2 SRR1282061.fastq        51   RNA-Seq SINGLE GSM1383544         1     ezh2i
## TKO_GSK343_1     TKO_GSK343_1 SRR1282058.fastq        51   RNA-Seq SINGLE GSM1383541         2       ctr
## TKO_GSK343_2     TKO_GSK343_2 SRR1282059.fastq        51   RNA-Seq SINGLE GSM1383542         2       ctr
## TKO_GSK669_1     TKO_GSK669_1 SRR1282062.fastq        51   RNA-Seq SINGLE GSM1383545         2     ezh2i
## TKO_GSK669_2     TKO_GSK669_2 SRR1282063.fastq        51   RNA-Seq SINGLE GSM1383546         2     ezh2i</code></pre>
<pre class="r"><code>head(samples)</code></pre>
<pre><code>##                          name            fastq avgLength data_type layout      geoid cell_line treatment
## KMS11_GSK343_1 KMS11_GSK343_1 SRR1282056.fastq        51   RNA-Seq SINGLE GSM1383539         1       ctr
## KMS11_GSK343_2 KMS11_GSK343_2 SRR1282057.fastq        51   RNA-Seq SINGLE GSM1383540         1       ctr
## TKO_GSK343_1     TKO_GSK343_1 SRR1282058.fastq        51   RNA-Seq SINGLE GSM1383541         2       ctr
## TKO_GSK343_2     TKO_GSK343_2 SRR1282059.fastq        51   RNA-Seq SINGLE GSM1383542         2       ctr
## KMS11_GSK669_1 KMS11_GSK669_1 SRR1282060.fastq        51   RNA-Seq SINGLE GSM1383543         1     ezh2i
## KMS11_GSK669_2 KMS11_GSK669_2 SRR1282061.fastq        51   RNA-Seq SINGLE GSM1383544         1     ezh2i</code></pre>
<p><span class="success"> Since the downstream statistical analysis of differential expression relies on this table, carefully inspect (and correct, if necessary) the metadata table. </span></p>
</div>
<div id="align-the-reads-to-reference-genome-using-tophat2" class="section level3">
<h3>Align the reads to reference genome using tophat2</h3>
</div>
<div id="use-r-to-create-the-list-of-shell-commands" class="section level3">
<h3>Use R to create the list of shell commands :</h3>
<pre class="r"><code>genome="genome/Homo_sapiens_GRCh38"
cmd=with(samples, paste("tophat2 -o ", name, " -p 10  ",genome, " ", fastq, sep="" ))
for(c in cmd) system(c)</code></pre>
<pre><code>## tophat2 -o KMS11_GSK343_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282056.fastq 
## tophat2 -o KMS11_GSK343_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282057.fastq 
## tophat2 -o TKO_GSK343_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282058.fastq 
## tophat2 -o TKO_GSK343_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282059.fastq 
## tophat2 -o KMS11_GSK669_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282060.fastq 
## tophat2 -o KMS11_GSK669_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282061.fastq 
## tophat2 -o TKO_GSK669_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282062.fastq 
## tophat2 -o TKO_GSK669_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282063.fastq</code></pre>
<p><span class="error"> In the call to tophat2, the option -o specifies the output directory, -p specifies the number of threads to use (affect run times, vary depending on the resources available). The first argument is the name of the index. The second argument is a list of all FASTQ files. </span></p>
<p><strong>Other parameters can be specified to tophat2 :</strong></p>
<p><span class="bash"> tophat2 -G Homo_sapiens.GRCh38.77.gtf -o output_dir -p 10 –-no-coverage-search genome/Homo_sapiens_GRCh38 file.fastq </span></p>
<p><span class="notice"> - The option -G points tophat2 to a GTF file of annotation to facilitate mapping reads across exon-exon junctions (some of witch can be found de novo).</p>
<ul>
<li>The coverage-search algorithm was turned off because it took too long. </span></li>
</ul>
<p>Note that the FASTQ files are concatenated with commas, without spaces. For experiments with paired-end reads, pairs of FASTQ files are given as separate arguments and the order in both arguments must match.</p>
<ul>
<li>paired-end sequencing</li>
</ul>
<p><span class="bash"> tophat2 -o tophat_output_dir -p 8 –no-coverage-search /path/to/genome/Bowtie2Index/genome file_1.fastq file_2.fastq samtools sort -n file_tophat_out/accepted_hits.bam _sorted </span></p>
<ul>
<li>Single-end sequencing <span class="bash"> tophat2 -o tophat_output_dir -p 8 /path/to/genome/Bowtie2Index/genome file.fastq samtools sort -n file_tophat_out/accepted_hits.bam _sorted </span></li>
</ul>
<div id="run-the-command" class="section level4">
<h4>Run the command</h4>
<p>The commands can be executed by copy and paste in UNIX terminal. <code>system</code> function can be used to execute these command direct from R. The list of the commands can also be stored in a text file and use UNIX source command.</p>
</div>
</div>
<div id="sort-and-index-the-bam-files-and-create-sam-files" class="section level3">
<h3>Sort and index the BAM files and create SAM files</h3>
<p>BAM files are organized into a single directory. SAM files is required when you want to count reads with htseq-count.</p>
<pre class="r"><code>for(i in 1:nrow(samples)){
  lib=samples$name[i]
  bamFile=file.path(lib, "accepted_hits.bam")
  #sort by name and convert to SAM for htseq-count counting
  system(paste0("samtools sort -n ",bamFile," ",lib,"_sn")) #sort
  system(paste0("samtools view -o ",lib,"_sn.sam ",lib,"_sn.bam")) #convert to sam
 
 # sort by position and index for IGV
 system(paste0("samtools sort ",bamFile," ",lib,"_s"))
 system(paste0("samtools index ",lib,"_s.bam"))
}</code></pre>
<p>The following code is executed :</p>
<pre><code>## samtools sort -n KMS11_GSK343_1/accepted_hits.bam KMS11_GSK343_1_sn 
## samtools view -o KMS11_GSK343_1_sn.sam KMS11_GSK343_1_sn.bam 
## samtools sort KMS11_GSK343_1/accepted_hits.bam KMS11_GSK343_1_s 
## samtools index KMS11_GSK343_1_s.bam 
## 
## samtools sort -n KMS11_GSK343_2/accepted_hits.bam KMS11_GSK343_2_sn 
## samtools view -o KMS11_GSK343_2_sn.sam KMS11_GSK343_2_sn.bam 
## samtools sort KMS11_GSK343_2/accepted_hits.bam KMS11_GSK343_2_s 
## samtools index KMS11_GSK343_2_s.bam 
## 
## samtools sort -n TKO_GSK343_1/accepted_hits.bam TKO_GSK343_1_sn 
## samtools view -o TKO_GSK343_1_sn.sam TKO_GSK343_1_sn.bam 
## samtools sort TKO_GSK343_1/accepted_hits.bam TKO_GSK343_1_s 
## samtools index TKO_GSK343_1_s.bam 
## 
## samtools sort -n TKO_GSK343_2/accepted_hits.bam TKO_GSK343_2_sn 
## samtools view -o TKO_GSK343_2_sn.sam TKO_GSK343_2_sn.bam 
## samtools sort TKO_GSK343_2/accepted_hits.bam TKO_GSK343_2_s 
## samtools index TKO_GSK343_2_s.bam 
## 
## samtools sort -n KMS11_GSK669_1/accepted_hits.bam KMS11_GSK669_1_sn 
## samtools view -o KMS11_GSK669_1_sn.sam KMS11_GSK669_1_sn.bam 
## samtools sort KMS11_GSK669_1/accepted_hits.bam KMS11_GSK669_1_s 
## samtools index KMS11_GSK669_1_s.bam 
## 
## samtools sort -n KMS11_GSK669_2/accepted_hits.bam KMS11_GSK669_2_sn 
## samtools view -o KMS11_GSK669_2_sn.sam KMS11_GSK669_2_sn.bam 
## samtools sort KMS11_GSK669_2/accepted_hits.bam KMS11_GSK669_2_s 
## samtools index KMS11_GSK669_2_s.bam 
## 
## samtools sort -n TKO_GSK669_1/accepted_hits.bam TKO_GSK669_1_sn 
## samtools view -o TKO_GSK669_1_sn.sam TKO_GSK669_1_sn.bam 
## samtools sort TKO_GSK669_1/accepted_hits.bam TKO_GSK669_1_s 
## samtools index TKO_GSK669_1_s.bam 
## 
## samtools sort -n TKO_GSK669_2/accepted_hits.bam TKO_GSK669_2_sn 
## samtools view -o TKO_GSK669_2_sn.sam TKO_GSK669_2_sn.bam 
## samtools sort TKO_GSK669_2/accepted_hits.bam TKO_GSK669_2_s 
## samtools index TKO_GSK669_2_s.bam</code></pre>
<p><span class="success"> For each original accepted_hits.bam file, sorted-by-name SAM and BAM files (for htseq-count), as well as a sorted-by-chromosome-position BAM file (for IGV) are created. </span></p>
</div>
</div>
<div id="inspect-alignments-with-igv" class="section level2">
<h2>Inspect alignments with IGV</h2>
<p><span class="success">
An extensive explanation is described by clicking on this link : <a href="https://www.sthda.com/english/english/wiki/igv-integrative-genomics-viewer">igv-integrative-genomics-viewer</a>.
</span></p>
<p>IGV can be downloaded here : <a href="http://www.broadinstitute.org/igv/">http://www.broadinstitute.org/igv/</a></p>
<p>Briefly, start IGV <i class="fa fa-2x fa-long-arrow-right"></i> Select the correct genome (Human ) <i class="fa fa-2x fa-long-arrow-right"></i> Load BAM file and the GTF file <i class="fa fa-2x fa-long-arrow-right"></i> Search for a gene of interest and zoom in.</p>
<hr />
<p><strong>References:</strong> <a href="http://master.bioconductor.org/help/course-materials/2013/CSAMA2013/">http://master.bioconductor.org/help/course-materials/2013/CSAMA2013/</a></p>
</div>

<script>jQuery(document).ready(function () {jQuery('h1,h2,h3,h4').addClass('formatter-title');});//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->


<!-- END HTML -->]]></description>
			<pubDate>Tue, 07 Oct 2014 13:43:21 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Read counting - NGS]]></title>
			<link>https://www.sthda.com/english/wiki/read-counting-ngs</link>
			<guid>https://www.sthda.com/english/wiki/read-counting-ngs</guid>
			<description><![CDATA[<!-- START HTML -->

 <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#load-transcript-database">Load transcript database</a></li>
<li><a href="#load-bam-files">Load bam files</a></li>
<li><a href="#read-counting-in-bam-files">Read counting in bam files</a></li>
<li><a href="#exploratory-data-analysis-of-the-counts">Exploratory data analysis of the counts:</a></li>
<li><a href="#footnotes">Footnotes</a><ul>
<li><a href="#methods-for-counting-reads-which-overlap-features">Methods for counting reads which overlap features</a></li>
</ul></li>
<li><a href="#licence">Licence</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>

<hr />
<pre class="warning"><code>This analysis was performed using R (ver. 3.1.0).</code></pre>
<div id="introduction" class="section level2">
<h2>Introduction</h2>
<p>We will describe how to count next generation sequencing reads, which fall into genomic features. The result is a count matrix which has rows corresponding to genomic ranges and columns which correspond to different experiments or samples.</p>
<pre class="warning"><code>As an example, we will use an RNA-Seq experiment, with files in the `pasillaBamSubset` Bioconductor data package. However, the same functions can be used for DNA-Seq and ChIP-Seq.</code></pre>
</div>
<div id="load-transcript-database" class="section level2">
<h2>Load transcript database</h2>
<pre class="r"><code>#source("http://bioconductor.org/biocLite.R")
#biocLite("pasillaBamSubset")
#biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)</code></pre>
<p>We load a transcript database object. These are prebuilt in R for various well studied organisms, for example <code>TxDb.Hsapiens.UCSC.hg19.knownGene</code>. In addition the <code>makeTranscriptDbFromGFF</code> file can be used to import GFF or GTF gene models. We use the <code>exonsBy</code> function to get a <code>GRangesList</code> object of the exons for each gene.</p>
<pre class="r"><code>#Rename the transcript database
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
#For each gene pull up the exons
grl <- exonsBy(txdb, by="gene")
#exons of the 100th gene
grl[100]</code></pre>
<pre><code>## GRangesList of length 1:
## $FBgn0000286 
## GRanges with 8 ranges and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr2L [4876890, 4879196]      - |      8515        <NA>
##   [2]    chr2L [4877289, 4879196]      - |      8516        <NA>
##   [3]    chr2L [4880294, 4880472]      - |      8517        <NA>
##   [4]    chr2L [4880378, 4880472]      - |      8518        <NA>
##   [5]    chr2L [4881215, 4882492]      - |      8519        <NA>
##   [6]    chr2L [4882865, 4883113]      - |      8520        <NA>
##   [7]    chr2L [4882889, 4883113]      - |      8521        <NA>
##   [8]    chr2L [4882889, 4883341]      - |      8522        <NA>
## 
## ---
## seqlengths:
##      chr2L     chr2R     chr3L     chr3R      chr4      chrX ...  chr3LHet  chr3RHet   chrXHet   chrYHet chrUextra
##   23011544  21146708  24543557  27905053   1351857  22422827 ...   2555491   2517507    204112    347038  29004656</code></pre>
<pre class="r"><code>grl[[100]]</code></pre>
<pre><code>## GRanges with 8 ranges and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr2L [4876890, 4879196]      - |      8515        <NA>
##   [2]    chr2L [4877289, 4879196]      - |      8516        <NA>
##   [3]    chr2L [4880294, 4880472]      - |      8517        <NA>
##   [4]    chr2L [4880378, 4880472]      - |      8518        <NA>
##   [5]    chr2L [4881215, 4882492]      - |      8519        <NA>
##   [6]    chr2L [4882865, 4883113]      - |      8520        <NA>
##   [7]    chr2L [4882889, 4883113]      - |      8521        <NA>
##   [8]    chr2L [4882889, 4883341]      - |      8522        <NA>
##   ---
##   seqlengths:
##        chr2L     chr2R     chr3L     chr3R      chr4      chrX ...  chr3LHet  chr3RHet   chrXHet   chrYHet chrUextra
##     23011544  21146708  24543557  27905053   1351857  22422827 ...   2555491   2517507    204112    347038  29004656</code></pre>
<pre class="r"><code>grl[[100]][1]#first exon of the 100th gene</code></pre>
<pre><code>## GRanges with 1 range and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr2L [4876890, 4879196]      - |      8515        <NA>
##   ---
##   seqlengths:
##        chr2L     chr2R     chr3L     chr3R      chr4      chrX ...  chr3LHet  chr3RHet   chrXHet   chrYHet chrUextra
##     23011544  21146708  24543557  27905053   1351857  22422827 ...   2555491   2517507    204112    347038  29004656</code></pre>
</div>
<div id="load-bam-files" class="section level2">
<h2>Load bam files</h2>
<p>These functions in the <code>pasillaBamSubset</code> package just point us to the BAM files.</p>
<pre class="r"><code>fl1 <- untreated1_chr4()
fl2 <- untreated3_chr4()
fl1</code></pre>
<pre><code>## [1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/pasillaBamSubset/extdata/untreated1_chr4.bam"</code></pre>
</div>
<div id="read-counting-in-bam-files" class="section level2">
<h2>Read counting in bam files</h2>
<p>We need the following libraries for counting BAM files.</p>
<pre class="r"><code>library(Rsamtools)
library(GenomicRanges)</code></pre>
<pre class="warning"><code>Note: if you are using Bioconductor version 14, paired with R 3.1, you should also load this library. You do not need to load this library, and it will not be available to you, if you are using Bioconductor version 13, paired with R 3.0.x.</code></pre>
<pre class="r"><code>#Required for Bioc 14 and R 3.1
library(GenomicAlignments)</code></pre>
<p>We specify the files using the <code>BamFileList</code> function. This allows us to tell the read counting functions how many reads to load at once. For larger files, yield size of 1 million reads might make sense.</p>
<p>The yield size is how many reads that we want to pull from each bam file.</p>
<pre class="r"><code>fls <- BamFileList(c(fl1, fl2), yieldSize=5e4)
names(fls) <- c("first","second")</code></pre>
<p>The following function counts the overlaps of the reads in the BAM files in the features, which are the genes of Drosophila. We tell the counting function to ignore the strand, i.e., to allow minus strand reads to count in plus strand genes, and vice versa.</p>
<pre class="r"><code>so1 <- summarizeOverlaps(features=grl,
                         reads=fls,
                         ignore.strand=TRUE)

#summarized experiment
#The rowData contains information about the features
#and the colData contains information about the files which we specified.
so1</code></pre>
<pre><code>## class: SummarizedExperiment 
## dim: 15682 2 
## exptData(0):
## assays(1): counts
## rownames(15682): FBgn0000003 FBgn0000008 ... FBgn0264726 FBgn0264727
## rowData metadata column names(0):
## colnames(2): first second
## colData names(0):</code></pre>
<p><strong>Others important parameters of summarizeOverlaps are :</strong> <strong>inter.feature</strong>: default is true. We are only interested in counting reads which uniquely align to one feature. <strong>singleEnd</strong> : default is true. Which says that these files have single-end reads instead of paired-end reads. <strong>fragments</strong> : If you are counting reads in a paired-end experiment, it’s specifying that you also want to count reads where only one of the two reads in a pair aligns.</p>
<p>We can examine the count matrix, which is stored in the <code>assay</code> slot:</p>
<pre class="r"><code>head(assay(so1))</code></pre>
<pre><code>##             first second
## FBgn0000003     0      0
## FBgn0000008     0      0
## FBgn0000014     0      0
## FBgn0000015     0      0
## FBgn0000017     0      0
## FBgn0000018     0      0</code></pre>
<pre class="r"><code>#sum of the count matrix
#number of reads which aligned-- uniquely-- to these features.
colSums(assay(so1))</code></pre>
<pre><code>##  first second 
## 156469 122872</code></pre>
<pre class="r"><code>#Information about the features (genes)
rowData(so1)</code></pre>
<pre><code>## GRangesList of length 15682:
## $FBgn0000003 
## GRanges with 1 range and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr3R [2648220, 2648518]      + |     45123        <NA>
## 
## $FBgn0000008 
## GRanges with 13 ranges and 2 metadata columns:
##        seqnames               ranges strand   | exon_id exon_name
##    [1]    chr2R [18024494, 18024531]      +   |   20314      <NA>
##    [2]    chr2R [18024496, 18024713]      +   |   20315      <NA>
##    [3]    chr2R [18024938, 18025756]      +   |   20316      <NA>
##    [4]    chr2R [18025505, 18025756]      +   |   20317      <NA>
##    [5]    chr2R [18039159, 18039200]      +   |   20322      <NA>
##    ...      ...                  ...    ... ...     ...       ...
##    [9]    chr2R [18058283, 18059490]      +   |   20326      <NA>
##   [10]    chr2R [18059587, 18059757]      +   |   20327      <NA>
##   [11]    chr2R [18059821, 18059938]      +   |   20328      <NA>
##   [12]    chr2R [18060002, 18060339]      +   |   20329      <NA>
##   [13]    chr2R [18060002, 18060346]      +   |   20330      <NA>
## 
## ...
## <15680 more elements>
## ---
## seqlengths:
##      chr2L     chr2R     chr3L     chr3R      chr4      chrX ...  chr3LHet  chr3RHet   chrXHet   chrYHet chrUextra
##   23011544  21146708  24543557  27905053   1351857  22422827 ...   2555491   2517507    204112    347038  29004656</code></pre>
<pre class="r"><code>#Information about the samples (here it&amp;#39;s empty)
colData(so1)</code></pre>
<pre><code>## DataFrame with 2 rows and 0 columns</code></pre>
<pre class="r"><code>#add information (sample columns)
colData(so1)$sample <- c("one","two")
colData(so1)</code></pre>
<pre><code>## DataFrame with 2 rows and 1 column
##             sample
##        <character>
## first          one
## second         two</code></pre>
<pre class="r"><code>#metadata  information on the features
#It tells you information which is how this row data was generated.
#It was generated from a transcriptDb using the genomic features package.
#The data source was UCSC, the genome was dm3, the organism was drosphila, etc.
#pretty useful for reproducibility of count tables.
metadata(rowData(so1))</code></pre>
<pre><code>## $genomeInfo
## $genomeInfo$`Db type`
## [1] "TranscriptDb"
## 
## $genomeInfo$`Supporting package`
## [1] "GenomicFeatures"
## 
## $genomeInfo$`Data source`
## [1] "UCSC"
## 
## $genomeInfo$Genome
## [1] "dm3"
## 
## $genomeInfo$Organism
## [1] "Drosophila melanogaster"
## 
## $genomeInfo$`UCSC Table`
## [1] "ensGene"
## 
## $genomeInfo$`Resource URL`
## [1] "http://genome.ucsc.edu/"
## 
## $genomeInfo$`Type of Gene ID`
## [1] "Ensembl gene ID"
## 
## $genomeInfo$`Full dataset`
## [1] "yes"
## 
## $genomeInfo$`miRBase build ID`
## [1] NA
## 
## $genomeInfo$transcript_nrow
## [1] "29173"
## 
## $genomeInfo$exon_nrow
## [1] "76920"
## 
## $genomeInfo$cds_nrow
## [1] "62135"
## 
## $genomeInfo$`Db created by`
## [1] "GenomicFeatures package from Bioconductor"
## 
## $genomeInfo$`Creation time`
## [1] "2014-03-17 16:24:54 -0700 (Mon, 17 Mar 2014)"
## 
## $genomeInfo$`GenomicFeatures version at creation time`
## [1] "1.15.11"
## 
## $genomeInfo$`RSQLite version at creation time`
## [1] "0.11.4"
## 
## $genomeInfo$DBSCHEMAVERSION
## [1] "1.0"</code></pre>
</div>
<div id="exploratory-data-analysis-of-the-counts" class="section level2">
<h2>Exploratory data analysis of the counts:</h2>
<pre class="r"><code>x <- assay(so1)[,1]
hist(x[x > 0], col="grey")
#We can see that there&amp;#39;s at least one very, very large count here (more than 40,000).
#if we exclude accounts over 10,000, we can see a bit more of the distribution.
hist(x[x > 0 &amp; x < 10000], col="grey")
#Scatterplot of the 2 samples in log scale
plot(assay(so1) + 1, log="xy")# +1 to avoid log (0)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/read-counting-eda-of-the-counts1.png" title="plot of chunk eda-of-the-counts" alt="plot of chunk eda-of-the-counts" width="288" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/read-counting-eda-of-the-counts2.png" title="plot of chunk eda-of-the-counts" alt="plot of chunk eda-of-the-counts" width="288" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/read-counting-eda-of-the-counts3.png" title="plot of chunk eda-of-the-counts" alt="plot of chunk eda-of-the-counts" width="288" /></p>
<p><span class="warning"> The second file should actually be counted in a special manner, as it contains pairs of reads which come from a single fragment. We do not want to count these twice, so we set <code>singleEnd = FALSE</code>. Additionally, we specify <code>fragments = TRUE</code> which counts reads if only one of the pair aligns to the features, and the other pair aligns to no feature. </span></p>
<pre class="r"><code># ?untreated3_chr4
# ?summarizeOverlaps
fls <- BamFileList(fl2, yieldSize=5e4)
so2 <- summarizeOverlaps(features=grl,
                         reads=fls,
                         ignore.strand=TRUE,
                         singleEnd=FALSE, 
                         fragments=TRUE)#if only one of a pair of reads maps, we also want to count this.
#we can look at the number of reads which align this time.
#And we can note that the last time, when we count each read alone, we had about two times more reads.
#So 60,000 to 120,000.
colSums(assay(so2))</code></pre>
<pre><code>## untreated3_chr4.bam 
##               65591</code></pre>
<pre class="r"><code>colSums(assay(so1))</code></pre>
<pre><code>##  first second 
## 156469 122872</code></pre>
<pre class="r"><code>plot(assay(so1)[,2], assay(so2)[,1], xlim=c(0,5000), ylim=c(0,5000),
     xlab="single end counting", ylab="paired end counting")
abline(0,1)
abline(0,.5)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/read-counting-unnamed-chunk-10.png" title="plot of chunk unnamed-chunk-10" alt="plot of chunk unnamed-chunk-10" width="288" /></p>
<p>scatter plot: when we counted each read singly and we counted only pairs. If you make a y equals x plot line and a y equals 1/2 x line, you can see essentially these counts are about half, because when we did the single-end counting, we were counting each read instead of each fragment.</p>
</div>
<div id="footnotes" class="section level2">
<h2>Footnotes</h2>
<div id="methods-for-counting-reads-which-overlap-features" class="section level3">
<h3>Methods for counting reads which overlap features</h3>
<p>Bioconductor packages:</p>
<ul>
<li><code>summarizeOverlaps</code> in the <code>GenomicAlignments</code> package</li>
</ul>
<p><a href="http://www.bioconductor.org/packages/devel/bioc/html/GenomicAlignments.html">http://www.bioconductor.org/packages/devel/bioc/html/GenomicAlignments.html</a></p>
<ul>
<li><code>featureCounts</code> in the <code>Rsubread</code> package</li>
</ul>
<p>Liao Y, Smyth GK, Shi W., “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics. 2014 <a href="http://www.ncbi.nlm.nih.gov/pubmed/24227677">http://www.ncbi.nlm.nih.gov/pubmed/24227677</a> <a href="http://bioinf.wehi.edu.au/featureCounts/">http://bioinf.wehi.edu.au/featureCounts/</a></p>
<ul>
<li><code>easyRNAseq</code> package</li>
</ul>
<p>Delhomme N1, Padioleau I, Furlong EE, Steinmetz LM. “easyRNASeq: a bioconductor package for processing RNA-Seq data.” Bioinformatics. 2012. <a href="http://www.ncbi.nlm.nih.gov/pubmed/22847932">http://www.ncbi.nlm.nih.gov/pubmed/22847932</a> <a href="http://www.bioconductor.org/packages/release/bioc/html/easyRNASeq.html">http://www.bioconductor.org/packages/release/bioc/html/easyRNASeq.html</a></p>
<p>Command line tools:</p>
<ul>
<li><code>htseq-count</code>, a program in the <code>htseq</code> Python package</li>
</ul>
<p>Simon Anders, Paul Theodor Pyl, Wolfgang Huber. HTSeq — A Python framework to work with high-throughput sequencing data bioRxiv preprint (2014), doi: <a href="http://dx.doi.org/10.1101/002824">10.1101/002824</a></p>
<p><a href="http://www-huber.embl.de/users/anders/HTSeq/doc/count.html">http://www-huber.embl.de/users/anders/HTSeq/doc/count.html</a></p>
<ul>
<li><p><code>bedtools</code> <a href="https://code.google.com/p/bedtools/">https://code.google.com/p/bedtools/</a></p></li>
<li><p><code>bedops</code> <a href="https://code.google.com/p/bedops/">https://code.google.com/p/bedops/</a></p></li>
</ul>
</div>
</div>
<div id="licence" class="section level2">
<h2>Licence</h2>
<p><a href="https://www.sthda.com/english/english/wiki/mit-licence">Licence</a></p>
</div>
<div id="references" class="section level2">
<h2>References</h2>
<p><a href="https://github.com/genomicsclass">https://github.com/genomicsclass</a></p>
</div>

<script>jQuery(document).ready(function () {jQuery('h1,h2,h3,h4').addClass('formatter-title');});//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->

<!-- END HTML -->]]></description>
			<pubDate>Mon, 29 Sep 2014 15:50:23 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Exploratory data analysis for next generation sequencing]]></title>
			<link>https://www.sthda.com/english/wiki/exploratory-data-analysis-for-next-generation-sequencing</link>
			<guid>https://www.sthda.com/english/wiki/exploratory-data-analysis-for-next-generation-sequencing</guid>
			<description><![CDATA[<!-- START HTML -->
 <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#download-data">Download data</a></li>
<li><a href="#exploratory-data-analysis-for-ngs">Exploratory Data Analysis for NGS</a><ul>
<li><a href="#histogram">Histogram</a></li>
<li><a href="#ma-plot">MA plot</a></li>
</ul></li>
<li><a href="#licence">Licence</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>

<hr />
<pre class="warning"><code>This analysis was performed using R (ver. 3.1.0).</code></pre>
<div id="download-data" class="section level2">
<h2>Download data</h2>
<p>We’ll use a data from Bottomly et al., sequencing two strains of mouse with many biological replicates. This dataset and a number of other sequencing datasets have been compiled from raw data into read counts tables by Frazee, Langmead, and Leek as part of the ReCount project. These datasets are made publicly available at the following website: <a href="http://bowtie-bio.sourceforge.net/recount/">http://bowtie-bio.sourceforge.net/recount/</a>.</p>
<p>We can make similar figures for NGS to the ones shown in the previous sections for microarray data. However, the <strong>log transform does not work</strong> because RNAseq data contains many 0s. One quick way to get around this is by adding a constant (for example 0.5) before taking the log. A typical one is 0.5 which gives us a log2 value of -1 for 0s.</p>
<pre class="r"><code>#Download data
destfile="~/hubiC/Documents/R/doc/english/genomics/bottomly_eset.RData"
if (!file.exists(destfile)) download.file("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData", destfile)
#Load the expression eset
load(destfile)
bottomly.eset</code></pre>
<pre><code>## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 36536 features, 21 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: SRX033480 SRX033488 ... SRX033494 (21 total)
##   varLabels: sample.id num.tech.reps ... lane.number (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSMUSG00000000001 ENSMUSG00000000003 ...
##     ENSMUSG00000090268 (36536 total)
##   fvarLabels: gene
##   fvarMetadata: labelDescription
## experimentData: use &amp;#39;experimentData(object)&amp;#39;
## Annotation:</code></pre>
<pre class="r"><code>#The counts for each sample of reads which align to the first gene.
library("Biobase")
exprs(bottomly.eset)[1,]</code></pre>
<pre><code>## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482 SRX033490 SRX033483 
##       369       744       287       769       348       803       433 
## SRX033476 SRX033478 SRX033479 SRX033472 SRX033473 SRX033474 SRX033475 
##       469       585       321       301       461       309       374 
## SRX033491 SRX033484 SRX033492 SRX033485 SRX033493 SRX033486 SRX033494 
##       781       555       820       294       758       419       857</code></pre>
<pre class="r"><code>#Phenotypic data
pData(bottomly.eset)</code></pre>
<pre><code>##           sample.id num.tech.reps   strain experiment.number lane.number
## SRX033480 SRX033480             1 C57BL/6J                 6           1
## SRX033488 SRX033488             1 C57BL/6J                 7           1
## SRX033481 SRX033481             1 C57BL/6J                 6           2
## SRX033489 SRX033489             1 C57BL/6J                 7           2
## SRX033482 SRX033482             1 C57BL/6J                 6           3
## SRX033490 SRX033490             1 C57BL/6J                 7           3
## SRX033483 SRX033483             1 C57BL/6J                 6           5
## SRX033476 SRX033476             1 C57BL/6J                 4           6
## SRX033478 SRX033478             1 C57BL/6J                 4           7
## SRX033479 SRX033479             1 C57BL/6J                 4           8
## SRX033472 SRX033472             1   DBA/2J                 4           1
## SRX033473 SRX033473             1   DBA/2J                 4           2
## SRX033474 SRX033474             1   DBA/2J                 4           3
## SRX033475 SRX033475             1   DBA/2J                 4           5
## SRX033491 SRX033491             1   DBA/2J                 7           5
## SRX033484 SRX033484             1   DBA/2J                 6           6
## SRX033492 SRX033492             1   DBA/2J                 7           6
## SRX033485 SRX033485             1   DBA/2J                 6           7
## SRX033493 SRX033493             1   DBA/2J                 7           7
## SRX033486 SRX033486             1   DBA/2J                 6           8
## SRX033494 SRX033494             1   DBA/2J                 7           8</code></pre>
</div>
<div id="exploratory-data-analysis-for-ngs" class="section level2">
<h2>Exploratory Data Analysis for NGS</h2>
<p>Something which distinguishes the sequencing experiments from the microarray experiments is that in the sequencing datasets we’ll have a number of features which, for a given sample, a value of exactly zero. Meaning that they were zero reads aligning to that feature for that sample.</p>
<div id="histogram" class="section level3">
<h3>Histogram</h3>
<p>For every sample, I’m going to make a smooth histogram of these log2 plus 1/2 values.</p>
<pre class="r"><code>#Log transformation
Y <- log2(exprs(bottomly.eset) + 0.5)
# library(devtools)
# install_github("rafalib","ririzarr")
library("rafalib")
#smooth histogram of the log2 of the counts for each sample
#different color for each line. And the add just says that only create
#a plot for the first sample and then the rest you should add.
for(i in 1:ncol(Y)){
  shist(Y[,i],unit=0.25,col=i,plotHist=FALSE,add=i!=1)
}</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-for-next-generation-sequencing-histogram.png" title="plot of chunk histogram" alt="plot of chunk histogram" width="384" /></p>
<p>If we get rid of the zeros (i.e., those with log2 value of -1), we can more easily see that shape of the distribution for the expressed genes:</p>
<pre class="r"><code>for(i in 1:ncol(Y)){
  idx <- Y[,i] > -1
  shist(Y[idx,i],unit=0.25,col=i,plotHist=FALSE,add=i!=1)
}</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-for-next-generation-sequencing-histogram-remove-zero.png" title="plot of chunk histogram-remove-zero" alt="plot of chunk histogram-remove-zero" width="384" /></p>
<p>Plotting two samples against each other shows the spreading of points at the low end of expression from the log transformation. This can also be seen with randomly generated Poisson data.</p>
<pre class="r"><code>#Remove rows with zero value and make the plot
idx <- rowSums(Y[,1:2]) > 0
plot(Y[idx,1], Y[idx,2], cex=.1)
rm <- rowMeans(2^Y[idx,1:2])
simulated1 <- rpois(length(idx), rm)
simulated2 <- rpois(length(idx), rm)
plot(log2(simulated1 + .5), log2(simulated2 + .5), cex=.1)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-for-next-generation-sequencing-plotting-two-samples1.png" title="plot of chunk plotting-two-samples" alt="plot of chunk plotting-two-samples" width="384" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-for-next-generation-sequencing-plotting-two-samples2.png" title="plot of chunk plotting-two-samples" alt="plot of chunk plotting-two-samples" width="384" /></p>
</div>
<div id="ma-plot" class="section level3">
<h3>MA plot</h3>
<p>The MA plot is again easier to look at, in that we don’t have to rotate our heads sideways by 45 degrees to see deviations from the diagonal. The MA plot, is the log ratio on the y-axis and the average of the log values on the x-axis. It gives us a lot of the same information as this scatter plot. But it is effectively rotated 45 degrees down.</p>
<pre class="r"><code>maplot(Y[idx,1],Y[idx,2])</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-for-next-generation-sequencing-unnamed-chunk-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" width="384" /></p>
</div>
</div>
<div id="licence" class="section level2">
<h2>Licence</h2>
<p><a href="https://www.sthda.com/english/english/wiki/mit-licence">Licence</a></p>
</div>
<div id="references" class="section level2">
<h2>References</h2>
<p><a href="https://github.com/genomicsclass">https://github.com/genomicsclass</a></p>
</div>

<script>jQuery(document).ready(function () {jQuery('h1,h2,h3,h4').addClass('formatter-title');});//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->
<!-- END HTML -->]]></description>
			<pubDate>Thu, 25 Sep 2014 20:18:09 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Mapping algorithms and softwares]]></title>
			<link>https://www.sthda.com/english/wiki/mapping-algorithms-and-softwares</link>
			<guid>https://www.sthda.com/english/wiki/mapping-algorithms-and-softwares</guid>
			<description><![CDATA[<!-- START HTML -->
 <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">
  <style>h1,h2,h3,h4{margin-top:20px; margin-bottom:10px;}</style>

<div id="TOC">
<ul>
<li><a href="#fastq-file">FASTQ file</a></li>
<li><a href="#mapping-algorithms-and-softwares">Mapping Algorithms and Softwares</a><ul>
<li><a href="#download-sequencing-data-and-extract-the-fasq-files">Download sequencing data and extract the FASQ files</a></li>
<li><a href="#align-rna-sequencing-data-using-tophat">Align RNA sequencing data using Tophat</a></li>
<li><a href="#use-bioinformatic-cluster-to-align-reads">Use bioinformatic cluster to align reads</a></li>
<li><a href="#tophat-result-files">Tophat result files</a></li>
</ul></li>
<li><a href="#licence">Licence</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>

<hr />
<pre class="warning"><code>This analysis was performed using R (ver. 3.1.0).</code></pre>
<div id="fastq-file" class="section level2">
<h2>FASTQ file</h2>
<p><strong>FASTQ file</strong>, is the result of a next-generation sequencing experiment. Every four lines, indicates a <strong>read</strong>. The <strong>first line</strong> gives the name of the read and some other information, including its length. The <strong>second line</strong> is the base pairs of the read. The <strong>fourth line</strong> contains information about the quality of each base pair. The <strong>third line</strong> links all the qualities to the first line.</p>
<p><strong>FASTQ Format</strong></p>
<p>A FASTQ file normally uses four lines per sequence.</p>
<ul>
<li>Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).</li>
<li>Line 2 is the raw sequence letters.</li>
<li>Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.</li>
<li>Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.</li>
</ul>
<p>A FASTQ file containing a single sequence might look like this:</p>
<pre><code>@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!&amp;#39;&amp;#39;*((((***+))%%%++)(%%%%).1***-+*&amp;#39;&amp;#39;))**55CCF>>>>>>CCCCCCC65</code></pre>
<p><a href="http://en.wikipedia.org/wiki/FASTQ_format">source : wiki</a></p>
</div>
<div id="mapping-algorithms-and-softwares" class="section level2">
<h2>Mapping Algorithms and Softwares</h2>
<p>Mapping software is also called alignment software. One of the most popular short read aligners is <strong>Bowtie</strong>. Help can be found on <a href="http://bowtie-bio.sourceforge.net">Bowtie website</a> or from linux command: <strong>bowtie –help</strong>.</p>
<div id="download-sequencing-data-and-extract-the-fasq-files" class="section level3">
<h3>Download sequencing data and extract the FASQ files</h3>
<p><strong>Download RNA sequencing reads from Short Read Archive (SRA)</strong> :</p>
<p>GSE52166, SRP032775, <a href="http://www.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1177756">http://www.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1177756</a>.</p>
<pre class="bash"><code>wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1177756/SRR1177756.sra</code></pre>
<p>We use the software <strong>fastq-dump</strong> to extract the FASTQ files from the .sra file :</p>
<pre class="bash"><code>fastq-dump --split-file-3 SRR1177756.sra
# view generated files with size
ls -lh *.fastq</code></pre>
<pre class="warning"><code>The option --split-file-3 is used for paired-end sequencing.</code></pre>
<pre class="success"><code>Two FastQ files are generated (SRR1177756_1.fastq, SRR1177756_2.fastq), because data is a paired-end sequencing. Each read is paired with the same name in each file. </code></pre>
</div>
<div id="align-rna-sequencing-data-using-tophat" class="section level3">
<h3>Align RNA sequencing data using Tophat</h3>
<p><strong>TopHat</strong> software is used to align RNA sequencing data. It takes care of the fact that the reads might span an intron. The TopHat software actually uses <strong>Bowtie</strong> internally to do part of the mapping.</p>
<p>To use TopHat, you need a Bowtie index, which is a <strong>precompiled version of the genome</strong>, which is efficiently built so that you can map these short reads against it. <strong>Bowtie index</strong> can be downloaded for different species on <a href="http://ccb.jhu.edu/software/tophat/igenomes.shtml">TopHat website</a>.</p>
<p><strong>Download reference genome from tophat website:</strong></p>
<pre class="bash"><code>wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/Ensembl/GRCh37/Homo_sapiens_Ensembl_GRCh37.tar.gz
tar zxvf Homo_sapiens_Ensembl_GRCh37.tar.gz</code></pre>
<p><strong>Running TopHat:</strong></p>
<p>Tophat needs the path to reference genome. This path might be different on your computer.</p>
<pre class="bash"><code>tophat2 -o SRR117756_out -p 10 genomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index SRR1177756_1.fastq  SRR1177756_2.fastq</code></pre>
<p>The 2 files indicate left and right paired-end reads.</p>
<pre class="notice"><code>In the call to tophat2, the option -o specifies the output directory, -p specifies the number of threads to use (this may affect run times and can vary depending on the resources available).</code></pre>
</div>
<div id="use-bioinformatic-cluster-to-align-reads" class="section level3">
<h3>Use bioinformatic cluster to align reads</h3>
<p><span class="warning"> A bioinformatic cluster is needed to process huge data generated by next generation sequencing. I use <a href="http://bioinfo.genotoul.fr/index.php">genotool plateform</a> to run my alignement experiment. The aim of the following section is to show, how to submit a job to genotoul. You can skip this section </span></p>
<p><strong>Submit job to genotoul</strong></p>
<p><strong>My working directory hierarchy:</strong></p>
<ul>
<li>work</li>
<li>akassambara
<ul>
<li>seq</li>
<li>genomes : contains reference genome</li>
<li>SRR117756_out</li>
<li>SRR1177756_1.fastq</li>
<li>SRR1177756_2.fastq</li>
<li>SRR1177756.sra</li>
<li>tophat_SRR1177756.sh</li>
</ul></li>
</ul>
<p><strong>Preparing my script (tophat_SRR1177756.sh) to run on genotool:</strong></p>
<pre class="bash"><code>#!/bin/bash
export BOWTIE2_INDEXES=/work/akassambara/seq/genomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index
tophat2 -o SRR117756_out -p 10 genomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome SRR1177756_1.fastq SRR1177756_2.fastq</code></pre>
<p><strong>Submit the job</strong></p>
<pre class="bash"><code>qsub -q workq -l h_vmem=100G -l mem=100G tophat_SRR1177756.sh </code></pre>
</div>
<div id="tophat-result-files" class="section level3">
<h3>Tophat result files</h3>
<p>TopHat generates a list of files which are located in SRR117756_out folder.</p>
<pre class="bash"><code>akassambara@genotoul ~/work/seq/SRR117756_out $ ls -lh
total 4,1G
-rw-r--r-- 1 akassambara U1040 2,2G 30 juil. 05:12 accepted_hits.bam
-rw-r--r-- 1 akassambara U1040 9,5M 30 juil. 04:57 deletions.bed
-rw-r--r-- 1 akassambara U1040 7,4M 30 juil. 04:57 insertions.bed
-rw-r--r-- 1 akassambara U1040  11M 30 juil. 04:57 junctions.bed
drwxr-xr-x 2 akassambara U1040  16K 30 juil. 05:12 logs
-rw-r--r-- 1 akassambara U1040  188 29 juil. 16:59 prep_reads.info
-rw-r--r-- 1 akassambara U1040 1,9G 30 juil. 05:24 unmapped.bam</code></pre>
<div id="accepted_hits.bam-file" class="section level4">
<h4>accepted_hits.bam file</h4>
<p>This is a compressed file. It contains the reads which could be successfully mapped to the genome. You can use the <strong>samtools</strong> view function to read this compressed file.</p>
<p>The following line code can be used to view the first 1000 lines. Use “return” to view more and type q to quit.</p>
<pre class="bash"><code>#Look at the first 1000 lines, and then use less.
#You have to type q to quit less command
samtools view accepted_hits.bam | head -1000 | less</code></pre>
<pre><code>SRR1177756.30816791     393     1       12006   0       101M    *       0       0       GCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGAT   BBBBF<BFFFFFFIFIFBFFFFFIIIFFFIIFBBFFIBFFFFFFBBFFFB7<BFFFFBBFFFFFB&amp;#39;<BBBBFFFBBB<<<B<BBBBBBBBBB7<BB07B7<   AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:43C57      YT:Z:UU NH:i:6  CC:Z:15 CP:i:102519064  HI:i:0
SRR1177756.42014053     73      1       12218   1       10M385N91M      *       0       0       ACCTAGGCCAGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGT   BBBFFFFFFFFFFFIIIIIIIIFIIIIIIIIFFIIIIIIIIIIIIIIIIIBFIIFFFIIIIIIIIIFFFFFFFBFBFFFFBFBFFFFFFFFFFFFBBBFB<   AS:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:101        NM:i:0  XS:A:+  NH:i:3  CC:Z:15 CP:i:102518467  HI:i:0</code></pre>
<p>This gives each read with the name of the read on the left, and where it was possible to align the read to the genome. For example, the read SRR1177756.30816791 was aligned to chromosome 1 at position 12,006. We could see that this whole read aligned (101 matches).</p>
<pre class="warning"><code>Down below, we see that there are reads where the whole read didn&amp;#39;t align. If we consider that we&amp;#39;re looking at RNA sequencing, and that some of these reads come from RNA pieces, which span junction in the genome, it makes sense that there would be parts of these which wouldn&amp;#39;t align to the genome. So instead of a match, we get an N, which indicates there was a gap: 10M385N91M = 10 matches then 385 nucleotide gaps then 91 matches. </code></pre>
<p>The following SAMtools calls can be used to process the BAM files.</p>
<pre class="bash"><code>samtools sort -n accepted_hits.bam accepted_hits_name_sorted
samtools sort -n unmapped.bam unmapped_name_sorted
samtools merge -n all_reads.bam accepted_hits_name_sorted unmapped_name_sorted</code></pre>
<p>The -n option indicates that I want to sort these reads, not by genomic location but, by name of the read. Sorted reads are put in the accepted_hits_name_sorted file.</p>
<p><strong>Sorted reads:</strong>
<img src="https://www.sthda.com/english/sthda/RDoc/images/mapping-sorted-reads.png" alt="mapping-sorted-reads" /></p>
<p>After sorting, you can see something interesting, which is that the fragment number 4 only has one read in this file. And also you can see that fragment number 1, and 6, and 7 are also not in this file. If we want to find those reads, we can look into the unmapped.bam.</p>
<pre class="success"><code>If we want a single file with all of the alignments in it, we can use the samtools merge command. </code></pre>
</div>
</div>
</div>
<div id="licence" class="section level2">
<h2>Licence</h2>
<p><a href="https://www.sthda.com/english/english/wiki/mit-licence">Licence</a></p>
</div>
<div id="references" class="section level2">
<h2>References</h2>
<p><a href="https://github.com/genomicsclass">https://github.com/genomicsclass</a></p>
</div>

<script>jQuery(document).ready(function () {jQuery('h1,h2,h3,h4').addClass('formatter-title');});//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->
<!-- END HTML -->]]></description>
			<pubDate>Thu, 25 Sep 2014 20:05:10 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[High-throughput Sequencing]]></title>
			<link>https://www.sthda.com/english/wiki/high-throughput-sequencing</link>
			<guid>https://www.sthda.com/english/wiki/high-throughput-sequencing</guid>
			<description><![CDATA[-=-]]></description>
			<pubDate>Thu, 25 Sep 2014 17:49:20 +0200</pubDate>
			
		</item>
		
	</channel>
</rss>
