Easy Guides

Tophat2 : Download, build reference genome and align the reads to the reference genome

Tue, 07 Oct 2014 13:43:21 +0200

Objectives
Download data
Download the reference genome
- Download reference genome from Ensembl
- Download reference genome from UCSC
Get gene model annotations
Build the reference index
Workflow
Inspect alignments with IGV

This analysis was performed using R (ver. 3.1.0).

Objectives

The aim of this article is to show you:

How to download a reference genome and build its index
How to align the reads to it using Tophat2
How to inspect the resulting BAM files using IGV

Download data

The data from Popovic et al. (GEO accession number: GSE57478) were used in the following example. The SRA files are available here : http://www.ncbi.nlm.nih.gov/sra?term=SRP032510.

Download the SRA files and convert them to FASTQ files. This is described here.

Download the reference genome

Create a directory called “genome/” to save your reference genome.

Note that bowtie2/tophat2 indices for many commonly used reference genomes can be downloaded directly from http://tophat.cbcb.umd.edu/igenomes.html.

A reference genome index for bowtie2/tophat2 can also be built from a FASTA file, as explained below.

You first have to download the reference genome sequence for the organism under study in (compressed) FASTA format. This can be done from the Ensembl and UCSC databases, among many others.

Ensembl FTP server : http://www.ensembl.org/info/data/ftp/index.html
UCSC FTP server : ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/

For Ensembl, choose the FASTA (DNA) link instead of FASTA (cDNA), since alignments to the genome, not the transcriptome, are desired.

Download reference genome from Ensembl

In this article, the Homo sapiens reference genome from the Ensembl database is used. For Homo sapiens, the file labeled toplevel combines all chromosomes. Download and uncompress the reference genome using the following UNIX commands :

wget ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
gunzip Homo_sapiens.GRCh38.dna.toplevel.fa.gz

Download reference genome from UCSC

Click here to download the data corresponding to your organism of interest
Click on “Homo sapiens” -> bigZips -> download the “chromFa.tar.gz”

UNIX command :

wget ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/bigZips/chromFa.tar.gz
tar -xzf chromFa.tar.gz

Get gene model annotations

Download a GTF file with gene models for the organism of interest.

Homo sapiens gene model annotation can be downloaded as follows:

wget ftp://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/Homo_sapiens.GRCh38.77.gtf.gz
gunzip Homo_sapiens.GRCh38.77.gtf.gz

Always download the FASTA reference sequence and the GTF annotation data from the same resource provider.
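
One practical reason for this advice is that providers name sequences differently (Ensembl uses 1, 2, X while UCSC uses chr1, chr2, chrX), and an annotation whose names do not match the genome cannot be used by the aligner. The following is a minimal sketch of a check, assuming uncompressed files. The helper name shared_seqnames and the tiny demo files are hypothetical and only stand in for the real genome and annotation.

```shell
# Print the sequence names that occur in both a FASTA file and a GTF file.
# shared_seqnames is a hypothetical helper, not part of any toolkit.
shared_seqnames() {
  fasta=$1
  gtf=$2
  fa_names=$(mktemp)
  gtf_names=$(mktemp)
  # FASTA headers start with '>'; the name is the first word after it.
  grep '^>' "$fasta" | sed 's/^>//' | cut -d' ' -f1 | sort -u > "$fa_names"
  # GTF column 1 holds the sequence name; lines starting with '#' are comments.
  grep -v '^#' "$gtf" | cut -f1 | sort -u > "$gtf_names"
  # comm -12 keeps only the lines common to both sorted lists.
  comm -12 "$fa_names" "$gtf_names"
  rm -f "$fa_names" "$gtf_names"
}

# Tiny demo files standing in for the real genome and annotation.
printf '>1 dna:chromosome\nACGT\n>2 dna:chromosome\nGGCC\n' > demo.fa
printf '#!genome-build demo\n1\tdemo\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n' > demo.gtf
shared_seqnames demo.fa demo.gtf
```

An empty result means that no sequence name is shared, which usually indicates that the two files come from different providers.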

Build the reference index

Before reads can be aligned, the reference FASTA files need to be preprocessed into an index that allows the aligner easy access.

To build a bowtie2-specific index from the FASTA file use the command:

bowtie2-build -f Homo_sapiens.GRCh38.dna.toplevel.fa Homo_sapiens_GRCh38

A set of BT2 files will be produced, with names starting with Homo_sapiens_GRCh38 as specified.

This procedure needs to be run only once for each reference genome used. As mentioned, pre-built indices for many commonly used genomes are available from http://tophat.cbcb.umd.edu/igenomes.html.

Workflow

Assess sequence quality control with ShortRead

From R, set your working directory (using the `setwd` command) to the directory where the FASTQ files are located.
Run the following R code :

library("ShortRead")
QC = qa(dirPath=".", pattern=".fastq$", type="fastq")
report(QC, type="html", dest="fastqQAreport")

The output is an HTML file located in the “fastqQAreport” directory.

Use a web browser to inspect the generated HTML file.

Prepare your sample metadata

# Initial SRA info table
sri<-read.csv("SraRunInfo.csv", stringsAsFactors=FALSE)
# Prepare sample information table
#Add important descriptive columns
samples=as.data.frame(
  list(
     name=c('KMS11_GSK343_1', 'KMS11_GSK343_2','TKO_GSK343_1', 'TKO_GSK343_2',
           'KMS11_GSK669_1', 'KMS11_GSK669_2', 'TKO_GSK669_1', 'TKO_GSK669_2'),
     fastq=paste(sri$Run, ".fastq", sep=""),
     avgLength=sri$avgLength,
     data_type=sri$LibraryStrategy,
     layout = sri$LibraryLayout,
     geoid=sri$SampleName,
     cell_line=c(1, 1, 2, 2, 1,1,2,2),
     treatment=c("ctr", "ctr", "ctr", "ctr", "ezh2i", "ezh2i", "ezh2i", "ezh2i")
    )
  )
rownames(samples)=samples$name
samples[sort(rownames(samples)), ]

##                          name            fastq avgLength data_type layout      geoid cell_line treatment
## KMS11_GSK343_1 KMS11_GSK343_1 SRR1282056.fastq        51   RNA-Seq SINGLE GSM1383539         1       ctr
## KMS11_GSK343_2 KMS11_GSK343_2 SRR1282057.fastq        51   RNA-Seq SINGLE GSM1383540         1       ctr
## KMS11_GSK669_1 KMS11_GSK669_1 SRR1282060.fastq        51   RNA-Seq SINGLE GSM1383543         1     ezh2i
## KMS11_GSK669_2 KMS11_GSK669_2 SRR1282061.fastq        51   RNA-Seq SINGLE GSM1383544         1     ezh2i
## TKO_GSK343_1     TKO_GSK343_1 SRR1282058.fastq        51   RNA-Seq SINGLE GSM1383541         2       ctr
## TKO_GSK343_2     TKO_GSK343_2 SRR1282059.fastq        51   RNA-Seq SINGLE GSM1383542         2       ctr
## TKO_GSK669_1     TKO_GSK669_1 SRR1282062.fastq        51   RNA-Seq SINGLE GSM1383545         2     ezh2i
## TKO_GSK669_2     TKO_GSK669_2 SRR1282063.fastq        51   RNA-Seq SINGLE GSM1383546         2     ezh2i

head(samples)

##                          name            fastq avgLength data_type layout      geoid cell_line treatment
## KMS11_GSK343_1 KMS11_GSK343_1 SRR1282056.fastq        51   RNA-Seq SINGLE GSM1383539         1       ctr
## KMS11_GSK343_2 KMS11_GSK343_2 SRR1282057.fastq        51   RNA-Seq SINGLE GSM1383540         1       ctr
## TKO_GSK343_1     TKO_GSK343_1 SRR1282058.fastq        51   RNA-Seq SINGLE GSM1383541         2       ctr
## TKO_GSK343_2     TKO_GSK343_2 SRR1282059.fastq        51   RNA-Seq SINGLE GSM1383542         2       ctr
## KMS11_GSK669_1 KMS11_GSK669_1 SRR1282060.fastq        51   RNA-Seq SINGLE GSM1383543         1     ezh2i
## KMS11_GSK669_2 KMS11_GSK669_2 SRR1282061.fastq        51   RNA-Seq SINGLE GSM1383544         1     ezh2i

Since the downstream statistical analysis of differential expression relies on this table, carefully inspect (and correct, if necessary) the metadata table.

Align the reads to reference genome using tophat2

Use R to create the list of shell commands :

genome="genome/Homo_sapiens_GRCh38"
cmd=with(samples, paste("tophat2 -o ", name, " -p 10  ",genome, " ", fastq, sep="" ))
for(c in cmd) system(c)

## tophat2 -o KMS11_GSK343_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282056.fastq 
## tophat2 -o KMS11_GSK343_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282057.fastq 
## tophat2 -o TKO_GSK343_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282058.fastq 
## tophat2 -o TKO_GSK343_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282059.fastq 
## tophat2 -o KMS11_GSK669_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282060.fastq 
## tophat2 -o KMS11_GSK669_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282061.fastq 
## tophat2 -o TKO_GSK669_1 -p 10  genome/Homo_sapiens_GRCh38 SRR1282062.fastq 
## tophat2 -o TKO_GSK669_2 -p 10  genome/Homo_sapiens_GRCh38 SRR1282063.fastq

In the call to tophat2, the option -o specifies the output directory and -p specifies the number of threads to use (this affects the run time; choose it according to the resources available). The first argument is the name of the index. The second argument is the list of FASTQ files for the sample.

Other parameters can be specified to tophat2 :

tophat2 -G Homo_sapiens.GRCh38.77.gtf -o output_dir -p 10 --no-coverage-search genome/Homo_sapiens_GRCh38 file.fastq

- The option -G points tophat2 to a GTF file of annotation to facilitate mapping reads across exon-exon junctions (some of which can be found de novo).
- The option --no-coverage-search turns off the coverage-search algorithm, which took too long here.

Note that the FASTQ files are concatenated with commas, without spaces. For experiments with paired-end reads, pairs of FASTQ files are given as separate arguments and the order in both arguments must match.
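
To make the comma-separated form concrete, the sketch below joins several FASTQ file names into one argument and prints (without running) the resulting tophat2 command. The file names, the output directory and the helper join_by_comma are hypothetical; only the comma convention comes from the text above.

```shell
# Join all arguments with commas, without spaces (hypothetical helper).
join_by_comma() {
  old_ifs=$IFS
  IFS=,
  joined="$*"      # "$*" joins the arguments with the first character of IFS
  IFS=$old_ifs
  printf '%s\n' "$joined"
}

# Single-end: all FASTQ files of one sample form ONE argument.
reads=$(join_by_comma lane1.fastq lane2.fastq lane3.fastq)
echo "tophat2 -o sample_out -p 8 genome/Homo_sapiens_GRCh38 $reads"

# Paired-end: one list per mate, with the files given in the same order.
left=$(join_by_comma lane1_1.fastq lane2_1.fastq)
right=$(join_by_comma lane1_2.fastq lane2_2.fastq)
echo "tophat2 -o sample_out -p 8 genome/Homo_sapiens_GRCh38 $left $right"
```

Printing the command first makes it easy to confirm that no space slipped into a list before a long alignment is started.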

Paired-end sequencing

tophat2 -o tophat_output_dir -p 8 --no-coverage-search /path/to/genome/Bowtie2Index/genome file_1.fastq file_2.fastq
samtools sort -n tophat_output_dir/accepted_hits.bam _sorted

Single-end sequencing

tophat2 -o tophat_output_dir -p 8 /path/to/genome/Bowtie2Index/genome file.fastq
samtools sort -n tophat_output_dir/accepted_hits.bam _sorted

Run the command

The commands can be executed by copying and pasting them into a UNIX terminal. The system function can be used to execute them directly from R. The list of commands can also be stored in a text file and run with the UNIX source command.
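
A minimal sketch of the text-file option follows. The file names align_commands.sh and align_commands.log are hypothetical, and the echo lines are harmless stand-ins for the real tophat2 calls. The file is run with sh rather than source, an assumption made here so that the commands run in a separate shell and their output can be logged.

```shell
# Write one command per line into a text file. The echo lines only print
# the tophat2 calls; remove the echo wrapper to run the real alignments.
cat > align_commands.sh <<'EOF'
echo "tophat2 -o KMS11_GSK343_1 -p 10 genome/Homo_sapiens_GRCh38 SRR1282056.fastq"
echo "tophat2 -o KMS11_GSK343_2 -p 10 genome/Homo_sapiens_GRCh38 SRR1282057.fastq"
EOF

# Run the file in a separate shell and keep a log of everything it prints.
sh align_commands.sh > align_commands.log 2>&1
cat align_commands.log
```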

Sort and index the BAM files and create SAM files

BAM files are organized into a single directory. SAM files are required when you want to count reads with htseq-count.

for(i in 1:nrow(samples)){
  lib=samples$name[i]
  bamFile=file.path(lib, "accepted_hits.bam")
  #sort by name and convert to SAM for htseq-count counting
  system(paste0("samtools sort -n ",bamFile," ",lib,"_sn")) #sort
  system(paste0("samtools view -o ",lib,"_sn.sam ",lib,"_sn.bam")) #convert to sam
 
 # sort by position and index for IGV
 system(paste0("samtools sort ",bamFile," ",lib,"_s"))
 system(paste0("samtools index ",lib,"_s.bam"))
}

The following commands are executed :

## samtools sort -n KMS11_GSK343_1/accepted_hits.bam KMS11_GSK343_1_sn 
## samtools view -o KMS11_GSK343_1_sn.sam KMS11_GSK343_1_sn.bam 
## samtools sort KMS11_GSK343_1/accepted_hits.bam KMS11_GSK343_1_s 
## samtools index KMS11_GSK343_1_s.bam 
## 
## samtools sort -n KMS11_GSK343_2/accepted_hits.bam KMS11_GSK343_2_sn 
## samtools view -o KMS11_GSK343_2_sn.sam KMS11_GSK343_2_sn.bam 
## samtools sort KMS11_GSK343_2/accepted_hits.bam KMS11_GSK343_2_s 
## samtools index KMS11_GSK343_2_s.bam 
## 
## samtools sort -n TKO_GSK343_1/accepted_hits.bam TKO_GSK343_1_sn 
## samtools view -o TKO_GSK343_1_sn.sam TKO_GSK343_1_sn.bam 
## samtools sort TKO_GSK343_1/accepted_hits.bam TKO_GSK343_1_s 
## samtools index TKO_GSK343_1_s.bam 
## 
## samtools sort -n TKO_GSK343_2/accepted_hits.bam TKO_GSK343_2_sn 
## samtools view -o TKO_GSK343_2_sn.sam TKO_GSK343_2_sn.bam 
## samtools sort TKO_GSK343_2/accepted_hits.bam TKO_GSK343_2_s 
## samtools index TKO_GSK343_2_s.bam 
## 
## samtools sort -n KMS11_GSK669_1/accepted_hits.bam KMS11_GSK669_1_sn 
## samtools view -o KMS11_GSK669_1_sn.sam KMS11_GSK669_1_sn.bam 
## samtools sort KMS11_GSK669_1/accepted_hits.bam KMS11_GSK669_1_s 
## samtools index KMS11_GSK669_1_s.bam 
## 
## samtools sort -n KMS11_GSK669_2/accepted_hits.bam KMS11_GSK669_2_sn 
## samtools view -o KMS11_GSK669_2_sn.sam KMS11_GSK669_2_sn.bam 
## samtools sort KMS11_GSK669_2/accepted_hits.bam KMS11_GSK669_2_s 
## samtools index KMS11_GSK669_2_s.bam 
## 
## samtools sort -n TKO_GSK669_1/accepted_hits.bam TKO_GSK669_1_sn 
## samtools view -o TKO_GSK669_1_sn.sam TKO_GSK669_1_sn.bam 
## samtools sort TKO_GSK669_1/accepted_hits.bam TKO_GSK669_1_s 
## samtools index TKO_GSK669_1_s.bam 
## 
## samtools sort -n TKO_GSK669_2/accepted_hits.bam TKO_GSK669_2_sn 
## samtools view -o TKO_GSK669_2_sn.sam TKO_GSK669_2_sn.bam 
## samtools sort TKO_GSK669_2/accepted_hits.bam TKO_GSK669_2_s 
## samtools index TKO_GSK669_2_s.bam

For each original accepted_hits.bam file, sorted-by-name SAM and BAM files (for htseq-count), as well as a sorted-by-chromosome-position BAM file (for IGV) are created.
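
The calls above use the older samtools sort form, in which the second argument is an output prefix. More recent samtools releases (1.3 and later, as far as we know) expect the output file through the -o option instead. The sketch below only prints the equivalent commands for a sample so that they can be checked before being run; print_sort_commands is a hypothetical helper.

```shell
# Print (not run) the sort/convert/index commands for one sample,
# using the -o form of samtools sort. Hypothetical helper.
print_sort_commands() {
  lib=$1
  bam="$lib/accepted_hits.bam"
  # sort by name and convert to SAM for htseq-count counting
  echo "samtools sort -n -o ${lib}_sn.bam $bam"
  echo "samtools view -o ${lib}_sn.sam ${lib}_sn.bam"
  # sort by position and index for IGV
  echo "samtools sort -o ${lib}_s.bam $bam"
  echo "samtools index ${lib}_s.bam"
}

for sample in KMS11_GSK343_1 TKO_GSK343_1; do
  print_sort_commands "$sample"
done
```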

Inspect alignments with IGV

An extensive explanation is available at this link : igv-integrative-genomics-viewer.

IGV can be downloaded here : http://www.broadinstitute.org/igv/

Briefly:
- Start IGV
- Select the correct genome (Human)
- Load the BAM file and the GTF file
- Search for a gene of interest and zoom in

References: http://master.bioconductor.org/help/course-materials/2013/CSAMA2013/

Read counting - NGS

Mon, 29 Sep 2014 15:50:23 +0200

Introduction
Load transcript database
Load bam files
Read counting in bam files
Exploratory data analysis of the counts:
Footnotes
- Methods for counting reads which overlap features
Licence
References

This analysis was performed using R (ver. 3.1.0).

Introduction

We will describe how to count next generation sequencing reads that fall into genomic features. The result is a count matrix whose rows correspond to genomic ranges and whose columns correspond to different experiments or samples.

As an example, we will use an RNA-Seq experiment, with files in the `pasillaBamSubset` Bioconductor data package. However, the same functions can be used for DNA-Seq and ChIP-Seq.

Load transcript database

#source("http://bioconductor.org/biocLite.R")
#biocLite("pasillaBamSubset")
#biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)

We load a transcript database object. These are prebuilt in R for various well studied organisms, for example TxDb.Hsapiens.UCSC.hg19.knownGene. In addition, the makeTranscriptDbFromGFF function can be used to import GFF or GTF gene models. We use the exonsBy function to get a GRangesList object of the exons for each gene.

#Rename the transcript database
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
#For each gene pull up the exons
grl <- exonsBy(txdb, by="gene")
#exons of the 100th gene
grl[100]

## GRangesList of length 1:
## $FBgn0000286 
## GRanges with 8 ranges and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr2L [4876890, 4879196]      - |      8515        <NA>
##   [2]    chr2L [4877289, 4879196]      - |      8516        <NA>
##   [3]    chr2L [4880294, 4880472]      - |      8517        <NA>
##   [4]    chr2L [4880378, 4880472]      - |      8518        <NA>
##   [5]    chr2L [4881215, 4882492]      - |      8519        <NA>
##   [6]    chr2L [4882865, 4883113]      - |      8520        <NA>
##   [7]    chr2L [4882889, 4883113]      - |      8521        <NA>
##   [8]    chr2L [4882889, 4883341]      - |      8522        <NA>
## 
## ---
## seqlengths:
##      chr2L     chr2R     chr3L     chr3R      chr4      chrX ...  chr3LHet  chr3RHet   chrXHet   chrYHet chrUextra
##   23011544  21146708  24543557  27905053   1351857  22422827 ...   2555491   2517507    204112    347038  29004656

grl[[100]]

## GRanges with 8 ranges and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr2L [4876890, 4879196]      - |      8515        <NA>
##   [2]    chr2L [4877289, 4879196]      - |      8516        <NA>
##   [3]    chr2L [4880294, 4880472]      - |      8517        <NA>
##   [4]    chr2L [4880378, 4880472]      - |      8518        <NA>
##   [5]    chr2L [4881215, 4882492]      - |      8519        <NA>
##   [6]    chr2L [4882865, 4883113]      - |      8520        <NA>
##   [7]    chr2L [4882889, 4883113]      - |      8521        <NA>
##   [8]    chr2L [4882889, 4883341]      - |      8522        <NA>
##   ---
##   seqlengths:
##        chr2L     chr2R     chr3L     chr3R      chr4      chrX ...  chr3LHet  chr3RHet   chrXHet   chrYHet chrUextra
##     23011544  21146708  24543557  27905053   1351857  22422827 ...   2555491   2517507    204112    347038  29004656

grl[[100]][1]#first exon of the 100th gene

## GRanges with 1 range and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr2L [4876890, 4879196]      - |      8515        <NA>
##   ---
##   seqlengths:
##        chr2L     chr2R     chr3L     chr3R      chr4      chrX ...  chr3LHet  chr3RHet   chrXHet   chrYHet chrUextra
##     23011544  21146708  24543557  27905053   1351857  22422827 ...   2555491   2517507    204112    347038  29004656

Load bam files

These functions in the pasillaBamSubset package just point us to the BAM files.

fl1 <- untreated1_chr4()
fl2 <- untreated3_chr4()
fl1

## [1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/pasillaBamSubset/extdata/untreated1_chr4.bam"

Read counting in bam files

We need the following libraries for counting reads in BAM files.

library(Rsamtools)
library(GenomicRanges)

Note: if you are using Bioconductor version 2.14, paired with R 3.1, you should also load the GenomicAlignments library. You do not need to load it, and it will not be available to you, if you are using Bioconductor version 2.13, paired with R 3.0.x.

#Required for Bioc 14 and R 3.1
library(GenomicAlignments)

We specify the files using the BamFileList function. This allows us to tell the read counting functions how many reads to load at once. For larger files, a yield size of 1 million reads might make sense.

The yield size is the number of reads pulled from each BAM file at a time.

fls <- BamFileList(c(fl1, fl2), yieldSize=5e4)
names(fls) <- c("first","second")

The following function counts the reads in the BAM files that overlap the features, which are the genes of Drosophila. We tell the counting function to ignore the strand, i.e., to allow minus strand reads to count in plus strand genes, and vice versa.

so1 <- summarizeOverlaps(features=grl,
                         reads=fls,
                         ignore.strand=TRUE)

#summarized experiment
#The rowData contains information about the features
#and the colData contains information about the files which we specified.
so1

## class: SummarizedExperiment 
## dim: 15682 2 
## exptData(0):
## assays(1): counts
## rownames(15682): FBgn0000003 FBgn0000008 ... FBgn0264726 FBgn0264727
## rowData metadata column names(0):
## colnames(2): first second
## colData names(0):

Other important parameters of summarizeOverlaps are :

- inter.feature : default is TRUE. We are only interested in counting reads which uniquely align to one feature.
- singleEnd : default is TRUE, which says that these files have single-end reads instead of paired-end reads.
- fragments : if you are counting reads in a paired-end experiment, it specifies that you also want to count reads where only one of the two reads in a pair aligns.

We can examine the count matrix, which is stored in the assay slot:

head(assay(so1))

##             first second
## FBgn0000003     0      0
## FBgn0000008     0      0
## FBgn0000014     0      0
## FBgn0000015     0      0
## FBgn0000017     0      0
## FBgn0000018     0      0

#sum of the count matrix
#number of reads which aligned-- uniquely-- to these features.
colSums(assay(so1))

##  first second 
## 156469 122872

#Information about the features (genes)
rowData(so1)

## GRangesList of length 15682:
## $FBgn0000003 
## GRanges with 1 range and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr3R [2648220, 2648518]      + |     45123        <NA>
## 
## $FBgn0000008 
## GRanges with 13 ranges and 2 metadata columns:
##        seqnames               ranges strand   | exon_id exon_name
##    [1]    chr2R [18024494, 18024531]      +   |   20314      
##    [2]    chr2R [18024496, 18024713]      +   |   20315      
##    [3]    chr2R [18024938, 18025756]      +   |   20316      
##    [4]    chr2R [18025505, 18025756]      +   |   20317      
##    [5]    chr2R [18039159, 18039200]      +   |   20322      
##    ...      ...                  ...    ... ...     ...       ...
##    [9]    chr2R [18058283, 18059490]      +   |   20326      
##   [10]    chr2R [18059587, 18059757]      +   |   20327      
##   [11]    chr2R [18059821, 18059938]      +   |   20328      
##   [12]    chr2R [18060002, 18060339]      +   |   20329      
##   [13]    chr2R [18060002, 18060346]      +   |   20330      
## 
## ...
## <15680 more elements>
## ---
## seqlengths:
##      chr2L     chr2R     chr3L     chr3R      chr4      chrX ...  chr3LHet  chr3RHet   chrXHet   chrYHet chrUextra
##   23011544  21146708  24543557  27905053   1351857  22422827 ...   2555491   2517507    204112    347038  29004656

#Information about the samples (here it's empty)
colData(so1)

## DataFrame with 2 rows and 0 columns

#add information (sample columns)
colData(so1)$sample <- c("one","two")
colData(so1)

## DataFrame with 2 rows and 1 column
##             sample
##        <character>
## first          one
## second         two

#metadata information on the features
#It tells you how this row data was generated.
#It was generated from a TranscriptDb using the GenomicFeatures package.
#The data source was UCSC, the genome was dm3, the organism was Drosophila, etc.
#Pretty useful for reproducibility of count tables.
metadata(rowData(so1))

## $genomeInfo
## $genomeInfo$`Db type`
## [1] "TranscriptDb"
## 
## $genomeInfo$`Supporting package`
## [1] "GenomicFeatures"
## 
## $genomeInfo$`Data source`
## [1] "UCSC"
## 
## $genomeInfo$Genome
## [1] "dm3"
## 
## $genomeInfo$Organism
## [1] "Drosophila melanogaster"
## 
## $genomeInfo$`UCSC Table`
## [1] "ensGene"
## 
## $genomeInfo$`Resource URL`
## [1] "http://genome.ucsc.edu/"
## 
## $genomeInfo$`Type of Gene ID`
## [1] "Ensembl gene ID"
## 
## $genomeInfo$`Full dataset`
## [1] "yes"
## 
## $genomeInfo$`miRBase build ID`
## [1] NA
## 
## $genomeInfo$transcript_nrow
## [1] "29173"
## 
## $genomeInfo$exon_nrow
## [1] "76920"
## 
## $genomeInfo$cds_nrow
## [1] "62135"
## 
## $genomeInfo$`Db created by`
## [1] "GenomicFeatures package from Bioconductor"
## 
## $genomeInfo$`Creation time`
## [1] "2014-03-17 16:24:54 -0700 (Mon, 17 Mar 2014)"
## 
## $genomeInfo$`GenomicFeatures version at creation time`
## [1] "1.15.11"
## 
## $genomeInfo$`RSQLite version at creation time`
## [1] "0.11.4"
## 
## $genomeInfo$DBSCHEMAVERSION
## [1] "1.0"

Exploratory data analysis of the counts:

x <- assay(so1)[,1]
hist(x[x > 0], col="grey")
#We can see that there's at least one very, very large count here (more than 40,000).
#If we exclude counts over 10,000, we can see a bit more of the distribution.
hist(x[x > 0 & x < 10000], col="grey")
#Scatterplot of the 2 samples in log scale
plot(assay(so1) + 1, log="xy")# +1 to avoid log (0)

The second file should actually be counted in a special manner, as it contains pairs of reads which come from a single fragment. We do not want to count these twice, so we set singleEnd = FALSE. Additionally, we specify fragments = TRUE, which counts a read if only one read of the pair aligns to the features and the other read aligns to no feature.

# ?untreated3_chr4
# ?summarizeOverlaps
fls <- BamFileList(fl2, yieldSize=5e4)
so2 <- summarizeOverlaps(features=grl,
                         reads=fls,
                         ignore.strand=TRUE,
                         singleEnd=FALSE, 
                         fragments=TRUE)#if only one of a pair of reads maps, we also want to count this.
#We can look at the number of reads which align this time.
#Note that last time, when we counted each read alone, we had about two times more reads:
#roughly 120,000 versus 60,000.
colSums(assay(so2))

## untreated3_chr4.bam 
##               65591

colSums(assay(so1))

##  first second 
## 156469 122872

plot(assay(so1)[,2], assay(so2)[,1], xlim=c(0,5000), ylim=c(0,5000),
     xlab="single end counting", ylab="paired end counting")
abline(0,1)
abline(0,.5)

Scatter plot of the counts obtained when each read is counted singly versus when pairs (fragments) are counted. Compared with the y = x line and the y = x/2 line, the paired-end counts are about half of the single-end counts, because single-end counting counts each read instead of each fragment.

Footnotes

Methods for counting reads which overlap features

Bioconductor packages:

summarizeOverlaps in the GenomicAlignments package

http://www.bioconductor.org/packages/devel/bioc/html/GenomicAlignments.html

featureCounts in the Rsubread package

Liao Y, Smyth GK, Shi W., “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics. 2014 http://www.ncbi.nlm.nih.gov/pubmed/24227677 http://bioinf.wehi.edu.au/featureCounts/

easyRNAseq package

Delhomme N, Padioleau I, Furlong EE, Steinmetz LM. “easyRNASeq: a bioconductor package for processing RNA-Seq data.” Bioinformatics. 2012. http://www.ncbi.nlm.nih.gov/pubmed/22847932 http://www.bioconductor.org/packages/release/bioc/html/easyRNASeq.html

Command line tools:

htseq-count, a program in the htseq Python package

Simon Anders, Paul Theodor Pyl, Wolfgang Huber. HTSeq — A Python framework to work with high-throughput sequencing data bioRxiv preprint (2014), doi: 10.1101/002824

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

bedtools https://code.google.com/p/bedtools/
bedops https://code.google.com/p/bedops/

Licence

References

https://github.com/genomicsclass

Exploratory data analysis for next generation sequencing

Thu, 25 Sep 2014 20:18:09 +0200

Download data
Exploratory Data Analysis for NGS
- Histogram
- MA plot
Licence
References

This analysis was performed using R (ver. 3.1.0).

Download data

We’ll use data from Bottomly et al., who sequenced two strains of mouse with many biological replicates. This dataset and a number of other sequencing datasets have been compiled from raw data into read count tables by Frazee, Langmead, and Leek as part of the ReCount project. These datasets are made publicly available at the following website: http://bowtie-bio.sourceforge.net/recount/.

We can make similar figures for NGS to the ones shown in the previous sections for microarray data. However, the log transform does not work directly because RNA-seq data contain many 0s. One quick way to get around this is to add a constant before taking the log. A typical choice is 0.5, which gives a log2 value of -1 for 0s.
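
As a quick check of this arithmetic outside R, the sketch below applies log2(count + 0.5) to a few made-up counts using awk. Since awk only offers the natural logarithm, the result is divided by log(2). The counts and the helper name log2_plus_half are illustrative and not taken from the Bottomly data.

```shell
# Read one count per line and print the count and log2(count + 0.5),
# tab-separated. Hypothetical helper for illustration only.
log2_plus_half() {
  awk '{ printf "%d\t%.3f\n", $1, log($1 + 0.5) / log(2) }'
}

# A count of 0 becomes log2(0.5), which is -1.000.
printf '0\n1\n10\n1000\n' | log2_plus_half
```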

#Download data
destfile="~/hubiC/Documents/R/doc/english/genomics/bottomly_eset.RData"
if (!file.exists(destfile)) download.file("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData", destfile)
#Load the expression eset
load(destfile)
bottomly.eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 36536 features, 21 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: SRX033480 SRX033488 ... SRX033494 (21 total)
##   varLabels: sample.id num.tech.reps ... lane.number (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSMUSG00000000001 ENSMUSG00000000003 ...
##     ENSMUSG00000090268 (36536 total)
##   fvarLabels: gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

#The counts for each sample of reads which align to the first gene.
library("Biobase")
exprs(bottomly.eset)[1,]

## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482 SRX033490 SRX033483 
##       369       744       287       769       348       803       433 
## SRX033476 SRX033478 SRX033479 SRX033472 SRX033473 SRX033474 SRX033475 
##       469       585       321       301       461       309       374 
## SRX033491 SRX033484 SRX033492 SRX033485 SRX033493 SRX033486 SRX033494 
##       781       555       820       294       758       419       857

#Phenotypic data
pData(bottomly.eset)

##           sample.id num.tech.reps   strain experiment.number lane.number
## SRX033480 SRX033480             1 C57BL/6J                 6           1
## SRX033488 SRX033488             1 C57BL/6J                 7           1
## SRX033481 SRX033481             1 C57BL/6J                 6           2
## SRX033489 SRX033489             1 C57BL/6J                 7           2
## SRX033482 SRX033482             1 C57BL/6J                 6           3
## SRX033490 SRX033490             1 C57BL/6J                 7           3
## SRX033483 SRX033483             1 C57BL/6J                 6           5
## SRX033476 SRX033476             1 C57BL/6J                 4           6
## SRX033478 SRX033478             1 C57BL/6J                 4           7
## SRX033479 SRX033479             1 C57BL/6J                 4           8
## SRX033472 SRX033472             1   DBA/2J                 4           1
## SRX033473 SRX033473             1   DBA/2J                 4           2
## SRX033474 SRX033474             1   DBA/2J                 4           3
## SRX033475 SRX033475             1   DBA/2J                 4           5
## SRX033491 SRX033491             1   DBA/2J                 7           5
## SRX033484 SRX033484             1   DBA/2J                 6           6
## SRX033492 SRX033492             1   DBA/2J                 7           6
## SRX033485 SRX033485             1   DBA/2J                 6           7
## SRX033493 SRX033493             1   DBA/2J                 7           7
## SRX033486 SRX033486             1   DBA/2J                 6           8
## SRX033494 SRX033494             1   DBA/2J                 7           8

Exploratory Data Analysis for NGS

Something which distinguishes sequencing experiments from microarray experiments is that, in sequencing datasets, a number of features have a value of exactly zero for a given sample, meaning that zero reads aligned to that feature for that sample.

Histogram

For every sample, I’m going to make a smooth histogram of these log2(count + 0.5) values.

#Log transformation
Y <- log2(exprs(bottomly.eset) + 0.5)
# library(devtools)
# install_github("rafalib","ririzarr")
library("rafalib")
#smooth histogram of the log2 of the counts for each sample,
#with a different color for each line. The add argument creates a new
#plot for the first sample only; the remaining samples are added to it.
for(i in 1:ncol(Y)){
  shist(Y[,i],unit=0.25,col=i,plotHist=FALSE,add=i!=1)
}

If we get rid of the zeros (i.e., those with log2 value of -1), we can more easily see the shape of the distribution for the expressed genes:

for(i in 1:ncol(Y)){
  idx <- Y[,i] > -1
  shist(Y[idx,i],unit=0.25,col=i,plotHist=FALSE,add=i!=1)
}

Plotting two samples against each other shows the spreading of points at the low end of expression from the log transformation. This can also be seen with randomly generated Poisson data.

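#Keep rows with a positive sum of log2 values (this drops genes with zero counts in both samples) and make the plot
idx <- rowSums(Y[,1:2]) > 0
plot(Y[idx,1], Y[idx,2], cex=.1)
#Mean of the two samples for each kept gene, used as the Poisson rate
rm <- rowMeans(2^Y[idx,1:2])
#Simulate one Poisson value per kept gene (length(rm) rather than length(idx))
simulated1 <- rpois(length(rm), rm)
simulated2 <- rpois(length(rm), rm)
plot(log2(simulated1 + .5), log2(simulated2 + .5), cex=.1)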

MA plot

The MA plot is again easier to look at, in that we don’t have to rotate our heads sideways by 45 degrees to see deviations from the diagonal. The MA plot shows the log ratio on the y-axis and the average of the log values on the x-axis. It gives us much of the same information as the scatter plot, but it is effectively rotated 45 degrees down.

maplot(Y[idx,1],Y[idx,2])
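
To spell out what is plotted, the sketch below computes A (the average of the two log values) and M (their difference, that is, the log ratio) for a few made-up pairs of log2 values. The sign convention M = y - x is an assumption made here; the opposite sign only flips the plot vertically. The helper name ma_values is hypothetical.

```shell
# Input: two tab-separated log2 values per line (x, then y).
# Output: A = (x + y) / 2 and M = y - x, tab-separated.
ma_values() {
  awk -F'\t' '{ printf "%.2f\t%.2f\n", ($1 + $2) / 2, $2 - $1 }'
}

# Three made-up genes: one higher in y, one unchanged, one zero in x.
printf '3.0\t5.0\n8.0\t8.0\n-1.0\t2.0\n' | ma_values
```

A gene with the same value in both samples has M = 0, so unchanged genes line up on a horizontal line instead of on the diagonal.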

Licence

References

https://github.com/genomicsclass

Mapping algorithms and softwares

Thu, 25 Sep 2014 20:05:10 +0200

FASTQ file
Mapping Algorithms and Softwares
Licence
References

This analysis was performed using R (ver. 3.1.0).

FASTQ file

A FASTQ file is the result of a next-generation sequencing experiment. Every four lines describe one read. The first line gives the name of the read and some other information, which may include its length. The second line is the sequence of bases of the read. The third line is a separator that may repeat the name given on the first line. The fourth line contains the quality of each base.

FASTQ Format

A FASTQ file normally uses four lines per sequence.

Line 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

A FASTQ file containing a single sequence might look like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Source: Wikipedia
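
The four-line layout is easy to check mechanically. The sketch below validates the record shown above and decodes a quality string into numeric scores. The function names fastq_check and phred33 are hypothetical, and the decoding assumes the Phred+33 (Sanger) encoding, in which the character ! stands for quality 0; older Illumina files may use a different offset.

```shell
# fastq_check: read FASTQ text on stdin and count the records that are
# well formed (line 1 starts with "@", line 3 starts with "+", and line 4
# has as many symbols as line 2).
fastq_check() {
  awk '
    NR % 4 == 1 { ok = (substr($0, 1, 1) == "@") }
    NR % 4 == 2 { seq_len = length($0) }
    NR % 4 == 3 { ok = ok && (substr($0, 1, 1) == "+") }
    NR % 4 == 0 { total++; if (ok && length($0) == seq_len) good++ }
    END { printf "records=%d valid=%d\n", total, good }
  '
}

# phred33 QUAL: decode a quality string into space-separated scores.
# Assumption: Phred+33 encoding (ASCII code minus 33).
phred33() {
  printf '%s\n' "$1" | awk '
    BEGIN { for (i = 33; i <= 126; i++) code[sprintf("%c", i)] = i }
    {
      out = ""
      for (i = 1; i <= length($0); i++) {
        q = code[substr($0, i, 1)] - 33
        out = (i == 1) ? q : out " " q
      }
      print out
    }
  '
}

# The single-record example from the text.
fastq_check <<'EOF'
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
EOF

# A short made-up quality string.
phred33 '!5I'
```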

Mapping Algorithms and Software

Mapping software is also called alignment software. One of the most popular short-read aligners is Bowtie. Help can be found on the Bowtie website or from the Linux command line: bowtie --help.

Download sequencing data and extract the FASTQ files

Download RNA sequencing reads from the Sequence Read Archive (SRA):

GSE52166, SRP032775, http://www.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1177756.

wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR117/SRR1177756/SRR1177756.sra

We use the fastq-dump program to extract the FASTQ files from the .sra file:

fastq-dump --split-3 SRR1177756.sra
# view generated files with size
ls -lh *.fastq

The option --split-3 is used for paired-end sequencing: it writes the two mates of each pair to separate files.

Two FASTQ files are generated (SRR1177756_1.fastq, SRR1177756_2.fastq), because the data come from paired-end sequencing. The two reads of a pair carry the same name in each file.
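
Because aligners match mates by their position in the two files, it can be worth confirming that the read names line up. The following is a minimal sketch, not part of the original workflow. The function name check_pairs and the toy reads are made up, and it assumes that both files list the reads in the same order and that the read name is the first whitespace-delimited token of each header line, optionally ending in /1 or /2.

```shell
# check_pairs R1 R2: walk through two FASTQ files in parallel and count
# the records whose read names differ. Lines that exist in R1 but not in
# R2 are reported as missing; extra lines in R2 are not detected.
check_pairs() {
  awk -v mate="$2" '
    {
      if ((getline other < mate) <= 0) { missing++; next }
      if (NR % 4 == 1) {
        split($0, a, " "); split(other, b, " ")
        sub(/\/[12]$/, "", a[1]); sub(/\/[12]$/, "", b[1])
        if (a[1] != b[1]) bad++
      }
    }
    END { printf "mismatched=%d missing_lines=%d\n", bad, missing }
  ' "$1"
}

# Toy files with one made-up read pair each.
tmp=$(mktemp -d)
printf '@SRR1177756.1 length=4\nACGT\n+\nIIII\n' > "$tmp/r1.fastq"
printf '@SRR1177756.1 length=4\nTTGA\n+\nIIII\n' > "$tmp/r2.fastq"
check_pairs "$tmp/r1.fastq" "$tmp/r2.fastq"
rm -r "$tmp"
```

On the real data the call would be check_pairs SRR1177756_1.fastq SRR1177756_2.fastq.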

Align RNA sequencing data using TopHat

TopHat software is used to align RNA sequencing data. It takes care of the fact that the reads might span an intron. The TopHat software actually uses Bowtie internally to do part of the mapping.

To use TopHat, you need a Bowtie index, a precompiled version of the genome built so that short reads can be mapped against it efficiently. Bowtie indexes for different species can be downloaded from the TopHat website.

Download the reference genome from the TopHat website:

wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/Ensembl/GRCh37/Homo_sapiens_Ensembl_GRCh37.tar.gz
tar zxvf Homo_sapiens_Ensembl_GRCh37.tar.gz

Running TopHat:

TopHat needs the path to the reference genome index, given as the base name of the index files (here genome, inside the Bowtie2Index directory). This path might be different on your computer.

tophat2 -o SRR117756_out -p 10 genomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome SRR1177756_1.fastq SRR1177756_2.fastq

The two FASTQ files hold the left and right reads of each pair.

In the call to tophat2, the option -o specifies the output directory and -p specifies the number of threads to use (the best value depends on the resources available and affects the run time).
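
TopHat expects the base name of the Bowtie2 index files rather than the directory that contains them, and a wrong path is a common cause of failed runs. The sketch below checks for the six files of a Bowtie2 index before a job is launched. The function name has_bt2_index is hypothetical, and the check assumes a standard small index with the .bt2 extension (very large genomes use .bt2l instead).

```shell
# has_bt2_index BASE: succeed only if all six Bowtie2 index files exist.
# BASE is the index path without the ".1.bt2" suffix,
# for example ".../Bowtie2Index/genome".
has_bt2_index() {
  for suffix in 1 2 3 4 rev.1 rev.2; do
    [ -f "$1.$suffix.bt2" ] || return 1
  done
  return 0
}

# Path taken from the text; it may differ on your computer.
index=genomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome
if has_bt2_index "$index"; then
  echo "index found: $index"
else
  echo "index not found: $index"
fi
```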

Use a bioinformatics cluster to align reads

A bioinformatics cluster is needed to process the large volumes of data generated by next-generation sequencing. I use the genotoul platform to run my alignments. The following section shows how to submit a job to genotoul. You can skip this section.

Submit job to genotoul

My working directory hierarchy:

work
akassambara
- seq
- genomes : contains reference genome
- SRR117756_out
- SRR1177756_1.fastq
- SRR1177756_2.fastq
- SRR1177756.sra
- tophat_SRR1177756.sh

Preparing my script (tophat_SRR1177756.sh) to run on genotoul:

#!/bin/bash
export BOWTIE2_INDEXES=/work/akassambara/seq/genomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index
tophat2 -o SRR117756_out -p 10 genomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome SRR1177756_1.fastq SRR1177756_2.fastq

Submit the job

qsub -q workq -l h_vmem=100G -l mem=100G tophat_SRR1177756.sh

Tophat result files

TopHat generates a set of files, which are located in the SRR117756_out folder.

akassambara@genotoul ~/work/seq/SRR117756_out $ ls -lh
total 4,1G
-rw-r--r-- 1 akassambara U1040 2,2G 30 juil. 05:12 accepted_hits.bam
-rw-r--r-- 1 akassambara U1040 9,5M 30 juil. 04:57 deletions.bed
-rw-r--r-- 1 akassambara U1040 7,4M 30 juil. 04:57 insertions.bed
-rw-r--r-- 1 akassambara U1040  11M 30 juil. 04:57 junctions.bed
drwxr-xr-x 2 akassambara U1040  16K 30 juil. 05:12 logs
-rw-r--r-- 1 akassambara U1040  188 29 juil. 16:59 prep_reads.info
-rw-r--r-- 1 akassambara U1040 1,9G 30 juil. 05:24 unmapped.bam

accepted_hits.bam file

This is a compressed binary file. It contains the reads that were successfully mapped to the genome. You can use the samtools view command to read it.

The following command displays the first 1000 lines. Press Return to see more and type q to quit.

#Look at the first 1000 lines, and then use less.
#You have to type q to quit less command
samtools view accepted_hits.bam | head -1000 | less

SRR1177756.30816791     393     1       12006   0       101M    *       0       0       GCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGAT   BBBBF


Each line describes one read: the name of the read comes first, followed by where it could be aligned to the genome. For example, the read SRR1177756.30816791 was aligned to chromosome 1 at position 12,006. The CIGAR string 101M shows that the whole read aligned (101 matches).

Further down, there are reads that did not align in one piece. This is RNA sequencing, so some reads come from RNA fragments that span a splice junction in the genome, and it makes sense that such a read cannot align contiguously. Instead of a single match, the CIGAR string then contains an N, which indicates a gap: 10M385N91M = 10 matches, then a gap of 385 nucleotides, then 91 matches.
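
Two columns of a SAM record reward a closer look: the flag (second column, 393 in the record above) and the CIGAR string (sixth column). The sketch below decodes both. The function names decode_flag and cigar_summary are hypothetical; the bit meanings and the CIGAR operators follow the SAM format specification.

```shell
# decode_flag FLAG: list the SAM flag bits that are set.
decode_flag() {
  flag=$1
  out=""
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
              16:reverse 32:mate_reverse 64:first_in_pair \
              128:second_in_pair 256:secondary 512:qc_fail \
              1024:duplicate 2048:supplementary; do
    bit=${pair%%:*}
    name=${pair#*:}
    if [ $(( flag & bit )) -ne 0 ]; then
      out="${out:+$out }$name"
    fi
  done
  printf '%s\n' "$out"
}

# cigar_summary CIGAR: print the number of read bases covered (query),
# the span on the reference (ref_span) and the total length of N gaps.
cigar_summary() {
  printf '%s\n' "$1" | awk '{
    s = $0; q = 0; r = 0; n = 0
    while (match(s, /^[0-9]+[MIDNSHP=X]/)) {
      len = substr(s, 1, RLENGTH - 1) + 0
      op = substr(s, RLENGTH, 1)
      if (op ~ /[MIS=X]/) q += len    # operators that consume read bases
      if (op ~ /[MDN=X]/) r += len    # operators that consume reference bases
      if (op == "N") n += len         # skipped region, typically an intron
      s = substr(s, RLENGTH + 1)
    }
    printf "query=%d ref_span=%d intron=%d\n", q, r, n
  }'
}

decode_flag 393
cigar_summary 101M
cigar_summary 10M385N91M
```

For the flag 393, the set bits are 1, 8, 128 and 256, so the record is a secondary alignment of the second read of a pair whose mate is unmapped. For 10M385N91M, the 101 read bases cover 486 positions on the reference.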

The following SAMtools calls can be used to process the BAM files.

samtools sort -n accepted_hits.bam accepted_hits_name_sorted
samtools sort -n unmapped.bam unmapped_name_sorted
samtools merge -n all_reads.bam accepted_hits_name_sorted.bam unmapped_name_sorted.bam

The -n option indicates that the reads are sorted by read name rather than by genomic location. The sorted reads are written to accepted_hits_name_sorted.bam. This is the syntax of older samtools releases, where the last argument of sort is an output prefix; samtools 1.3 and later expects the output file to be given with -o.

Sorted reads:

After sorting, something interesting shows up: fragment number 4 has only one read in this file, and fragments 1, 6 and 7 are not in this file at all. To find the missing reads, we can look in unmapped.bam.

If we want a single file with all of the alignments in it, we can use the samtools merge command.
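
Fragments with only one mapped read can also be listed automatically from a name-sorted file. The sketch below is one possible approach, with a hypothetical function name single_mates and made-up read names. It assumes that the input is sorted by read name and contains only primary alignments, since secondary alignments of a multi-mapped read would inflate the counts.

```shell
# single_mates: read name-sorted SAM records on stdin and print the names
# that occur exactly once, i.e. fragments with a single mapped read.
single_mates() {
  cut -f 1 | uniq -c | awk '$1 == 1 { print $2 }'
}

# Toy input: tab-separated records, only the first two columns shown.
printf 'frag2\t99\nfrag2\t147\nfrag4\t73\nfrag5\t99\nfrag5\t147\n' | single_mates
```

With real data, the records could come from samtools view -F 256 accepted_hits_name_sorted.bam, where -F 256 skips secondary alignments.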



Licence


References
https://github.com/genomicsclass



High-throughput Sequencing
Thu, 25 Sep 2014 17:49:20 +0200
-=-