Introduction to RNA sequencing

This article has been updated, you are now consulting an old release of this article!

Principle
Genes aren’t just a piece of the genome
- Alignement and read counting
Licence
References

This analysis was performed using R (ver. 3.1.0).

Principle

RNA extraction conversion to cDNA fragmentation sequencing this gives reads map to the genome.

We want to know, genes differentially expressed between 2 samples, 1 and 2. Each sample has a series of reads, and we can just simply find how many map by aligning each sample to the genome. In the cartoon down below, we see that sample 1 has more gene expression for gene A and B, than sample 2, and you can do statistics on that.

RNA sequencing principle (source : Thomas Girke, Analysis of RNA-Seq Data with R/Bioconductor, December 14, 2013)

After counting the number of reads which map to the reference genome, we have a big matrix with samples on the columns and gene on the rows. You can apply very similar approaches to the ones that we described earlier for microarray data. However most of the approaches that are currently used to analyze read counts, model the counts with Poisson or negative binomial. Two widely used methods that do this are edgeR and DESeq.

Genes aren’t just a piece of the genome

For humans, genes conatin exons and introns. So the sequence of the RNA transcript aren’t exactly represented on the genome, and in this cartoon we see what happens.

Genes source - wiki

Yellow color is exons and black color is intron. When RNA is transcribed, these introns are spliced out, and we create an RNA transcript that only contains the exons. This means that, we have these regions, called junctions.

For the reads which come from junctions: one part of it maps to one exon and the other half maps to the other exon.

A second complication is that the same gene can have many different transcripts. So one of the new possibilities that RNA-seq opens up, is the ability to measure these different transcripts. You might have a sample where, even though the gene expression for a given gene is the same in the two samples of the two populations, in one population you see an over-representation of one of the transcripts compared to the other.

While we have approaches like edgeR, DESeq, and DEXSeq that basically count each gene or each exon and quantifies the amount, we also have techniques that try to determine from the reads what transcripts are present. Examples of these methods are Trinity Oases, Cufflinks, and Scripture.

Alignement and read counting

You have to map the reads to the genome, taking into account that some of these reads might be split between two distant parts of the genome. There’s a gap. After alignment, we know which read comes from which exon and which reads come from which junction. So example of software that does that is TopHat, GSNAP, MapSplice, STAR.

The output of alignment softwares is bam file. To quantify how much of each transcript we have, software like Cufflinks can be used for reads counting.

You can also use this information to create the gene count tables. You can basically count all the reads that fall into a gene, regardless of what transcript they come from.

You could also do counts for at the exon level. This creates the table where you can then do things like differential expression. You have a table of counts for either exons or genes on the rows, and samples on the columns.

Licence

References

https://github.com/genomicsclass

Enjoyed this article? I’d be very grateful if you’d help it spread by emailing it to a friend, or sharing it on Twitter, Facebook or Linked In.

Show me some love with the like buttons below... Thank you and please don't forget to share and comment below!!

Avez vous aimé cet article? Je vous serais très reconnaissant si vous aidiez à sa diffusion en l'envoyant par courriel à un ami ou en le partageant sur Twitter, Facebook ou Linked In.

Montrez-moi un peu d'amour avec les like ci-dessous ... Merci et n'oubliez pas, s'il vous plaît, de partager et de commenter ci-dessous!