IGV - Integrative Genomics Viewer

Tue, 07 Oct 2014 16:13:55 +0200

This analysis was performed using R (ver. 3.1.0).

Load packages and data

We’re going to use the RNA sequencing experiment in the pasillaBamSubset package.

#biocLite("pasillaBamSubset")
#biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
library(pasillaBamSubset)
#Genes annotated in transcript database
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
#The 2 files of interest
fl1 <- untreated1_chr4()
fl2 <- untreated3_chr4()

IGV - Integrative Genomics Viewer

IGV is freely available for download from the Broad Institute website (https://www.broadinstitute.org/igv/home).

Copy fl1 and fl2 from the R library directory to the current working directory.

IGV needs an index for each BAM file, so we use the `Rsamtools` package to index them.

file.copy(from=fl1,to=basename(fl1))

## [1] FALSE

file.copy(from=fl2,to=basename(fl2))

## [1] FALSE

library(Rsamtools)
indexBam(basename(fl1))

##       untreated1_chr4.bam 
## "untreated1_chr4.bam.bai"

indexBam(basename(fl2))

##       untreated3_chr4.bam 
## "untreated3_chr4.bam.bai"
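
The `## [1] FALSE` results above mean that file.copy() did not copy anything. One common reason, assumed here, is that a file with the same name already exists in the working directory, since overwrite defaults to FALSE. The minimal sketch below illustrates that behaviour with a hypothetical placeholder file in a temporary directory rather than the real BAM files:

```r
# Hypothetical stand-in for a BAM file (illustration only)
src <- tempfile(fileext = ".bam")
writeLines("placeholder", src)
dest <- file.path(tempdir(), "copy_demo.bam")
unlink(dest)  # make sure the destination does not exist yet

first_copy  <- file.copy(from = src, to = dest)                    # destination absent
second_copy <- file.copy(from = src, to = dest)                    # destination now exists
forced_copy <- file.copy(from = src, to = dest, overwrite = TRUE)  # replace it explicitly

print(c(first = first_copy, second = second_copy, forced = forced_copy))
```

If the copy reports FALSE only because the file is already present, indexing can still proceed, as it does in the output above.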

Note that if you have trouble downloading IGV, another option for visualization is the UCSC Genome Browser: http://genome.ucsc.edu/cgi-bin/hgTracks

The UCSC Genome Browser is a great resource: many tracks are already available, including gene annotations, conservation across multiple species, and the ENCODE epigenetic tracks. However, it requires that you upload your genomic files to their server or put your data on a publicly available server. This is not always possible if you are working with confidential data.

Using IGV, look at the gene lgs.

In this example, the data come from an RNA sequencing experiment in Drosophila. In IGV, we need to select the Drosophila melanogaster genome, specifically the dm3 assembly.

Load the 2 BAM files: File -> Load from File -> select the 2 BAM files. Two new tracks are created (one track for each file).

The pasillaBamSubset package contains only a subset of the reads: those that map to chromosome 4. Select chromosome 4, type the gene name lgs in the search field, and click Go. We can see that the coverage is mostly confined to the exons.
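
The same navigation steps could also be scripted. The sketch below writes an IGV batch script from R. It assumes that the two BAM files have been copied and indexed in the working directory as above, and that your IGV version supports the standard batch commands new, genome, load and goto; the file name igv_lgs.batch is a hypothetical choice.

```r
# Absolute paths to the BAM files copied earlier (assumed to be in the working directory)
bam_files <- file.path(getwd(), c("untreated1_chr4.bam", "untreated3_chr4.bam"))

# One IGV batch command per line: reset the session, pick the genome,
# load each file, then jump to the gene of interest
batch_lines <- c("new",
                 "genome dm3",
                 paste("load", bam_files),
                 "goto lgs")

batch_file <- file.path(tempdir(), "igv_lgs.batch")  # hypothetical file name
writeLines(batch_lines, batch_file)
cat(readLines(batch_file), sep = "\n")
```

The resulting file can then be run from IGV via Tools -> Run Batch Script. Paths containing spaces may need extra care.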

If we zoom in (hold Shift and drag on the top part), we can see the individual reads. There are reads on both the plus strand and the minus strand. IGV is a very useful program for quickly visualizing reads from sequencing experiments.

Legend: 1 = reads on the plus strand; 2 = reads on the minus strand; 3 = base mismatched relative to the reference genome.

Video : Visualizing NGS data with IGV

(French)

In the next unit, I'm going to show how to generate coverage plots like the ones shown here, but within Bioconductor.

Footnotes

IGV https://www.broadinstitute.org/igv/home
UCSC Genome Browser: zooms and scrolls over chromosomes, showing the work of annotators worldwide http://genome.ucsc.edu/

References

https://github.com/genomicsclass

Gviz - Visualize genomic data

Mon, 29 Sep 2014 11:11:36 +0200

This analysis was performed using R (ver. 3.1.0).

Introduction to Gviz package

The Gviz package aims to provide a structured visualization framework to plot any type of data along genomic coordinates. It also allows you to integrate publicly available genomic annotation data from sources like UCSC or ENSEMBL.

Individual types of genomic features or data are represented by separate tracks, as in most genome browsers.

By default, Gviz checks all supplied chromosome names for validity in the sense of the UCSC definition (chromosome names have to start with the string chr). You can turn this check off by calling options(ucscChromosomeNames=FALSE).
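
As a rough illustration of what this naming rule means, the base R sketch below tests a hypothetical vector of chromosome names against the "starts with chr" rule and adds the prefix where it is missing. It does not call Gviz, and it is not how Gviz implements the check internally.

```r
# Hypothetical chromosome names in mixed styles
chrom_names <- c("chr7", "chrX", "7", "MT")

# TRUE where the name already follows the "chr" prefix convention
is_ucsc_style <- grepl("^chr", chrom_names)

# Simple prefixing for the others; special cases need extra handling
# (for example, UCSC names the mitochondrial chromosome chrM, not chrMT)
ucsc_names <- ifelse(is_ucsc_style, chrom_names, paste0("chr", chrom_names))

print(data.frame(original = chrom_names, is_ucsc_style, ucsc_names))
```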

In the following examples, we will use UCSC-style names and work with chromosome 7 (chr7) of the human hg19 genome, in line with the code comments and the human gene models used below.

Plot annotation track

Please note that the AnnotationTrack constructor can accommodate many different types of inputs. For instance, the start and end coordinates of the annotation features could be passed in as individual arguments start and end, as a data.frame or even as an IRanges or GRangesList object.

library(Gviz)
library(GenomicRanges)
#Load data : class = GRanges
data(cpgIslands)
cpgIslands

## GRanges with 10 ranges and 0 metadata columns:
##        seqnames               ranges strand
##           <Rle>            <IRanges>  <Rle>
##    [1]     chr7 [26549019, 26550183]      *
##    [2]     chr7 [26564119, 26564500]      *
##    [3]     chr7 [26585667, 26586158]      *
##    [4]     chr7 [26591772, 26593309]      *
##    [5]     chr7 [26594192, 26594570]      *
##    [6]     chr7 [26623835, 26624150]      *
##    [7]     chr7 [26659284, 26660352]      *
##    [8]     chr7 [26721294, 26721717]      *
##    [9]     chr7 [26821518, 26823297]      *
##   [10]     chr7 [26991322, 26991841]      *
##   ---
##   seqlengths:
##    chr7
##      NA

#Annotation track, title ="CpG"
atrack <- AnnotationTrack(cpgIslands, name = "CpG")
plotTracks(atrack)

Add genome axis track

This step is to indicate the genomic coordinates we are currently looking at.

## genomic coordinates
gtrack <- GenomeAxisTrack()
plotTracks(list(gtrack, atrack))

Add chromosome ideogram

To add a chromosome ideogram, we have to indicate a valid UCSC genome (e.g. “hg19”) and a chromosome name (e.g. “chr7”).

An internet connection is required because the function fetches data from UCSC, and this can take quite a long time.

#genome : "hg19" 
gen<-genome(cpgIslands)
#Chromosome name : "chr7"
chr <- as.character(unique(seqnames(cpgIslands)))
#Ideogram track
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
plotTracks(list(itrack, gtrack, atrack))

Ideogram tracks are the one exception among Gviz's track objects in the sense that they are not displayed on the same coordinate system as the other tracks. Instead, the current genomic location is indicated on the chromosome by a red box (or, as in this case, a red line if the width is too small to fit a box).

Add gene model

We can use gene model information from an existing local source. Alternatively, we could download such data from one of the available online resources like UCSC or ENSEMBL, and there are constructor functions to handle these tasks.

For this example we are going to load gene model data from a stored data.frame.

#Load data
data(geneModels)
head(geneModels)

##   chromosome    start      end width strand feature            gene            exon      transcript     symbol
## 1       chr7 26591441 26591829   389      + lincRNA ENSG00000233760 ENSE00001693369 ENST00000420912 AC004947.2
## 2       chr7 26591458 26591829   372      + lincRNA ENSG00000233760 ENSE00001596777 ENST00000457000 AC004947.2
## 3       chr7 26591515 26591829   315      + lincRNA ENSG00000233760 ENSE00001601658 ENST00000430426 AC004947.2
## 4       chr7 26594428 26594538   111      + lincRNA ENSG00000233760 ENSE00001792454 ENST00000457000 AC004947.2
## 5       chr7 26594428 26596819  2392      + lincRNA ENSG00000233760 ENSE00001618328 ENST00000420912 AC004947.2
## 6       chr7 26594641 26594733    93      + lincRNA ENSG00000233760 ENSE00001716169 ENST00000457000 AC004947.2

#Plot
grtrack <- GeneRegionTrack(geneModels, genome = gen,
                           chromosome = chr, name = "Gene Model")
plotTracks(list(itrack, gtrack, atrack, grtrack))

Zoom the plot

We often want to zoom in or out on a particular plotting region to see more details or to get a broader overview.

#Use from and to arguments to zoom
plotTracks(list(itrack, gtrack, atrack, grtrack),
           from = 26700000, to = 26750000)

# Use extend.left and extend.right to zoom
#those arguments are relative to the currently displayed ranges, 
#and can be used to quickly extend the view on one or both ends of the plot.
plotTracks(list(itrack, gtrack, atrack, grtrack),
           extend.left = 0.5, extend.right = 1000000)

# to drop the bounding borders of the exons and 
# to have a nice plot
plotTracks(list(itrack, gtrack, atrack, grtrack),
           extend.left = 0.5, extend.right = 1000000, col = NULL)

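Besides absolute values, extend.left and extend.right accept a value between -1 and 1, which is interpreted as a zoom factor relative to the currently displayed range: a value of 0.5 will cause zooming in to half the currently displayed range.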

Add sequence track and zoom to view sequence

The necessary sequence information is drawn from one of the BSgenome packages.

library(BSgenome.Hsapiens.UCSC.hg19)
strack <- SequenceTrack(Hsapiens, chromosome = chr)
plotTracks(list(itrack, gtrack, atrack, grtrack,
                strack), from = 26591822, to = 26591852, cex = 0.8)

Add data track

DataTrack objects are essentially run-length encoded (Rle) numeric vectors or matrices, and we can use them to add all sorts of numeric data to our genomic coordinate plots. Different visualization options for these tracks are available, including dot plots, histograms and box-and-whisker plots.

#For demonstration purposes we can create a simple DataTrack object from
#randomly sampled data.
set.seed(255)
lim <- c(26700000, 26750000)
coords <- sort(c(lim[1], sample(seq(from = lim[1], 
                                    to = lim[2]), 99), lim[2]))
dat <- runif(100, min = -10, max = 10)
head(dat)

## [1]  3.5382 -9.3984 -2.8372 -1.9871  0.2722  8.2508

##data track
dtrack <- DataTrack(data = dat, start = coords[-length(coords)],
                    end = coords[-1], chromosome = chr, genome = gen,
                    name = "Uniform")
##Plot data track
plotTracks(list(itrack, gtrack, atrack, grtrack,
                dtrack), from = lim[1], to = lim[2])

#Change plot type to histogram
plotTracks(list(itrack, gtrack, atrack, grtrack,dtrack),
           from = lim[1], to = lim[2], type = "histogram")

Such a visualization can be particularly helpful when displaying, for instance, the coverage of NGS reads along a chromosome, or the measurement values of mapped probes from a microarray experiment.
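
The start and end vectors passed to DataTrack above come from a single sorted vector of breakpoints: dropping the last element gives the starts and dropping the first gives the ends, so 101 breakpoints yield 100 adjacent intervals, one per data value. Here is a smaller base R sketch of the same construction, using hypothetical limits and 9 sampled points instead of 99:

```r
set.seed(255)
lim <- c(100, 200)  # hypothetical region limits

# 9 sampled positions plus both limits give 11 sorted breakpoints
breaks <- sort(c(lim[1], sample(seq(from = lim[1], to = lim[2]), 9), lim[2]))

starts <- breaks[-length(breaks)]  # every breakpoint except the last
ends   <- breaks[-1]               # every breakpoint except the first

intervals <- data.frame(start = starts, end = ends, width = ends - starts)
print(intervals)
```

Each interval ends at the coordinate where the next one starts, so the intervals tile the region without gaps.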

Plotting parameters

Setting parameters

#Annotation of transcript
#Change panel and title background color
grtrack <- GeneRegionTrack(geneModels, genome = gen,
                           chromosome = chr, name = "Gene Model", 
                           transcriptAnnotation = "symbol",
                           background.panel = "#FFFEDB",
                           background.title = "darkblue")
plotTracks(list(itrack, gtrack, atrack, grtrack))

Plotting direction

By default all tracks will be plotted in a 5’ -> 3’ direction. It sometimes can be useful to actually show the data relative to the opposite strand.

plotTracks(list(itrack, gtrack, atrack, grtrack),
           reverseStrand = TRUE)

As you can see, the fact that the data has been plotted on the reverse strand is also reflected in the GenomeAxis track.

Track classes

Several parameters can be used to change the appearance of the different tracks.

Display parameters for GenomeAxisTrack objects

Set the position of labels to below, show IDs, change color

axisTrack <- GenomeAxisTrack(range = IRanges(start = c(2000000,4000000), 
                                             end = c(3000000, 7000000),
                                             names = rep("N-stretch", 2))
                             )
plotTracks(axisTrack, from = 1000000, to = 9000000, 
           labelPos = "below",showId=TRUE, col="red")

IdeogramTrack

#Ideogram
ideoTrack <- IdeogramTrack(genome = "hg19", chromosome = "chrX")
plotTracks(ideoTrack, from = 85000000, to = 129000000)

#Show chromosome band ID
plotTracks(ideoTrack, from = 85000000, to = 129000000,
           showId = FALSE, showBandId = TRUE, cex.bands = 0.4)

DataTrack

DataTrack objects essentially constitute run-length encoded numeric vectors or matrices associated with a particular genomic coordinate range. There can be multiple samples in a single data set, and the plotting method provides tools to incorporate sample group information.

Thus the starting point for creating DataTrack objects will always be a set of ranges, either in the form of an IRanges or GRanges object, or individually as start and end coordinates or widths. The second ingredient is a numeric vector of the same length as the number of ranges, or a numeric matrix with the same number of columns.

We will load our sample data from a GRanges object that comes as part of the Gviz package.

#Load data
data(twoGroups)
head(twoGroups)

## GRanges with 6 ranges and 6 metadata columns:
##       seqnames     ranges strand |           control         control.1          control.2            treated
##          <Rle>  <IRanges>  <Rle> |         <numeric>         <numeric>          <numeric>          <numeric>
##   [1]     chrX [  1,  30]      * | -8.96125989500433 -7.65790161676705   9.87956526689231  -5.84375557024032
##   [2]     chrX [ 42,  71]      * |  -4.2114706709981   4.6882571419701   -1.0533055011183   1.03083667811006
##   [3]     chrX [ 84, 113]      * |  2.28711236733943  8.01326935179532   -7.1219984581694  -4.46718293242157
##   [4]     chrX [125, 154]      * |  9.20983788557351 -6.23242623638362   8.59682233538479  -6.32041404955089
##   [5]     chrX [167, 196]      * | 0.406841854564846 -7.05442394595593 -0.551973707042634   9.36362744309008
##   [6]     chrX [209, 238]      * |  5.90989288408309 -5.10347711388022   1.56467542983592 -0.488725560717285
##               treated.1         treated.2
##               <numeric>         <numeric>
##   [1]  9.71352839842439  9.99328563921154
##   [2] -6.77430204115808 0.593712376430631
##   [3] -4.05887754634023  8.05319488979876
##   [4] -1.56806231010705   3.5114610241726
##   [5] -4.88056596834213  1.55288028530777
##   [6]  6.99816173873842 -2.03484911937267
##   ---
##   seqlengths:
##    chrX
##      NA

#Plot data track
dTrack <- DataTrack(twoGroups, name = "uniform")
plotTracks(dTrack)

The default visualization for DataTrack is a dot plot.

The different plot types

The possible plot types are :

#dotplot
plotTracks(DataTrack(twoGroups, name = "p"), type="p")
#lines plot
plotTracks(DataTrack(twoGroups, name = "l"), type="l")
#line and dot plot
plotTracks(DataTrack(twoGroups, name = "b"), type="b")
#lines plot of average
plotTracks(DataTrack(twoGroups, name = "a"), type="a")
#histogram lines
plotTracks(DataTrack(twoGroups, name = "h"), type="h")
#histogram (bar width equal to range width)
plotTracks(DataTrack(twoGroups, name = "histogram"), type="histogram")
#'polygon-type' plot relative to a baseline
plotTracks(DataTrack(twoGroups, name = "polygon"), type="polygon")
#box and whisker plot
plotTracks(DataTrack(twoGroups, name = "boxplot"), type="boxplot")
#false color image of the individual values
plotTracks(DataTrack(twoGroups, name = "heatmap"), type="heatmap")

Example of DataTrack plots :

#Combine a boxplot with an average line and a data grid (g):
plotTracks(dTrack, type = c("boxplot", "a", "g"))

#Heatmap and show sample names
plotTracks(dTrack, type = c("heatmap"), showSampleNames = TRUE,
           cex.sampleNames = 0.6)

Data grouping

The individual samples can be grouped together using a factor variable.

plotTracks(dTrack, groups = rep(c("control", "treated"), each = 3),
           type = c("a", "p"), legend=TRUE)
#Boxplot
plotTracks(dTrack, groups = rep(c("control", "treated"), each = 3),
           type = "boxplot")

#Aggregate group. aggregation can be  mean, median, extreme,
#sum, min and max
plotTracks(dTrack, groups = rep(c("control", "treated"),each = 3),
           type = c("b"), aggregateGroups = TRUE,
           aggregation = "max")
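
Conceptually, aggregating groups reduces the samples of each group to one value per range. The base R sketch below mimics that idea for aggregation = "max" on a small hypothetical matrix (rows are samples, columns are ranges). It illustrates the concept only and is not Gviz's internal code.

```r
# Hypothetical data: 6 samples (rows) measured over 4 ranges (columns)
vals <- matrix(c( 1,  5, -2,  0,
                  3, -4,  6,  2,
                 -1,  7,  0,  9,
                  4,  4,  4,  4,
                  8, -3,  1,  5,
                  2,  6, -5,  3),
               nrow = 6, byrow = TRUE,
               dimnames = list(c("control", "control.1", "control.2",
                                 "treated", "treated.1", "treated.2"),
                               paste0("range", 1:4)))

# Same grouping factor as in the plotTracks() calls above
groups <- factor(rep(c("control", "treated"), each = 3))

# For each group, take the maximum over its samples at every range
group_max <- sapply(split(seq_len(nrow(vals)), groups),
                    function(rows) apply(vals[rows, , drop = FALSE], 2, max))
print(group_max)
```

Swapping max for mean, median, sum or min gives the corresponding per-group summaries.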

Building DataTrack objects from files

The DataTrack class supports the most common file types like wig, bigWig, bedGraph and bam files.

bgFile <- system.file("extdata/test.bedGraph", package = "Gviz")
dTrack2 <- DataTrack(range = bgFile, genome = "hg19",
                     type = "l", chromosome = "chr19", name = "bedGraph")
plotTracks(dTrack2)

Note that the Gviz package uses functionality from the rtracklayer package for most of the file import operations.

The real power of the file support in the Gviz package comes with streaming from indexed files. Only the relevant part of the data has to be loaded during the plotting operation, so the underlying data files may be quite large without decreasing the performance or causing too big of a memory footprint.

We will exemplify this feature here using a small bam file that is provided with the package. bam files contain alignments of sequences (typically from a next generation sequencing experiment) to a common reference. The most natural representation of such data in a DataTrack is to look at the alignment coverage at a given position only and to encode this in a single elementMetadata column.

bamFile <- system.file("extdata/test.bam", package = "Gviz")
dTrack4 <- DataTrack(range = bamFile, genome = "hg19",
                     type = "l", name = "Coverage", window = -1, chromosome = "chr1")
plotTracks(dTrack4, from = 189990000, to = 190000000)
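
To make "alignment coverage at a given position" concrete, here is a minimal base R sketch with four hypothetical reads. It counts, for each position, how many reads overlap it. This only illustrates the idea; the import function used by Gviz works on the BAM file itself and is not reproduced here.

```r
# Hypothetical alignments given as 1-based inclusive start/end positions
read_start <- c(1L, 3L, 3L, 8L)
read_end   <- c(5L, 6L, 9L, 10L)
positions  <- 1:10

# Coverage at a position = number of reads overlapping that position
coverage <- vapply(positions,
                   function(pos) sum(read_start <= pos & read_end >= pos),
                   integer(1))
print(coverage)
```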

AnnotationTrack

AnnotationTrack objects essentially consist of one or several genomic ranges that can be grouped into composite annotation elements if needed. The necessary building blocks are the range coordinates, a chromosome and a genome identifier. This information can be passed to the constructor either as separate function arguments or as IRanges, GRanges or data.frame objects.

aTrack <- AnnotationTrack(start = c(10, 40, 120),
                          width = 15, chromosome = "chrX", strand = c("+","*", "-"),
                          id = c("Huey", "Dewey", "Louie"),
                          genome = "hg19", name = "foo")
plotTracks(aTrack)

Building AnnotationTrack objects from files

The default import function reads the coordinates of all the sequence alignments from the bam file.

#Annotation track
aTrack2 <- AnnotationTrack(range = bamFile, genome = "hg19",
                           name = "Reads", chromosome = "chr1")
plotTracks(aTrack2, from = 189995000, to = 190000000)

We can now plot both the DataTrack representation as well as the AnnotationTrack representation of the bam file together to prove that the underlying data are indeed identical.

plotTracks(list(dTrack4, aTrack2), from = 189990000,
           to = 190000000)

GeneRegionTrack

GeneRegionTrack objects are in principle very similar to AnnotationTrack objects. The only difference is that they are a little more gene/transcript centric. We need to pass start and end positions (or the width) of each annotation feature in the track and also supply the exon, transcript and gene identifiers for each item which will be used to create the transcript groupings.

A somewhat special case is to build a GeneRegionTrack object directly from one of the popular TranscriptDb objects, an option that is treated in more detail below.

data(geneModels)
grtrack <- GeneRegionTrack(geneModels, genome = gen,
                           chromosome = chr, name = "foo", 
                           transcriptAnnotation = "symbol")
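
The transcript grouping can be pictured with plain data frame operations. The base R sketch below uses the six geneModels rows printed earlier (start, end and transcript columns only), counts the exons per transcript and derives each transcript's overall span. It is only an illustration of the grouping idea, not the GeneRegionTrack implementation.

```r
# First six rows of geneModels, as printed by head(geneModels) above
exons <- data.frame(
  start = c(26591441, 26591458, 26591515, 26594428, 26594428, 26594641),
  end   = c(26591829, 26591829, 26591829, 26594538, 26596819, 26594733),
  transcript = c("ENST00000420912", "ENST00000457000", "ENST00000430426",
                 "ENST00000457000", "ENST00000420912", "ENST00000457000"),
  stringsAsFactors = FALSE
)

# Number of exons per transcript, and the span covered by each transcript
exons_per_tx <- table(exons$transcript)
tx_start <- tapply(exons$start, exons$transcript, min)
tx_end   <- tapply(exons$end, exons$transcript, max)

print(data.frame(transcript = names(tx_start),
                 exons = as.vector(exons_per_tx),
                 start = as.vector(tx_start),
                 end = as.vector(tx_end)))
```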

Building GeneRegionTrack objects from TranscriptDbs

The GenomicFeatures package provides an elegant framework to download gene model information from online sources and to store it locally in a SQLite database.

A nice bonus when building GeneRegionTracks from TranscriptDb objects is that we get additional information about coding and non-coding regions of the transcripts, i.e., coordinates of the 5’ and 3’ UTRs and of the CDS regions.

library(GenomicFeatures)
samplefile <- system.file("extdata", "UCSC_knownGene_sample.sqlite",
                          package = "GenomicFeatures")
txdb <- loadDb(samplefile)
txTr <- GeneRegionTrack(txdb, chromosome = "chr6", start = 300000, end = 350000)
#feature(txTr)
plotTracks(txTr)

BiomartGeneRegionTrack

The BiomartGeneRegionTrack class provides a direct interface to the ENSEMBL Biomart service. We just enter a genome, a chromosome, and a start and end position on this chromosome, and the constructor function BiomartGeneRegionTrack will automatically contact ENSEMBL, fetch the necessary information and build the gene model on the fly.

Please note that you will need an internet connection for this to work, and that contacting Biomart can take a significant amount of time depending on usage and network traffic.

biomTrack <- BiomartGeneRegionTrack(genome = "hg19",
                                    chromosome = chr, start = 20000000, end = 21000000,
                                    name = "ENSEMBL")
plotTracks(biomTrack, col.line = NULL, col = NULL)

Sequence Track

library(BSgenome.Hsapiens.UCSC.hg19)
sTrack <- SequenceTrack(Hsapiens)
#sequence track : add 5'->3'
plotTracks(sTrack, chromosome = 1, from = 20000,to = 20050,
           add53=TRUE)
#The complement
plotTracks(sTrack, chromosome = 1, from = 20000,to = 20050,
           add53=TRUE, complement = TRUE)

AlignmentsTrack

Plots of aligned sequences, typically from next generation sequencing experiments can be quite helpful, for instance when visually inspecting the validity of a called SNP. Those alignments are usually stored in BAM files.

RNAseq experiment

For this demonstration let’s use a small BAM file for which paired NGS reads have been mapped to an extract of the human hg19 genome. The data originate from an RNASeq experiment, and the alignments have been performed using the STAR aligner allowing for gaps. We also download some gene annotation data for that region from Biomart.

afrom=2960000
ato=3160000
#bam file
alTrack <- AlignmentsTrack(system.file(package = "Gviz", "extdata", "gapped.bam"), isPaired = TRUE)
bmt <- BiomartGeneRegionTrack(genome = "hg19", chromosome = "chr12",
                              start = afrom, end = ato, filter = list(with_ox_refseq_mrna = TRUE),
                              stacking = "dense")
plotTracks(c(bmt, alTrack), from = afrom, to = ato, chromosome = "chr12")

Now this already shows us the general layout of the track: on top we have a panel with the read coverage information in the form of a histogram, and below that a pile-up view of the individual reads. There is only a certain amount of vertical space available for the plotting, and not the whole depth of the pile-up can be displayed here. This fact is indicated by the white downward-pointing arrows in the title panel. We could address this issue by playing around with the max.height, min.height or stackHeight display parameters, which all control the height or the vertical spacing of the stacked reads. Or we could reduce the size of the coverage section by setting the coverageHeight or the minCoverageHeight parameters.

plotTracks(c(bmt, alTrack), from = afrom, to = ato,
           chromosome = "chr12", min.height = 0, coverageHeight = 0.08,
           minCoverageHeight = 0)

From that far out the pile-ups are not particularly useful, and we can turn those off by setting the type display parameter accordingly.

plotTracks(c(alTrack, bmt), from = afrom, to = ato, chromosome = "chr12", type = "coverage")

Let’s zoom in a bit further to check out the details of the pile-ups section:

plotTracks(c(bmt, alTrack), from = afrom + 12700,
           to = afrom + 15200, chromosome = "chr12")

The direction of the individual reads is indicated by the arrow head, and read pairs are connected by a bright gray line. Gaps in the alignments are shown by the connecting dark gray lines. On devices that support transparency we can also see that some of the read pairs are actually overlapping.

We can control whether the data should be treated as paired-end or single-end data by setting the isPaired argument in the constructor. Here is how we could take a look at the data in the same file, but in single-end mode.

alTrack <- AlignmentsTrack(system.file(package = "Gviz",
                                       "extdata", "gapped.bam"), isPaired = FALSE)
plotTracks(c(bmt, alTrack), from = afrom + 12700,
           to = afrom + 15200, chromosome = "chr12")

DNAseq experiment

To better show the features of the AlignmentsTrack for sequence variants we will load a different data set, this time from a whole genome DNASeq SNP calling experiment. Again the reference genome is hg19 and the alignments have been performed using Bowtie2.

We need to tell the AlignmentsTrack about the reference genome (sequenceTrack).

afrom <- 44945200
ato <- 44947200
alTrack <- AlignmentsTrack(system.file(package = "Gviz","extdata", "snps.bam"), isPaired = TRUE)
plotTracks(c(alTrack, sTrack), chromosome = "chr21", from = afrom,to = ato)

The mismatched bases are indicated on both the individual reads in the pileup section and also in the coverage plot in the form of a stacked histogram.

When zooming in to one of the obvious heterozygous SNP positions we can reveal even more details.

#Zoom
plotTracks(c(alTrack, sTrack), chromosome = "chr21", from = 44946590, to = 44946660)

#show individual letters
plotTracks(c(alTrack, sTrack), chromosome = "chr21",
           from = 44946590, to = 44946660, cex = 0.5, min.height = 8)

Track highlighting and overlays

Highlighting

#highlight
ht <- HighlightTrack(trackList = list(atrack, grtrack,dtrack),
                     start = c(26705000, 26720000), width = 7000, chromosome = 7)
plotTracks(list(itrack, gtrack, ht), from = lim[1], to = lim[2])

Overlays

For certain applications it can make sense to overlay multiple tracks on the same area of the plot. For the purpose of an instructive example we will generate a second DataTrack object and combine it with the existing one (dtrack) created in the "Add data track" section.

#create data
dat <- runif(100, min = -2, max = 22)
dtrack2 <- DataTrack(data = dat, start = coords[-length(coords)],
                     end = coords[-1], chromosome = chr, genome = gen,
                     name = "Uniform2", groups = factor("sample 2",levels = c("sample 1", "sample 2")),
                     legend = TRUE)
displayPars(dtrack) <- list(groups = factor("sample 1",levels = c("sample 1", "sample 2")), legend = TRUE)
ot <- OverlayTrack(trackList = list(dtrack2, dtrack))
ylims <- extendrange(range(c(values(dtrack), values(dtrack2))))
plotTracks(list(itrack, gtrack, ot), from = lim[1], to = lim[2], ylim = ylims, type = c("smooth", "p"))
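
The ylim passed above comes from extendrange(), a base R helper from grDevices that pads a range by a fraction of its width (5% on each side by default), so that the extreme values are not drawn right on the panel border. A minimal sketch with hypothetical values standing in for the two tracks:

```r
vals1 <- c(-10, 0, 10)  # hypothetical stand-in for values(dtrack)
vals2 <- c(-2, 5, 22)   # hypothetical stand-in for values(dtrack2)

# Common range over both tracks, then padded by 5% of its width on each side
shared_range <- range(c(vals1, vals2))
ylims_demo <- grDevices::extendrange(shared_range)
print(ylims_demo)
```

Using one shared ylim for both tracks keeps the overlaid data on the same scale.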

On devices that support it, alpha blending can be a useful tool to tease even more information out of track overlays, at least when comparing just a small number of samples. The resulting transparency effectively eliminates the problem of overplotting. The following example will only work if the plot is rendered on a device with alpha blending support.

displayPars(dtrack) <- list(alpha.title = 1, alpha = 0.5)
displayPars(dtrack2) <- list(alpha.title = 1, alpha = 0.5)
ot <- OverlayTrack(trackList = list(dtrack, dtrack2))
plotTracks(list(itrack, gtrack, ot), from = lim[1],
           to = lim[2], ylim = ylims, type = c("hist"), window = 30)

Footnotes

Gviz http://www.bioconductor.org/packages/release/bioc/html/Gviz.html

ggbio - Visualize genomic data

Sun, 28 Sep 2014 11:49:42 +0200

This analysis was performed using R (ver. 3.1.0).

ggbio is a package built on top of ggplot2 to easily visualize genomic data.

Building your first track

In this chapter, you will learn:

1. How to add an ideogram track.
2. How to add a gene model track.
3. How to add a reference track.
4. How to add a track from bam files to visualize coverage and mismatch summary.
5. How to add a track for a vcf file to visualize variants.

Add ideogram track : Plot single chromosome with cytoband

Cytoband information for hg19, hg18, mm10 and mm9 is built into the package, so you don't have to download it on the fly.

library(ggbio)
#chr 1 is automatically drawn by default (subchr="chr1")
p.ideo <- Ideogram(genome = "hg19")
p.ideo
#Highlights a region on "chr2"
library(GenomicRanges)
p.ideo + xlim(GRanges("chr2", IRanges(1e8, 1e8+10000000)))

The color and fill arguments can be used to change the color and the fill color of the highlighted region.

Add gene model track

A gene model is composed of genetic features (CDS, UTRs, introns, exons) and non-genetic regions. ggbio supports three methods to make a gene model track:

OrganismDb object: recommended; supports gene symbols and other combinations of columns as labels.
TranscriptDb object: doesn’t support gene symbol labeling.
GRangesList object: flexible; if you don’t have an annotation package available for the first two methods, you can prepare a data set parsed from a gtf file and plot it as a gene model track.

In this section, we’ll show how to make a gene model from OrganismDb and TranscriptDb objects. To make a gene model from a GRangesList object, see the vignette of the ggbio package.

Gene model from OrganismDb object

OrganismDb object has a simpler API to retrieve data from different annotation resources, so we could label our transcripts in different ways.

library(ggbio)
library(Homo.sapiens)

#load gene symbol : GRanges, one gene/row
data(genesymbol, package = "biovizBase")
#retrieve information of the gene of interest
wh <- genesymbol[c("BRCA1", "NBR1")]
wh <- range(wh, ignore.strand = TRUE)

#Plot the different transcripts  for our gene of interest
p.txdb <- autoplot(Homo.sapiens, which = wh)
p.txdb

#Change intron geometry, use gap.geom
autoplot(Homo.sapiens, which = wh, gap.geom = "chevron")

Several arguments change the colors: label.color (color of the label), color (line color) and fill (fill color of exons):

autoplot(Homo.sapiens, which = wh, label.color = "black", color = "brown",
fill = "brown")

Labels can be turned off by setting label to FALSE. You can also use an expression to build a flexible label from a combination of column names.

columns(Homo.sapiens)

##  [1] "GOID"         "TERM"         "ONTOLOGY"     "DEFINITION"  
##  [5] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
##  [9] "ACCNUM"       "ALIAS"        "CHR"          "CHRLOC"      
## [13] "CHRLOCEND"    "ENZYME"       "MAP"          "PATH"        
## [17] "PMID"         "REFSEQ"       "SYMBOL"       "UNIGENE"     
## [21] "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"    
## [25] "UNIPROT"      "GO"           "EVIDENCE"     "GOALL"       
## [29] "EVIDENCEALL"  "ONTOLOGYALL"  "OMIM"         "UCSCKG"      
## [33] "CDSID"        "CDSNAME"      "CDSCHROM"     "CDSSTRAND"   
## [37] "CDSSTART"     "CDSEND"       "EXONID"       "EXONNAME"    
## [41] "EXONCHROM"    "EXONSTRAND"   "EXONSTART"    "EXONEND"     
## [45] "GENEID"       "TXID"         "EXONRANK"     "TXNAME"      
## [49] "TXCHROM"      "TXSTRAND"     "TXSTART"      "TXEND"

#Flexible label
autoplot(Homo.sapiens, which = wh, columns = c("GENENAME", "GO"), names.expr = "GENENAME::GO")

Gene model from TranscriptDb object

A TranscriptDb object doesn't contain any gene symbol information, so the transcript ID (tx id) is used as the default label.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
autoplot(txdb, which = wh)

Add a reference track

To add a reference track, we need to load a BSgenome object from the annotation package. You can choose to plot the sequence as text, rect or segment.

You can pass a zoom factor to the zoom function: a factor greater than 1 zooms out, and a factor smaller than 1 zooms in.

library(BSgenome.Hsapiens.UCSC.hg19)
bg <- BSgenome.Hsapiens.UCSC.hg19
p.bg <- autoplot(bg, which = wh)
## no geom
p.bg
## segment
p.bg + zoom(1/100)
## rectangle
p.bg + zoom(1/1000)
## text
p.bg + zoom(1/2500)
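
To get a feel for what these factors do to the displayed region, here is a small base R sketch. The helper zoom_range is hypothetical and is not part of ggbio; it assumes that the factor scales the width of the current view around its center, which is an assumption about the behaviour rather than a description of ggbio's code.

```r
# Hypothetical helper (not a ggbio function): scale a view around its center
zoom_range <- function(start, end, factor) {
  center <- (start + end) / 2
  half_width <- (end - start) * factor / 2
  c(start = center - half_width, end = center + half_width)
}

zoomed_in  <- zoom_range(1000, 2000, 1/10)  # factor < 1: narrower view
zoomed_out <- zoom_range(1000, 2000, 2)     # factor > 1: wider view
print(rbind(zoomed_in, zoomed_out))
```

With a factor of 1/2500, a view of several hundred kilobases shrinks to a few hundred bases, which is the scale at which individual letters become readable.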

To override a semantic zoom threshold, you simply provide a geom explicitly.

library(BSgenome.Hsapiens.UCSC.hg19)
bg <- BSgenome.Hsapiens.UCSC.hg19
## force to use geom 'segment' at this level
autoplot(bg, which = resize(wh, width = width(wh)/2000), geom = "segment")

Add an alignment track

Create a bam object

The Rsamtools package is required. The following code is just an example of creating a BamFile object (the file names are placeholders). This object can then be used in the autoplot function.

library(Rsamtools)
bam<-BamFile(file="file.bam", index="file.bai")
#use it for autoplot
autoplot(bam, which = wh)

Visualize bam file

ggbio supports visualization of alignment files stored in bam format. The autoplot method accepts:

bam file path (indexed)
BamFile object

The simplest approach is to pass a file path to the autoplot function. You can stream a specific region by providing the 'which' parameter. Otherwise, use the method 'estimate' to show the overall estimated coverage.

fl.bam <- system.file("extdata", "wg-brca1.sorted.bam", package = "biovizBase")
#keeps only the seqlevels in value and removes all others
wh <- keepSeqlevels(wh, "chr17")
autoplot(fl.bam, which = wh)

Mismatch proportion

To show the mismatch proportion, you have to provide a reference sequence. The mismatch proportion is then color coded in the bar chart.

library(BSgenome.Hsapiens.UCSC.hg19)
bg <- BSgenome.Hsapiens.UCSC.hg19
p.mis <- autoplot(fl.bam, bsgenome = bg, which = wh, stat = "mismatch")
p.mis

View all coverage distribution

To view the overall estimated coverage distribution, use method ‘estimate’. The ‘which’ parameter also accepts character vectors of chromosome names. There is also a hidden value called ‘..coverage..’ that lets you apply simple transformations in aes().

#View all coverage distribution
autoplot(fl.bam, method = "estimate")
#Select chromosomes of interest
#Log transformation of coverage
autoplot(fl.bam, method = "estimate", 
         which = paste0("chr", 17:18),
         aes(y = log(..coverage..)))
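One practical point about log-transforming coverage, shown here as a minimal base-R sketch with made-up values (the vector `toy_coverage` is illustrative and not read from the BAM file): positions with zero coverage become -Inf under log(), whereas log1p(), which computes log(1 + x), keeps them at 0. Whether that alternative suits a given plot is a judgment call.

```r
# Made-up coverage values, including a position with no reads
toy_coverage <- c(0, 1, 10, 100)

log_cov   <- log(toy_coverage)    # zero coverage maps to -Inf
log1p_cov <- log1p(toy_coverage)  # log(1 + x): zero coverage stays at 0

print(log_cov)
print(log1p_cov)
```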

Add a variants track : visualize vcf file

This track is supported by semantic zoom.

To view your variants file, you could :

- Import it as a VCF object using the VariantAnnotation package, then use autoplot
- Convert it to a VRanges object and use autoplot
- Simply provide the vcf file path in autoplot()

library(VariantAnnotation)
fl.vcf <- system.file("extdata", "17-1409-CEU-brca1.vcf.bgz", package="biovizBase")
vcf <- readVcf(fl.vcf, "hg19")
vr <- as(vcf[, 1:3], "VRanges")
vr <- renameSeqlevels(vr, value = c("17" = "chr17"))
## small region contains data
gr17 <- GRanges("chr17", IRanges(41234400, 41234530))
p.vr <- autoplot(vr, which = wh)

## none geom
p.vr

## rect geom
p.vr + xlim(gr17)
## text geom
p.vr + xlim(gr17) + zoom()

You can simply override geom

autoplot(vr, which = wh, geom = "rect", arrow = FALSE)

Building your tracks

gr17 <- GRanges("chr17", IRanges(41234415, 41234569))
tks <- tracks(p.ideo, mismatch = p.mis, dbSNP = p.vr, ref = p.bg, gene = p.txdb,
heights = c(2, 3, 3, 1, 4)) + xlim(gr17) + theme_tracks_sunset()
tks

Simple navigation

You could zoom in and zoom out, or go through view chunks one by one.

- zoom: put a factor inside and you can zoom in or zoom out
- nextView: switch to next view
- prevView: switch to previous view

## zoom in
tks + zoom()
## zoom in with scale
p.txdb + zoom(1/8)
## zoom out
p.txdb + zoom(2)
## next view page
p.txdb + nextView()
## previous view page
p.txdb + prevView()

Don't forget that xlim accepts a GRanges object (single row), so you could simply prepare a GRanges object that stores the regions of interest and go through them one by one.

Overview plots

An overview is a good way to show all events at the same time and to give overall summary statistics for the whole genome. In this chapter, we will introduce three different layouts that are widely used in genomic data visualization.

Circular plots

We are going to visualize somatic mutations as segments.

- Rule of thumb: seqlengths, seqlevels and chromosome names should be exactly the same.
- To use circle, you have to use the ggbio constructor at the beginning instead of ggplot.

All the raw data has been processed and stored in GRanges objects ready for use; you can simply load the sample data from biovizBase.

Load data

#Load the data
data("CRC", package = "biovizBase")
head(hg19sub)

## GRanges with 6 ranges and 0 metadata columns:
##       seqnames         ranges strand
##                  
##   [1]        1 [1, 249250621]      *
##   [2]        2 [1, 243199373]      *
##   [3]        3 [1, 198022430]      *
##   [4]        4 [1, 191154276]      *
##   [5]        5 [1, 180915260]      *
##   [6]        6 [1, 171115067]      *
##   ---
##   seqlengths:
##            1         2         3 ...        20        21        22
##    249250621 243199373 198022430 ...  63025520  48129895  51304566

Create ideogram, label and scale track

The tracks are laid out in the order you add them, from the inside of the circle to the outside.

p <- ggbio() + circle(hg19sub, geom = "ideo", fill = "gray70") + #Ideogram
circle(hg19sub, geom = "scale", size = 2) + #Scale
circle(hg19sub, geom = "text", aes(label = seqnames), vjust = 0, size = 3) # label
p #print plot

Show somatic mutation

We add a “rectangle” track to show the somatic mutations.

head(mut.gr)

## GRanges with 6 ranges and 10 metadata columns:
##       seqnames                 ranges strand | Hugo_Symbol Entrez_Gene_Id
##                           |          
##   [1]        1 [ 11003085,  11003085]      + |      TARDBP          23435
##   [2]        1 [ 62352395,  62352395]      + |       INADL          10207
##   [3]        1 [194960885, 194960885]      + |         CFH           3075
##   [4]        2 [ 10116508,  10116508]      - |        CYS1         192668
##   [5]        2 [ 33617747,  33617747]      + |     RASGRP3          25780
##   [6]        2 [ 73894280,  73894280]      + |     C2orf78         388960
##         Center NCBI_Build   Strand Variant_Classification Variant_Type
##                              
##   [1]    Broad         36        +               Missense          SNP
##   [2]    Broad         36        +               Missense          SNP
##   [3]    Broad         36        +               Missense          SNP
##   [4]    Broad         36        -               Missense          SNP
##   [5]    Broad         36        +               Missense          SNP
##   [6]    Broad         36        +               Missense          SNP
##       Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
##                                   
##   [1]                G                 G                 A
##   [2]                T                 T                 G
##   [3]                G                 G                 A
##   [4]                C                 C                 T
##   [5]                C                 C                 T
##   [6]                T                 T                 C
##   ---
##   seqlengths:
##            1         2         3 ...        20        21        22
##    249250621 243199373 198022430 ...  63025520  48129895  51304566

p <- ggbio() + circle(mut.gr, geom = "rect", color = "steelblue") + #somatic mutation
circle(hg19sub, geom = "ideo", fill = "gray70") +#Ideogram
circle(hg19sub, geom = "scale", size = 2) +#Scale
circle(hg19sub, geom = "text", aes(label = seqnames), vjust = 0, size = 3)#label
p

Many other examples are available in the ggbio package vignette.

Footnotes

ggbio http://www.bioconductor.org/packages/release/bioc/html/ggbio.html

Licence

Visualize NGS data with R and Bioconductor

Fri, 26 Sep 2014 04:32:38 +0200

Load packages and data
Simple plot
Extracting the gene of interest using the transcript database
Gviz
ggbio
Footnotes
Licence
References

This analysis was performed using R (ver. 3.1.0).

Load packages and data

We’re going to use the RNA sequencing experiment in the pasillaBamSubset package.

#biocLite("pasillaBamSubset")
#biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
library(pasillaBamSubset)
#Genes annotated in transcript database
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
#The 2 files of interest
fl1 <- untreated1_chr4()
fl2 <- untreated3_chr4()

fl1 and fl2 are BAM files from a Drosophila RNA sequencing experiment.

Simple plot

library(GenomicRanges)

Note: if you are using Bioconductor 2.14, paired with R 3.1, you should also load the following library. If you are using Bioconductor 2.13, paired with R 3.0.x, you do not need to load it, and it will not be available to you.

library(GenomicAlignments)

We read in the alignments from the file fl1. Then we use the coverage function to tally up the base pair coverage. We then extract the subset of coverage which overlaps our gene of interest (lgs gene), and convert this coverage from an RleList into a numeric vector. Rle objects are compressed, such that repeating numbers are stored as a number and a length.

#x is a GAlignments object; each element is a read
x <- readGAlignments(fl1)
#Coverage of the reads : this will generate an RleList
#Rle stands for run-length encoding
xcov <- coverage(x)
xcov

## RleList of length 8
## $chr2L
## integer-Rle of length 23011544 with 1 run
##   Lengths: 23011544
##   Values :        0
## 
## $chr2R
## integer-Rle of length 21146708 with 1 run
##   Lengths: 21146708
##   Values :        0
## 
## $chr3L
## integer-Rle of length 24543557 with 1 run
##   Lengths: 24543557
##   Values :        0
## 
## $chr3R
## integer-Rle of length 27905053 with 1 run
##   Lengths: 27905053
##   Values :        0
## 
## $chr4
## integer-Rle of length 1351857 with 122061 runs
##   Lengths:  891   27    5   12   13   45 ...    3  106   75 1600   75 1659
##   Values :    0    1    2    3    4    5 ...    6    0    1    0    1    0
## 
## ...
## <3 more elements>

#Extract one element of the list
#we have zero coverage for the first 891 base pairs,
#and then we have coverage of one for 27 base pairs, etc.
xcov$chr4

## integer-Rle of length 1351857 with 122061 runs
##   Lengths:  891   27    5   12   13   45 ...    3  106   75 1600   75 1659
##   Values :    0    1    2    3    4    5 ...    6    0    1    0    1    0
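To make the run-length encoding concrete, here is a minimal sketch using base R's rle(), which is an analogue of the Bioconductor Rle class rather than the class itself. The toy vector imitates the start of the chr4 coverage shown above (891 zeros, then 27 ones, then 5 twos); `toy_cov` and `runs` are illustrative names.

```r
# Toy coverage vector: 891 positions at 0, 27 at 1, 5 at 2
toy_cov <- c(rep(0, 891), rep(1, 27), rep(2, 5))

# Compress it: one (length, value) pair per run of identical values
runs <- rle(toy_cov)
print(runs$lengths)  # 891 27 5
print(runs$values)   # 0 1 2

# Decompress and check that nothing was lost
restored <- inverse.rle(runs)
print(identical(restored, toy_cov))
```

The 923 stored numbers collapse to three runs, which is why coverage over whole chromosomes stays cheap to hold in memory.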

#Now zoom in on a range around the gene of interest, lgs.
z <- GRanges("chr4",IRanges(456500,466000))

# only available for Bioconductor 2.14
xcov[z]#subset the coverage of the region of interest

## RleList of length 1
## $chr4
## integer-Rle of length 9501 with 1775 runs
##   Lengths: 1252   10   52    4    7    2 ...   10    7   12  392   75 1041
##   Values :    0    2    3    4    5    6 ...    3    2    1    0    1    0

# Equivalent in Bioconductor 2.13
#xcov$chr4[ranges(z)]# Works for all versions of Bioconductor

#Plot of the coverage around the region of interest
xnum <- as.numeric(xcov$chr4[ranges(z)])#Uncompress the coverage
plot(xnum)

We can do the same for the other file. Because fl2 comes from a paired-end sequencing experiment, we now have pairs of reads, each of which corresponds to one fragment.

y <- readGAlignmentPairs(fl2)
ycov <- coverage(y)
ynum <- as.numeric(ycov$chr4[ranges(z)])
plot(xnum, type="l", col="blue", lwd=2)
lines(ynum, col="red", lwd=2)

We can zoom in on a single exon, located a little over 6,000 base pairs into the plotted region:

plot(xnum, type="l", col="blue", lwd=2, xlim=c(6200,6600))
lines(ynum, col="red", lwd=2)
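Note that the x axis of these plots is an index into xnum, not a genomic coordinate. As a minimal sketch of the conversion, index = position - start of region + 1. The region start is the one used for z above, and the exon coordinates are those of the lgs exon at chr4:462806-463015 listed in the gene model later in this article; `region_start`, `exon` and `idx` are illustrative names.

```r
# Start of the plotted region z (chr4:456500-466000)
region_start <- 456500

# One exon of lgs, in genomic coordinates
exon <- c(start = 462806, end = 463015)

# Convert genomic positions to indices along the coverage vector
idx <- exon - region_start + 1
print(idx)  # 6307 6516
```

Both indices fall inside the xlim of 6200 to 6600 used in the plot.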

Extracting the gene of interest using the transcript database

Extract information about gene of interest.

Suppose we are interested in visualizing the gene lgs. We can extract it from the transcript database TxDb.Dmelanogaster.UCSC.dm3.ensGene on Bioconductor, but first we need to look up the Ensembl gene name. We will use the functions that we learned in the previous chapter.

# biocLite("biomaRt")
library(biomaRt)
#load the Drosophila Ensembl gene BioMart.
m <- useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
lf <- listFilters(m)
lf[grep("name", lf$description, ignore.case=TRUE),]

##                             name
## 1                chromosome_name
## 12  with_flybasename_translation
## 15   with_flybasename_transcript
## 19         with_flybasename_gene
## 62              flybasename_gene
## 63        flybasename_transcript
## 64       flybasename_translation
## 86                 wikigene_name
## 98                go_parent_name
## 194               so_parent_name
##                                      description
## 1                                Chromosome name
## 12                with FlyBaseName protein ID(s)
## 15             with FlyBaseName transcript ID(s)
## 19                   with FlyBaseName gene ID(s)
## 62           FlyBaseName Gene ID(s) [e.g. cul-2]
## 63  FlyBaseName Transcript ID(s) [e.g. cul-2-RB]
## 64     FlyBaseName Protein ID(s) [e.g. cul-2-PB]
## 86                 WikiGene Name(s) [e.g. Ir21a]
## 98                              Parent term name
## 194                             Parent term name

#get the ensembl gene name
map <- getBM(mart = m,
  attributes = c("ensembl_gene_id", "flybasename_gene"),
  filters = "flybasename_gene", 
  values = "lgs")
map

##   ensembl_gene_id flybasename_gene
## 1     FBgn0039907              lgs

Now we extract the exons for each gene, and then the exons for the gene lgs.

#get the exons out of the transcript database.
library(GenomicFeatures)
grl <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, by="gene")
gene <- grl[[map$ensembl_gene_id[1]]]
#View the 6 exons of lgs gene
gene

## GRanges with 6 ranges and 2 metadata columns:
##       seqnames           ranges strand |   exon_id   exon_name
##                     |  
##   [1]     chr4 [457583, 459544]      - |     63350        
##   [2]     chr4 [459601, 459791]      - |     63351        
##   [3]     chr4 [460074, 462077]      - |     63352        
##   [4]     chr4 [462806, 463015]      - |     63353        
##   [5]     chr4 [463490, 463780]      - |     63354        
##   [6]     chr4 [463839, 464533]      - |     63355        
##   ---
##   seqlengths:
##        chr2L     chr2R     chr3L ...   chrXHet   chrYHet chrUextra
##     23011544  21146708  24543557 ...    204112    347038  29004656

Finally we can plot these ranges to see what it looks like:

#Plot each exon as an arrow
rg <- range(gene)
plot(c(start(rg), end(rg)), c(0,0), type="n", xlab=seqnames(gene)[1], ylab="")
arrows(start(gene),rep(0,length(gene)),
       end(gene),rep(0,length(gene)),
       lwd=3, length=.1)

But actually, the gene is on the minus strand, so the arrows should point the other way. We add an argument which corrects for minus-strand genes:

For a plus-strand gene, use code=2, which puts the arrowhead at the end of each segment. For a minus-strand gene, use code=1, which puts the arrowhead at the start.

plot(c(start(rg), end(rg)), c(0,0), type="n", xlab=seqnames(gene)[1], ylab="")
arrows(start(gene),rep(0,length(gene)),
       end(gene),rep(0,length(gene)),
       lwd=3, length=.1, 
       code=ifelse(as.character(strand(gene)[1]) == "+", 2, 1))
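The strand-to-code rule can also be written as a small helper, sketched here in base R. `arrow_code` is a hypothetical name, not part of any package; unlike the call above, which looks only at the strand of the first exon, it returns one code per strand value.

```r
# Hypothetical helper: map strand symbols to the 'code' argument of arrows().
# code = 2 draws the head at the end point, code = 1 at the start point.
arrow_code <- function(strand) {
  ifelse(as.character(strand) == "-", 1, 2)
}

codes <- arrow_code(c("+", "-", "-"))
print(codes)  # 2 1 1
```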

The next sections show two Bioconductor packages for visualizing genomic data, which allow you to avoid rewriting all of this messy code every time you want to draw exons or coverage.

Gviz

We will briefly show two packages for visualizing genomic data in Bioconductor. Note that each of these has an extensive vignette covering plots for many kinds of data. We will show here how to make the same coverage plots as before:

#biocLite("Gviz")
library(Gviz)
#You set up a Genome Axis Track
gtrack <- GenomeAxisTrack()

#specify an Annotation Track
atrack <- AnnotationTrack(gene, name = "Gene Model")
plotTracks(list(gtrack, atrack))

The Gviz package also allows you to draw data, such as coverage. We already have the coverage as an RleList, but Gviz expects data to be provided as GRanges objects, so we first convert the RleList coverage to GRanges objects:

#Convert coverage to GRanges object
xgr <- as(xcov, "GRanges")
ygr <- as(ycov, "GRanges")
#The first four chromosomes have zero coverage, and chr4 starts with
#891 base pairs of zero coverage: the same information as in the
#RleList printed earlier.
xgr

## GRanges with 122068 ranges and 1 metadata column:
##            seqnames              ranges strand   |     score
##                               | 
##        [1]    chr2L       [1, 23011544]      *   |         0
##        [2]    chr2R       [1, 21146708]      *   |         0
##        [3]    chr3L       [1, 24543557]      *   |         0
##        [4]    chr3R       [1, 27905053]      *   |         0
##        [5]     chr4       [1,      891]      *   |         0
##        ...      ...                 ...    ... ...       ...
##   [122064]     chr4 [1350124,  1350198]      *   |         1
##   [122065]     chr4 [1350199,  1351857]      *   |         0
##   [122066]     chrM [      1,    19517]      *   |         0
##   [122067]     chrX [      1, 22422827]      *   |         0
##   [122068]  chrYHet [      1,   347038]      *   |         0
##   ---
##   seqlengths:
##       chr2L    chr2R    chr3L    chr3R     chr4     chrM     chrX  chrYHet
##    23011544 21146708 24543557 27905053  1351857    19517 22422827   347038
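Conceptually, the conversion turns each run of constant coverage into one range carrying a score. The base-R sketch below imitates this for a single toy chromosome; it illustrates the idea only and is not how as(xcov, "GRanges") is implemented. The names `toy_cov` and `run_table` are illustrative.

```r
# Toy per-base coverage for one short chromosome
toy_cov <- c(0, 0, 0, 1, 1, 2, 2, 2, 0)

# Run-length encode, then derive the start and end of each run
r         <- rle(toy_cov)
run_end   <- cumsum(r$lengths)
run_start <- run_end - r$lengths + 1

# One row per run: a range plus its coverage value
run_table <- data.frame(start = run_start, end = run_end, score = r$values)
print(run_table)
```

Each row plays the role of one element of xgr: a range together with its score.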

#create two datatracks
#plot coverage which overlap z
dtrack1 <- DataTrack(xgr[xgr %over% z], name = "sample 1")
dtrack2 <- DataTrack(ygr[ygr %over% z], name = "sample 2")
plotTracks(list(gtrack, atrack, dtrack1, dtrack2))
plotTracks(list(gtrack, atrack, dtrack1, dtrack2), type="polygon")

ggbio

ggbio builds on the ggplot2 package, which is a whole other way of drawing plots in R. ggbio makes things very easy: if you indicate the BAM file and the range of interest, it will read in the BAM file, parse the coverage, read the alignments, extract the information, and draw the plot.

#biocLite("ggbio")
library(ggbio)
#autoplot gene model
autoplot(gene)
autoplot(fl1, which=z)
autoplot(fl2, which=z)

Footnotes

IGV https://www.broadinstitute.org/igv/home
Gviz http://www.bioconductor.org/packages/release/bioc/html/Gviz.html
ggbio http://www.bioconductor.org/packages/release/bioc/html/ggbio.html
UCSC Genome Browser: zooms and scrolls over chromosomes, showing the work of annotators worldwide http://genome.ucsc.edu/
Ensembl genome browser: genome databases for vertebrates and other eukaryotic species http://ensembl.org
Roadmap Epigenome browser: public resource of human epigenomic data http://www.epigenomebrowser.org http://genomebrowser.wustl.edu/ http://epigenomegateway.wustl.edu/
Circos: designed for visualizing genomic data in a circle http://circos.ca/
SeqMonk: a tool to visualise and analyse high throughput mapped sequence data http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/

Licence

References

https://github.com/genomicsclass

Visualizing next generation sequencing data

Fri, 26 Sep 2014 01:54:43 +0200

We will try four ways to look at NGS coverage: using the standalone Java program IGV, using simple plot commands, and using the Gviz and ggbio packages in Bioconductor.