Easy Guides

ExpressionSet and SummarizedExperiment

Tue, 16 Aug 2016 23:04:28 +0200

Introduction
ExpressionSet
Summarized Experiment
Licence
References

This analysis was performed using R (ver. 3.1.0).

Introduction

The ExpressionSet is generally used for array-based experiments, where the rows are features, and the SummarizedExperiment is generally used for sequencing-based experiments, where the rows are GenomicRanges. ExpressionSet is in the Biobase library.

There’s a library GEOquery which lets you pull down ExpressionSets by identifier.

library(Biobase) #Download data from GEO library(GEOquery) geoq <- getGEO("GSE9514") #The list has a single element #Save it to letter e for simplicity names(geoq)

## [1] "GSE9514_series_matrix.txt.gz"

e <- geoq[[1]]

ExpressionSet

ExpressionSets are basically matrices with a lot of metadata around them. Here, we have a matrix which is 9,000 by 8. It has phenotypic data, feature and annotation information. You can use the functions dim, ncol, nrow to have the dimension, the number of columns and rows respectively.

The matrix of the expression data, is stored in the exprs slot and you can access that with the exprs function. The phenotypic data, can be accessed using pData and this gives us a data frame, which is information about the samples, including the accession number, the submission date, etc). The feature data is accessible with fData function and this is information about genes or probe sets. The names of the data in the feature data are, for example, the gene title, the gene symbol, the ENREZ_Gene_ID or Gene Ontology information, which might be useful for downstream analysis.

#Number of dimension dim(e)

## Features Samples ## 9335 8

#The expression matrix exprs(e)[1:3,1:3]

## GSM241146 GSM241147 GSM241148 ## 10000_at 15.33 9.459 7.985 ## 10001_at 283.47 300.729 270.016 ## 10002_i_at 2569.45 2382.815 2711.814

#Phenotypic data: information about the samples pData(e)[1:3,1:6]

## title ## GSM241146 hem1 strain grown in YPD with 250 uM ALA (08-15-06_Philpott_YG_S98_1) ## GSM241147 WT strain grown in YPD under Hypoxia (08-15-06_Philpott_YG_S98_10) ## GSM241148 WT strain grown in YPD under Hypoxia (08-15-06_Philpott_YG_S98_11) ## geo_accession status submission_date ## GSM241146 GSM241146 Public on Nov 06 2007 Nov 02 2007 ## GSM241147 GSM241147 Public on Nov 06 2007 Nov 02 2007 ## GSM241148 GSM241148 Public on Nov 06 2007 Nov 02 2007 ## last_update_date type ## GSM241146 Aug 14 2011 RNA ## GSM241147 Aug 14 2011 RNA ## GSM241148 Aug 14 2011 RNA

dim(pData(e))

## [1] 8 31

#Column names of the phenotypic data names(pData(e))

## [1] "title" "geo_accession" ## [3] "status" "submission_date" ## [5] "last_update_date" "type" ## [7] "channel_count" "source_name_ch1" ## [9] "organism_ch1" "characteristics_ch1" ## [11] "molecule_ch1" "extract_protocol_ch1" ## [13] "label_ch1" "label_protocol_ch1" ## [15] "taxid_ch1" "hyb_protocol" ## [17] "scan_protocol" "description" ## [19] "data_processing" "platform_id" ## [21] "contact_name" "contact_email" ## [23] "contact_department" "contact_institute" ## [25] "contact_address" "contact_city" ## [27] "contact_state" "contact_zip/postal_code" ## [29] "contact_country" "supplementary_file" ## [31] "data_row_count"

#Feature data : information about genes or probe sets fData(e)[1:3,1:3]

## ID ORF SPOT_ID ## 10000_at 10000_at YLR331C  ## 10001_at 10001_at YLR332W  ## 10002_i_at 10002_i_at YLR333C

dim(fData(e))

## [1] 9335 17

names(fData(e))

## [1] "ID" "ORF" ## [3] "SPOT_ID" "Species Scientific Name" ## [5] "Annotation Date" "Sequence Type" ## [7] "Sequence Source" "Target Description" ## [9] "Representative Public ID" "Gene Title" ## [11] "Gene Symbol" "ENTREZ_GENE_ID" ## [13] "RefSeq Transcript ID" "SGD accession number" ## [15] "Gene Ontology Biological Process" "Gene Ontology Cellular Component" ## [17] "Gene Ontology Molecular Function"

head(fData(e)$"Gene Symbol")

## [1] JIP3 MID2 RPS25B NUP2 ## 4869 Levels: ACO1 ARV1 ATP14 BOP2 CDA1 CDA2 CDC25 CDC3 CDD1 CTS1 ... Il4

head(rownames(e))

## [1] "10000_at" "10001_at" "10002_i_at" "10003_f_at" "10004_at" ## [6] "10005_at"

#experiment data: experimenter name, laboratory, contact, abstract. #It's empty sometimes experimentData(e)

## Experiment data ## Experimenter name: ## Laboratory: ## Contact information: ## Title: ## URL: ## PMIDs: ## No abstract available.

#Annotation plateform annotation(e)

## [1] "GPL90"

Summarized Experiment

I’m going to load a bioconductor annotation package, which is the parathyroid SummarizedExperiment library. The loaded data is a SummarizedExperiment, which summarizes counts of RNA sequencing reads in genes for an experiment on human cell culture. The SummarizedExperiment object has 63,000 rows, which are genes, and 27 columns, which are samples, and the matrix, in this case, is called counts. And we have the row names, which are ensemble genes, and metadata about the row data, and metadata about the column data.

library(parathyroidSE) #RNA sequencing reads data(parathyroidGenesSE) se <- parathyroidGenesSE se

## class: SummarizedExperiment ## dim: 63193 27 ## exptData(1): MIAME ## assays(1): counts ## rownames(63193): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99 ## rowData metadata column names(0): ## colnames: NULL ## colData names(8): run experiment ... study sample

assay function can be used get access to the counts of RNA sequencing reads. colData function , the column data, is equivalent to the pData on the ExpressionSet. Each row in this data frame corresponds to a column in the SummarizedExperiment. We can see that there are indeed 27 rows here, which give information about the columns. Each sample in this case is treated with two treatments or control and we can see the number of replicates for each, using the as.numeric function again.

#Dimension of the SummarizedExperiment dim(se)

## [1] 63193 27

#Get access to the counts of RNA sequencing reads, using assay function. assay(se)[1:3,1:3]

## [,1] [,2] [,3] ## ENSG00000000003 792 1064 444 ## ENSG00000000005 4 1 2 ## ENSG00000000419 294 282 164

#Dimensions of this assay is a matrix, which has the same dimensions as the SummarizedExperiment. dim(assay(se))

## [1] 63193 27

#Get information about samples colData(se)[1:3,1:6]

## DataFrame with 3 rows and 6 columns ## run experiment patient treatment time submission ##       ## 1 SRR479052 SRX140503 1 Control 24h SRA051611 ## 2 SRR479053 SRX140504 1 Control 48h SRA051611 ## 3 SRR479054 SRX140505 1 DPN 24h SRA051611

#dimension of column data dim(colData(se))

## [1] 27 8

#characteristics of the samples names(colData(se))

## [1] "run" "experiment" "patient" "treatment" "time" ## [6] "submission" "study" "sample"

#Get access to treatment column of sample characteristics colData(se)$treatment

## [1] Control Control DPN DPN OHT OHT Control Control ## [9] DPN DPN DPN OHT OHT OHT Control Control ## [17] DPN DPN OHT OHT Control DPN DPN DPN ## [25] OHT OHT OHT ## Levels: Control DPN OHT

The rows in this case correspond to genes. Genes are collections of exons. The rows of the SummarizedExperiment is a GRangesList where each row corresponds to a GRanges which contains the exons, which were used to count the RNA sequencing reads. Some metadata are included in the row data and is accessible with the metadata function. This information tells us, how this GRangesList was constructed. if it was constructed from the genomic features package using a transcript database. Homo sapiens was the organism, and the database was ENSEMBL GENES number 72, and etc. In addition, there’s some more information under experiment data, using exptData and then specifying the MIAME, which is minimal information about a microarray experiment, Although we’re not using microarrays, we’ve still used the same slots to describe extra information about this object.

#Extract out a single GRanges object: 17 ranges and 2 metada columns which is #the ensembl id for the exon. rowData(se)[1]

## GRangesList of length 1: ## $ENSG00000000003 ## GRanges with 17 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_name ##    |   ## [1] X [99883667, 99884983] - | 664095 ENSE00001459322 ## [2] X [99885756, 99885863] - | 664096 ENSE00000868868 ## [3] X [99887482, 99887565] - | 664097 ENSE00000401072 ## [4] X [99887538, 99887565] - | 664098 ENSE00001849132 ## [5] X [99888402, 99888536] - | 664099 ENSE00003554016 ## ... ... ... ... ... ... ... ## [13] X [99890555, 99890743] - | 664106 ENSE00003512331 ## [14] X [99891188, 99891686] - | 664108 ENSE00001886883 ## [15] X [99891605, 99891803] - | 664109 ENSE00001855382 ## [16] X [99891790, 99892101] - | 664110 ENSE00001863395 ## [17] X [99894942, 99894988] - | 664111 ENSE00001828996 ## ## --- ## seqlengths: ## 1 2 ... LRG_99 ## 249250621 243199373 ... 13294

#rowData is indeed GRangesList class(rowData(se))

## [1] "GRangesList" ## attr(,"package") ## [1] "GenomicRanges"

#length gives the number of genes length(rowData(se))

## [1] 63193

#length of the first GRanges : gives the number of exons for the first gene. length(rowData(se)[[1]])

## [1] 17

head(rownames(se))

## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ## [5] "ENSG00000000460" "ENSG00000000938"

metadata(rowData(se))

## $genomeInfo ## $genomeInfo$`Db type` ## [1] "TranscriptDb" ## ## $genomeInfo$`Supporting package` ## [1] "GenomicFeatures" ## ## $genomeInfo$`Data source` ## [1] "BioMart" ## ## $genomeInfo$Organism ## [1] "Homo sapiens" ## ## $genomeInfo$`Resource URL` ## [1] "www.biomart.org:80" ## ## $genomeInfo$`BioMart database` ## [1] "ensembl" ## ## $genomeInfo$`BioMart database version` ## [1] "ENSEMBL GENES 72 (SANGER UK)" ## ## $genomeInfo$`BioMart dataset` ## [1] "hsapiens_gene_ensembl" ## ## $genomeInfo$`BioMart dataset description` ## [1] "Homo sapiens genes (GRCh37.p11)" ## ## $genomeInfo$`BioMart dataset version` ## [1] "GRCh37.p11" ## ## $genomeInfo$`Full dataset` ## [1] "yes" ## ## $genomeInfo$`miRBase build ID` ## [1] NA ## ## $genomeInfo$transcript_nrow ## [1] "213140" ## ## $genomeInfo$exon_nrow ## [1] "737783" ## ## $genomeInfo$cds_nrow ## [1] "531154" ## ## $genomeInfo$`Db created by` ## [1] "GenomicFeatures package from Bioconductor" ## ## $genomeInfo$`Creation time` ## [1] "2013-07-30 17:30:25 +0200 (Tue, 30 Jul 2013)" ## ## $genomeInfo$`GenomicFeatures version at creation time` ## [1] "1.13.21" ## ## $genomeInfo$`RSQLite version at creation time` ## [1] "0.11.4" ## ## $genomeInfo$DBSCHEMAVERSION ## [1] "1.0"

exptData(se)$MIAME

## Experiment data ## Experimenter name: Felix Haglund ## Laboratory: Science for Life Laboratory Stockholm ## Contact information: Mikael Huss ## Title: DPN and Tamoxifen treatments of parathyroid adenoma cells ## URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211 ## PMIDs: 23024189 ## ## Abstract: A 251 word abstract is available. Use 'abstract' method.

abstract(exptData(se)$MIAME)

## [1] "Primary hyperparathyroidism (PHPT) is most frequently present in postmenopausal women. Although the involvement of estrogen has been suggested, current literature indicates that parathyroid tumors are estrogen receptor (ER) alpha negative. Objective: The aim of the study was to evaluate the expression of ERs and their putative function in parathyroid tumors. Design: A panel of 37 parathyroid tumors was analyzed for expression and promoter methylation of the ESR1 and ESR2 genes as well as expression of the ERalpha and ERbeta1/ERbeta2 proteins. Transcriptome changes in primary cultures of parathyroid adenoma cells after treatment with the selective ERbeta1 agonist diarylpropionitrile (DPN) and 4-hydroxytamoxifen were identified using next-generation RNA sequencing. Results: Immunohistochemistry revealed very low expression of ERalpha, whereas all informative tumors expressed ERbeta1 (n = 35) and ERbeta2 (n = 34). Decreased nuclear staining intensity and mosaic pattern of positive and negative nuclei of ERbeta1 were significantly associated with larger tumor size. Tumor ESR2 levels were significantly higher in female vs. male cases. In cultured cells, significantly increased numbers of genes with modified expression were detected after 48 h, compared to 24-h treatments with DPN or 4-hydroxytamoxifen, including the parathyroid-related genes CASR, VDR, JUN, CALR, and ORAI2. Bioinformatic analysis of transcriptome changes after DPN treatment revealed significant enrichment in gene sets coupled to ER activation, and a highly significant similarity to tumor cells undergoing apoptosis. Conclusions: Parathyroid tumors express ERbeta1 and ERbeta2. Transcriptional changes after ERbeta1 activation and correlation to clinical features point to a role of estrogen signaling in parathyroid function and disease."

Licence

References

https://github.com/genomicsclass

GRanges and GRangesList

Thu, 25 Sep 2014 00:32:56 +0200

GRanges
GRangesList
findOverlaps and %over%
Rle and Views
Licence
References

This analysis was performed using R (ver. 3.1.0).

GRanges

The GenomicRanges package, which is an extension of IRanges to the genomic space. GRanges object contains the sequence names (here chromosome Z), the Iranges, the strand information and the sequence lengths. If we print out the GRanges object, we see that we have two ranges, zero metadata columns. And it gives the sequence names as an rle which we’ll discuss later. It gives the IRanges, and the strand also as an rle, and the bottom it prints the sequence lengths. We’ve specified that chromosome z is 100 base pairs long.

library(GenomicRanges)
gr <- GRanges(seqnames="chrZ", ranges=IRanges(start=c(5,10),end=c(35,45)),
              strand="+", seqlengths=c(chrZ=100L))
gr

## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chrZ  [ 5, 35]      +
##   [2]     chrZ  [10, 45]      +
##   ---
##   seqlengths:
##    chrZ
##     100

Like with IRanges, we can shift the GRanges, and it will move the starts and ends by 10 base pairs to the right. We can also shift by 80. But notice that if we shift by 80, these will go off the end of the chromosome. GenomicRanges package gives us an error that says that the ranges contain values outside of the sequence bounds. If we wrap this in a trim function it will make sure that these end at the chromosome end and then don’t go over it.

shift(gr, 10)

## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chrZ  [15, 45]      +
##   [2]     chrZ  [20, 55]      +
##   ---
##   seqlengths:
##    chrZ
##     100

shift(gr, 80)

## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chrZ [85, 115]      +
##   [2]     chrZ [90, 125]      +
##   ---
##   seqlengths:
##    chrZ
##     100

trim(shift(gr, 80))

## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chrZ [85, 100]      +
##   [2]     chrZ [90, 100]      +
##   ---
##   seqlengths:
##    chrZ
##     100

The metadata columns we mentioned previously, can be accessed by using the function mcols for metadata columns. Before we have zero columns here. We can add columns by using mcols plus the dollar sign. Now we have an additional column which is a numeric and has two values.

mcols(gr)

## DataFrame with 2 rows and 0 columns

mcols(gr)$value <- c(-1,4)
gr

## GRanges with 2 ranges and 1 metadata column:
##       seqnames    ranges strand |     value
##              | 
##   [1]     chrZ  [ 5, 35]      + |        -1
##   [2]     chrZ  [10, 45]      + |         4
##   ---
##   seqlengths:
##    chrZ
##     100

GRangesList

There’s an additional class in the GRanges package, which is called GRangesList. GRangesList is an object which groups GRanges together. The most obvious example of a GRangesList would be grouping exons by gene, or grouping exons by transcript.

#Create a second GRanges
gr2 <- GRanges("chrZ",IRanges(11:13,51:53))
mcols(gr)$value <- NULL
#Create GRangesList : This object contains two GRanges
grl <- GRangesList(gr,gr2)
grl

## GRangesList of length 2:
## [[1]] 
## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chrZ  [ 5, 35]      +
##   [2]     chrZ  [10, 45]      +
## 
## [[2]] 
## GRanges with 3 ranges and 0 metadata columns:
##       seqnames   ranges strand
##   [1]     chrZ [11, 51]      *
##   [2]     chrZ [12, 52]      *
##   [3]     chrZ [13, 53]      *
## 
## ---
## seqlengths:
##  chrZ
##   100

#It's a GRanges list of length two, 
#where the first GRanges has two ranges and the second GRanges has three ranges.
length(grl)

## [1] 2

#Ask for the first element
grl[[1]]

## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chrZ  [ 5, 35]      +
##   [2]     chrZ  [10, 45]      +
##   ---
##   seqlengths:
##    chrZ
##     100

#If you specify metadata columns to the GRanges list,
#these will be assigned to each GRanges object in the list.
mcols(grl)$value <- c(5,7)
grl

## GRangesList of length 2:
## [[1]] 
## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chrZ  [ 5, 35]      +
##   [2]     chrZ  [10, 45]      +
## 
## [[2]] 
## GRanges with 3 ranges and 0 metadata columns:
##       seqnames   ranges strand
##   [1]     chrZ [11, 51]      *
##   [2]     chrZ [12, 52]      *
##   [3]     chrZ [13, 53]      *
## 
## ---
## seqlengths:
##  chrZ
##   100

mcols(grl)

## DataFrame with 2 rows and 1 column
##       value
##   
## 1         5
## 2         7

findOverlaps and %over%

Once we’ve created sets of GRanges or GRangesList objects, one common thing we might need to do is to find overlaps between objects. Let’s create two GRanges objects. The first one, will have five ranges. So 1 to 5, 11 to 15, 21 to 2. And the second object will have two ranges. Both GRanges objects are on the same sequence, chromosome z. We’ll use the findOverlaps function to find the overlaps. The first two arguments of this function, query and subject, are the most important. If you want to count overlaps, you can use the countOverlaps function.

#Creating two GRanges objects
gr1 <- GRanges("chrZ",IRanges(c(1,11,21,31,41),width=5))
gr2 <- GRanges("chrZ",IRanges(c(19,33),c(38,35)))
gr1

## GRanges with 5 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chrZ  [ 1,  5]      *
##   [2]     chrZ  [11, 15]      *
##   [3]     chrZ  [21, 25]      *
##   [4]     chrZ  [31, 35]      *
##   [5]     chrZ  [41, 45]      *
##   ---
##   seqlengths:
##    chrZ
##      NA

gr2

## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chrZ  [19, 38]      *
##   [2]     chrZ  [33, 35]      *
##   ---
##   seqlengths:
##    chrZ
##      NA

#Find overlaps
fo <- findOverlaps(query=gr1, subject=gr2)
fo

## Hits of length 3
## queryLength: 5
## subjectLength: 2
##   queryHits subjectHits 
##        
##  1         3           1 
##  2         4           1 
##  3         4           2

The output of the findOverlaps function is a hits object, which has length three, and this gives us the three different overlaps which occurred. The table here tells us that the third element of the query (gr1) intersected with the first element of the subject (gr2). These are given as integer vectors.

Another way to get the overlaps is to use the over function, %over%. It gives a logical vector. For the five ranges in gr1, it gives a logical vector describing which of these have any overlap with the ranges in the second, so gr2. If we use that as a subset, so a logical sub-setting, we returned those ranges in gr1, which had some overlap with gr2.

queryHits(fo)

## [1] 3 4 4

subjectHits(fo)

## [1] 1 1 2

gr1 %over% gr2

## [1] FALSE FALSE  TRUE  TRUE FALSE

gr1[gr1 %over% gr2]

## GRanges with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##             
##   [1]     chrZ  [21, 25]      *
##   [2]     chrZ  [31, 35]      *
##   ---
##   seqlengths:
##    chrZ
##      NA

Rle and Views

Rle is an object which is defined by IRanges. But also there’s a similar object in base r which is a run length encoding. The meaning of this is that if you have a vector, which repeats certain values, you can save memory. By instead of storing each value, you save the number, and then the number of repeats.

If we have such an rle object, and we want to peer into it in different regions, we can construct a views object. Views is a virtual class, which contains the subject, and then a number of views, which are essentially IRanges into that object. You can also use the views constructor for FASTA files, for example, if you want to look into genome sequence or other objects.

r <- Rle(c(1,1,1,0,0,-2,-2,-2,rep(-1,20)))
r

## numeric-Rle of length 28 with 4 runs
##   Lengths:  3  2  3 20
##   Values :  1  0 -2 -1

#Structure of r
str(r)

## Formal class 'Rle' [package "IRanges"] with 4 slots
##   ..@ values         : num [1:4] 1 0 -2 -1
##   ..@ lengths        : int [1:4] 3 2 3 20
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

as.numeric(r)

##  [1]  1  1  1  0  0 -2 -2 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
## [24] -1 -1 -1 -1 -1

#creating 2 views of the rle
Views(r, start=c(4,2), end=c(7,6))

## Views on a 28-length Rle subject
## 
## views:
##     start end width
## [1]     4   7     4 [ 0  0 -2 -2]
## [2]     2   6     5 [ 1  1  0  0 -2]

Licence

References

https://github.com/genomicsclass

Installing from github.com

Thu, 25 Sep 2014 00:29:38 +0200

Installing from github.com
Licence
References

This analysis was performed using R (ver. 3.1.0).

Installing from github.com

#install.packages("devtools")
#library("devtools")
#pack=rafalib, ririzarr=user
#install_github("rafalib", "ririzarr")
library(rafalib)
mypar

## function (a = 1, b = 1, brewer.n = 8, brewer.name = "Dark2", 
##     cex.lab = 2, cex.main = 2, cex.axis = 1.5, mar = c(5.1, 5.1, 
##         3.5, 2.1), mgp = c(3, 1, 0), ...) 
## {
##     require(RColorBrewer)
##     par(mar = mar, mgp = mgp, cex.lab = cex.lab, cex.main = cex.main, 
##         cex.axis = cex.axis)
##     par(mfrow = c(a, b), ...)
##     palette(brewer.pal(brewer.n, brewer.name))
## }
##

shist(rnorm(100))

Licence

References

https://github.com/genomicsclass

IRanges

Thu, 25 Sep 2014 00:25:46 +0200

IRanges
Licence
References

This analysis was performed using R (ver. 3.1.0).

IRanges

IRanges is a library for representing ranges of integers, which is useful in genomics, because we have base pair ranges that we'd like to manipulate.There's a very detailed vignette on IRanges. IRanges have a start, an end and a width.

library(IRanges)

#IR with 6 base-pairs long : start and end are indicated
ir <- IRanges(5,10) 
ir

## IRanges of length 1
##     start end width
## [1]     5  10     6

start(ir) #Return the start value

## [1] 5

end(ir) #Return the end value

## [1] 10

width(ir) #return the width

## [1] 6

# ?IRanges

If you specify that the start is 5, and the width should be 6, you'll get the identical IRange.

#IR with 6 base-pairs long : start and width are indicated
ir <- IRanges(5,width=6)
ir

## IRanges of length 1
##     start end width
## [1]     5  10     6

You can also specify more than one range at a time. If you give three starts and three ends, you get a IRanges object of length 3. If you, again, ask for the starts, you'll get the three different starts, 3, 5, and 17.

# Miultiple ranges
ir <- IRanges(start=c(3,5,17), end=c(10,8,20))
ir

## IRanges of length 3
##     start end width
## [1]     3  10     8
## [2]     5   8     4
## [3]    17  20     4

length(ir)

## [1] 3

start(ir)

## [1]  3  5 17

There are a number of intra range methods for IRanges. And intra-range means that the operation will occur for each range that you have. An example of this is to shift the IRange to the left by two. Before we had an IRange that started at 5 and ended at 10. Applying the shift operation produces an IRange which starts at 3 and ends at 8.

# ?"intra-range-methods"
ir <- IRanges(5,10)
shift(ir, -2)

## IRanges of length 1
##     start end width
## [1]     3   8     6

Remember, all of these commands can work on more than one range at once.

Here we show the effects of the different methods using a single range:

#Shift the IRange to the left by 2 base-pairs
shift(ir,-2)

## IRanges of length 1
##     start end width
## [1]     3   8     6

#narrow : Relative to the start, you should start this range at the second base pair.
narrow(ir, start=2)

## IRanges of length 1
##     start end width
## [1]     6  10     5

#narrow: Relative to 5, you should end on the fifth base pair, which means it should end at 9.
narrow(ir, end=5)

## IRanges of length 1
##     start end width
## [1]     5   9     5

#flank allows you to get flanking sequence
#here three base pairs from the start.
flank(ir, width=3, start=TRUE, both=FALSE)

## IRanges of length 1
##     start end width
## [1]     2   4     3

#flank: three base pairs from the end by specifying start equals false
flank(ir, width=3, start=FALSE, both=FALSE)

## IRanges of length 1
##     start end width
## [1]    11  13     3

#bidirectional flanking sequence from the start by specifying both equals true.
flank(ir, width=3, start=TRUE, both=TRUE)

## IRanges of length 1
##     start end width
## [1]     2   7     6

ir * 2

## IRanges of length 1
##     start end width
## [1]     6   8     3

ir + 2

## IRanges of length 1
##     start end width
## [1]     3  12    10

ir - 2

## IRanges of length 1
##     start end width
## [1]     7   8     2

The exact same functions are shown graphically with the code below:

# set up a plotting window so we can look at range operations
plotir <- function(ir,i) { arrows(start(ir)-.5,i,end(ir)+.5,i,code=3,angle=90,lwd=3) }
plot(0,0,xlim=c(0,15),ylim=c(0,11),type="n",xlab="",ylab="",xaxt="n")
axis(1,0:15)
abline(v=0:30 + .5,col=rgb(0,0,0,.5))

# plot the original IRange
plotir(ir,1)

# draw a red shadow for the original IRange
polygon(c(start(ir)-.5,start(ir)-.5,end(ir)+.5,end(ir)+.5),c(-1,12,12,-1),col=rgb(1,0,0,.2),border=NA)
plotir(shift(ir,-2), 2)
plotir(narrow(ir, start=2), 3)
plotir(narrow(ir, end=5), 4)
plotir(flank(ir, width=3, start=TRUE, both=FALSE), 5)
plotir(flank(ir, width=3, start=FALSE, both=FALSE), 6)
plotir(flank(ir, width=3, start=TRUE, both=TRUE), 7)
plotir(ir * 2, 8)
plotir(ir + 2, 9)
plotir(ir - 2, 10)

The inter-range methods are those functions which depend on the other ranges in the object. Let's create an IRanges object with three ranges, which starts at 3, 5, 17, ends at 10, 8, and 20.

# ?"inter-range-methods"
ir <- IRanges(start=c(3,5,17), end=c(10,8,20))
#range: returns the beginning of the IRanges to the end, including gaps in between.
range(ir)

## IRanges of length 1
##     start end width
## [1]     3  20    18

#reduce, returns those base pairs which are covered by the original ranges.
#So we do not get the gap, so the end at 10 and the beginning at 17.
reduce(ir)

## IRanges of length 2
##     start end width
## [1]     3  10     8
## [2]    17  20     4

#gaps: returns a gap from 10 to 17.
gaps(ir)

## IRanges of length 1
##     start end width
## [1]    11  16     6

#disjoint: gives a set of ranges which has the same coverage as the original IRanges object,
#but they're not overlapping in any way, and they also contain the union of all the endpoints of the original range.
disjoin(ir)

## IRanges of length 4
##     start end width
## [1]     3   4     2
## [2]     5   8     4
## [3]     9  10     2
## [4]    17  20     4

Licence

References

https://github.com/genomicsclass

Basic Bioconductor infrastructure

Wed, 24 Sep 2014 23:39:29 +0200

-=-