<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Thu, 07 May 2026 02:53:12 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Easy Guides]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/wiki/11" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles of the category: R basics and exploratory data analysis]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[Exploratory data analysis]]></title>
			<link>https://www.sthda.com/english/wiki/exploratory-data-analysis</link>
			<guid>https://www.sthda.com/english/wiki/exploratory-data-analysis</guid>
			<description><![CDATA[<!-- START HTML -->


  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">


<div id="TOC">
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#histograms">Histograms</a></li>
<li><a href="#normal-approximation">Normal approximation</a></li>
<li><a href="#qq-plot">QQ-plot</a></li>
<li><a href="#box-plot">Box-plot</a></li>
<li><a href="#scatter-plots-and-correlation">Scatter plots and correlation</a></li>
<li><a href="#spearmans-correlation">Spearman’s correlation</a></li>
<li><a href="#median-and-mad">Median and MAD</a></li>
<li><a href="#log-transformation">Log transformation</a></li>
<li><a href="#normal-approximation-and-standard-unit">Normal approximation and standard unit</a></li>
<li><a href="#licence">Licence</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>

<hr />
<pre class="warning"><code>This analysis was performed using R (ver. 3.1.0).</code></pre>
<div id="introduction" class="section level2">
<h2>Introduction</h2>
<p>The aim of EDA (exploratory data analysis) is to familiarize ourselves with data when we analyze genomic data. Some basic EDA tools include histogram, the Q-Q plot, scatter plot, box plot, stratification, log transformation and other summary statistics.</p>
</div>
<div id="histograms" class="section level2">
<h2>Histograms</h2>
<p>Firstly, load the R library UsingR which contains several examples including one data set called <strong>father.son</strong> that has heights for fathers and sons.</p>
<pre class="r"><code>library(UsingR)
x=father.son$fheight</code></pre>
<p><strong>Histogram</strong> is a visual description of the distribution of data.</p>
<pre class="r"><code>hist(x,breaks=seq(floor(min(x)),ceiling(max(x))),main="", xlab="Height")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-histogram.png" title="plot of chunk histogram" alt="plot of chunk histogram" width="384" /></p>
<p>Notice that breaks are the the intervals, in which to make count of individuals. We decide to look at the following breaks: 59 to 60, 60 to 61, etc. Note that with this simple plot we can approximate the number of individuals in any given interval. Y axis (Frequency) is the number of individuals in a given interval. For example, there are about 70 individuals over six feet (72 inches) tall.</p>
<p>Another plot that’s related to the histogram is the <strong>empirical commutative distribution</strong> function (<strong>ecdf</strong>). It reports for any given number the percent of individuals that are below that threshold.</p>
<pre class="r"><code>#The number for which, we want to compute proportions
xs<-seq(floor(min(x)),ceiling(max(x)),0.1)
plot(xs,ecdf(x)(xs),type="l",xlab="x=Height",ylab="F(x)")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-ecdf.png" title="empirical commutative distribution" alt="empirical commutative distribution" width="288" /></p>
<p><code>xs</code> is the numbers for which, we want to compute these proportions for. The function ecdf of x creates a function that then you can feed these values in, and it’ll give you back these proportions. We can make a plot of that.</p>
<pre class="success"><code>For any value, say, 70, you can see that about 70% to 75% of our individuals are shorter than 70 inches.</code></pre>
</div>
<div id="normal-approximation" class="section level2">
<h2>Normal approximation</h2>
<p>The probability distribution we see above approximates one that is very common in a nature: the bell curve or normal distribution or <strong>Gaussian distribution</strong>. When the histogram of a list of numbers approximates the normal distribution we can use a convenient mathematical formula to approximate the proportion of individuals in any given interval</p>
<p><span class="math">\[
\mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( \frac{-(x-\mu)^2}{2 \sigma^2} \right)} \, dx
\]</span></p>
<p>Here <span class="math">\(\mu\)</span> and <span class="math">\(\sigma\)</span> are refereed to as the mean and standard deviation. So, in the case of normal distribution, we just have to know the mean <span class="math">\(\mu\)</span> and the standard deviation <span class="math">\(\sigma\)</span> of the data to describe the entire distribution.</p>
<p>For example, the proportion of individuals that are taller than 70 can be computed using the following code :</p>
<pre class="r"><code>1-pnorm(70, mean(x), sd(x)) #right part of distibution</code></pre>
<pre><code>## [1] 0.1997</code></pre>
<p>The proportion of individuals with height less than 70 is given by :</p>
<pre class="r"><code>pnorm(70, mean(x), sd(x)) #left part of the distribution</code></pre>
<pre><code>## [1] 0.8003</code></pre>
<p>To check whether a data follows the normal approximation, we can use <strong>Q-Q plots</strong> (quantile-quantile plot).</p>
</div>
<div id="qq-plot" class="section level2">
<h2>QQ-plot</h2>
<p>As mentioned above, quantile-quantile plots is used to check whether a data follows normal distribution.</p>
<p>The procedures are to :</p>
<ol style="list-style-type: decimal">
<li>Calculate sample percentiles (1%, 2%, ….., 50%, 99%). So we compute, what is the number for which 1% of the data is below that number? 2%, 3%, 4%, and we compute what are called the percentiles.</li>
<li>Calculate the theoretical percentiles predicted by the normal distribution</li>
<li>Make a plot to see whether they fit</li>
</ol>
<pre class="r"><code>#percentile
ps <- seq(0.01,0.99,0.01)
qs <- quantile(x,ps)
#theorical percentiles
normalqs <- qnorm(ps,mean(x),sd(x))
#Plot
plot(normalqs,qs,xlab="Normal percentiles",ylab="Height percentiles")
abline(0,1) ##identity line</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-qq-plot.png" title="quantile-quantile plots" alt="quantile-quantile plots" width="288" /></p>
<pre class="warning"><code>Note how close these values are. </code></pre>
<p>Also note that we can do same with less code</p>
<pre class="r"><code>qqnorm(x)
qqline(x)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-easy-qq-plot.png" title="quantile-quantile plots" alt="quantile-quantile plots" width="288" /></p>
</div>
<div id="box-plot" class="section level2">
<h2>Box-plot</h2>
<p>Data is not always normally distributed. An example of data that is not normally distributed are salaries. In these cases the mean and the standard deviation are not good summaries.</p>
<pre class="r"><code>#salaries
hist(exec.pay)
qqnorm(exec.pay)
qqline(exec.pay)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-non-normal-distribution1.png" title="plot of chunk non-normal-distribution" alt="plot of chunk non-normal-distribution" width="288" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-non-normal-distribution2.png" title="plot of chunk non-normal-distribution" alt="plot of chunk non-normal-distribution" width="288" /></p>
<p>If data is not normal distributed, box plot is better to summarize the data. A practical summary is to compute 3 percentiles: 25-th, 50-th (the median) and the 75-th. A box plots shows these 3 values along with a range calculated as median <span class="math">\(\pm\)</span> 1.5 75-th percentiles - 25th-percentile. Values outside this range are shown as points.</p>
<pre class="r"><code>boxplot(exec.pay,ylab="10,000s of dollars",ylim=c(0,400))</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-boxplot.png" title="box plot" alt="box plot" width="288" /></p>
</div>
<div id="scatter-plots-and-correlation" class="section level2">
<h2>Scatter plots and correlation</h2>
<p>Scatter plots are used to study the relationship between two or more variables. A classic examples is the father/son height data.</p>
<pre class="r"><code>data("father.son")
x=father.son$fheight
y=father.son$sheight
plot(x,y,xlab="Father&amp;#39;s height in inches",
     ylab="Son&amp;#39;s height in inches",
     main=paste("correlation =",signif(cor(x,y),2)))</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-scatter-plot.png" title="scatter plot" alt="scatter plot" width="288" /></p>
<p>The scatter plot shows a general trend : the taller the father the taller to son. A summary of this trend is the <strong>correlation coefficient</strong> which in this cases is 0.5. Correlation helps to make prediction by computing regression line.</p>
<p>When two variables follow a bivariate normal distribution then for any given value of x we predict the value of y with :</p>
<p><span class="math">\[
\frac{Y - \mu_Y}{\sigma_Y} = r \frac{X-\mu_X}{\sigma_X}
\]</span></p>
<p>with the <span class="math">\(\mu\)</span> representing the averages, <span class="math">\(\sigma\)</span> the standard deviations, and <span class="math">\(r\)</span> the correlation.</p>
<pre class="success"><code>If you turn the two data in standard units, the correlation coefficient is the slope of the line that we used to predict variable y given variable x.</code></pre>
<pre class="warning"><code>We have to be careful with **non-normal data**. We can use Spearman&amp;#39;s correlation to obtain a robust estimate of the correlation in this case.</code></pre>
</div>
<div id="spearmans-correlation" class="section level2">
<h2>Spearman’s correlation</h2>
<p>In the case of <strong>non-normal data</strong>, the average and standard deviation are not good summaries. Examples include cases in which one variable is related to another by a parabolic function. Another, more common example are caused by outliers or extreme values.</p>
<pre class="r"><code>a=rnorm(100);a[1]=10
b=rnorm(100);b[1]=11
plot(a,b,main=paste("correlation =",signif(cor(a,b),2)))</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-scatter-plot-non-normal-data.png" title="Scatter plot with non normal data" alt="Scatter plot with non normal data" width="288" /></p>
<pre class="warning"><code>In the example above the data are not associated but for one pair both values are very large. The correlation here is about 0.5. This is driven by just that one point. An alternative summary for cases with outliers or extreme values is **Spearman&amp;#39;s correlation** which is based on ranks instead of the values themselves. </code></pre>
<p><strong>Spearman correlation</strong> is robust to outliers. For this example, I created two independent data sets, both with mean 0 and standard deviation 1, independent from each other. So, there is no correlation between the two data sets. And then I made a mistake in one row where both measurements became big.</p>
<pre class="r"><code>x=rnorm(100)
y=rnorm(100)
plot(x, y)
#Make a mistake on row 10
x[10]=100
y[10]=100
plot(x, y)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-spearman-correlation1.png" title="Spearman correlation" alt="Spearman correlation" width="288" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-spearman-correlation2.png" title="Spearman correlation" alt="Spearman correlation" width="288" /></p>
<p>The computed correlation is `r cor(x, y) which is so big because of the extreme point driving the correlation to be high. <strong>Spearman correlation is a very simple idea</strong>. Instead of looking at the values, we look at the ranks. Instead of plotting the values, I’m plotting the ranks.</p>
<pre class="r"><code>rankX=match(x, sort(x))
rankY=match(y, sort(y))
#plot of the ranks
plot(rankX, rankY)
#correlation
cor(rankX, rankY)</code></pre>
<pre><code>## [1] 0.07182</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-spearman-correlation2_.png" title="Spearman correlation" alt="Spearman correlation" width="288" /></p>
<p>The correlation computed on the ranks is just 0.0718, much closer to the correct answer, which is 0 here. That’s <strong>Spearman correlation</strong>. When we say robust, we mean it doesn’t affect the measurement so much.</p>
<p><strong>An easy way to run spearman rank correlation is :</strong></p>
<pre class="r"><code>#possible values for method : "pearson", "kendall", "spearman"
#see ?cor
cor(x, y, method="spearman")</code></pre>
<pre><code>## [1] 0.07182</code></pre>
</div>
<div id="median-and-mad" class="section level2">
<h2>Median and MAD</h2>
<p><strong>Median and MAD</strong> are another robust summary statistics. Let’s start with a very simple illustration to understand this concept. I randomly generated 100 data points from a normal distribution with mean 0 and standard deviation 1. And then I made a mistake on one of the measurements and it turned into 100.</p>
<pre class="r"><code>x<-rnorm(100)
x[10]=100
boxplot(x)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/genomics/exploratory-data-analysis-median-mad.png" title="Median and MAD" alt="Median and MAD" width="288" /></p>
<p>So you can see on the boxplot that one point is extremely high compared to the others and way up at 100. Points like this are called outliers in statistics and are common in high throughput measurement. This one point is going to have a big influence on the sample mean and the standard deviation. The new mean of the modified x values is 0.9323 and the new standard deviation (SD) is 10.0543 which are very different to the original mean and SD.</p>
<p>One summary statistic that isn’t sensitive to outliers is the <strong>median</strong>. The median is simply the middle point of the data. In this case, the median is -0.0207 which are pretty close to 0 and the MAD, which I’m going to describe next is 1.0309, which is pretty close to 1.</p>
<pre class="r"><code>median(x)</code></pre>
<pre><code>## [1] -0.02073</code></pre>
<pre class="r"><code>mad(x)</code></pre>
<pre><code>## [1] 1.031</code></pre>
<p><strong>What is the MAD?</strong> The MAD (median absolute deviation) is a robust estimate of the standard deviation. MAD is computed as the following formula:</p>
<p><span class="math">\[
MAD = 1.4826 * median\{|X_i - median(X_i)|\}
\]</span></p>
<p>We compute the median of our samples and then we compute the distance of each point to the median as the absolute value of the difference. Then we take the median of those deviations. that’s where the name comes <strong>median absolute deviation, MAD</strong>. This value is multiplied by the factor 1.4826, to make the summary statistic unbiased. On average, it’s going to be equal to the standard deviation.</p>
<pre class="warning"><code>So why do we ever use the average, the standard deviation, and Pearson&amp;#39;s correlation? In genomics data where outliers are common, the median, the MAD, and other robust statistics are usually preferable. However, there are cases where using robust statistics actually makes our analysis less sensitive to finding real differences.</code></pre>
</div>
<div id="log-transformation" class="section level2">
<h2>Log transformation</h2>
<p>When look at microarray or sequencing data, we can compute the mean across samples and the variance across samples. We usually see a relationship between the average and the standard deviation. The higher the average, the higher the standard deviation. So, For highly expressed gene, we have high standard deviations. This can be easily seen by making the log-log plot of the SD according to the average.</p>
<p>A problem is that simple summary statistics like the average are going to be driven by the variance of these very, very highly expressed genes. One simple approach to solve this problem is to take the log of the data for computing the average and the SD across samples. After log transformation, mean-variance relationship is much more flat, which is more desirable. Now we have high expressed genes having about the same variance as most genes.</p>
<pre class="success"><code>Log transormation helps to stabilize the variance/mean relationship. Another reason to use logs, is the symetry of log ratios. For example, 2-fold gene overexpression and 2-fold gene underexpression are represented by 1 and -1 in log2 scale in contrast of 2 and 0.5 in linear scale.</code></pre>
</div>
<div id="normal-approximation-and-standard-unit" class="section level2">
<h2>Normal approximation and standard unit</h2>
<p>Normal distribution is also called the bell curve or Gaussian distribution. the formula is shown below:</p>
<p><span class="math">\[
\mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( \frac{-(x-\mu)^2}{2 \sigma^2} \right)} \, dx
\]</span></p>
<p>It tells us that the proportion of numbers between a and b is given by this integral. Here <span class="math">\(\mu\)</span> and <span class="math">\(\sigma\)</span> are refereed to as the mean and standard deviation. Those are the only two numbers we need to know in the case of normal distribution and we can answer the question, what proportion of x is there between any a and any b? For example, you’re going to know that 95% of your data is within two standard deviations of the average. You’re going to know that 99% of your data is within 2.5 standard deviations of the average. 68% are within one standard deviation from the average.</p>
<p>To look whether the data is normally distributed, we can look at what is called a <strong>quantile-quantile plot</strong>. We have to compute the percentiles. So we go 1%, 2%, 3%. So we compute, what is the number for which 1% of the data is below that number? 2%, 3%, 4%, and we compute what are called the percentiles.We do that for our data set, and we also do that for the normal distribution. The percentiles of normal distribution (y-axis) is plotted according to the percentiles of our data (x-axis). If these two have the same distribution, then these points should fall on the identity line, and in this case they are pretty close. So this is telling us that this is a pretty good approximation.</p>
<p><strong>Standard units</strong> is a very useful concept. If you know your data is approximately normal, then you can convert it to what are called standard units, by subtracting the mean and dividing by the standard deviation for each point in your distribution. Those are now going to have normal distribution with mean 0 and standard deviation 1. The value means, how many standard deviations away am I from the mean? If we have a 2, it is a big number, because if it’s normally distributed, only 5% of the data is more than two standard deviations away from the mean.</p>
<pre class="success"><code>In summary, if your data is approximated by the normal distribution, then all you need to know about your data is the mean and standard deviation. With those two numbers, you can describe the proportion of any interval.

In other cases, for example gene expression without taking the log. The mean and standard deviation can&amp;#39;t be considered summaries in the same way. They may be useful for certain calculations, but they will not be summaries of the data in the same way as they are when the data is normally distributed.</code></pre>
</div>
<div id="licence" class="section level2">
<h2>Licence</h2>
<p><a href="https://www.sthda.com/english/english/wiki/mit-licence">Licence</a></p>
</div>
<div id="references" class="section level2">
<h2>References</h2>
<p><a href="https://github.com/genomicsclass">https://github.com/genomicsclass</a></p>
</div>

<script>jQuery(document).ready(function () {jQuery('h1,h2,h3,h4').addClass('formatter-title');});//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->

<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

<!-- END HTML -->]]></description>
			<pubDate>Thu, 25 Sep 2014 00:23:35 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[R basics and exploratory data analysis]]></title>
			<link>https://www.sthda.com/english/wiki/r-basics-and-exploratory-data-analysis</link>
			<guid>https://www.sthda.com/english/wiki/r-basics-and-exploratory-data-analysis</guid>
			<description><![CDATA[-=-]]></description>
			<pubDate>Wed, 24 Sep 2014 22:14:33 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[What we measure and why?]]></title>
			<link>https://www.sthda.com/english/wiki/what-we-measure-and-why</link>
			<guid>https://www.sthda.com/english/wiki/what-we-measure-and-why</guid>
			<description><![CDATA[<!-- START HTML -->

 <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">

<div id="TOC">
<ul>
<li><a href="#molecular-basis-of-phenotypic-variation">Molecular basis of phenotypic variation</a></li>
<li><a href="#dna-chromosomes-snps-and-other-variants">DNA: chromosomes, SNPs and other variants</a></li>
<li><a href="#gene-expression">Gene expression</a></li>
<li><a href="#licence">Licence</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>

<hr />
<pre class="warning"><code>This analysis was performed using R (ver. 3.1.0).</code></pre>
<div id="molecular-basis-of-phenotypic-variation" class="section level2">
<h2>Molecular basis of phenotypic variation</h2>
<p>Genomics is widely used in scientific and clinical research to make discoveries and to develop diagnostic tools. One of the uses of these technologies is to try to explain the molecular basis for phenotypic variation. For example, why cancer cells live longer than normal cells? And why are some cancer cells susceptible to treatment, while others are not?</p>
<p>One of the explanation for these phenotypic differences must come from the genome. DNA is transcribed into a message (RNA), and this message is translated into proteins. Changes in the amount and in the structure of these molecules can explain phenotypic variation. This is why we want to use these technologies to answer these questions.</p>
</div>
<div id="dna-chromosomes-snps-and-other-variants" class="section level2">
<h2>DNA: chromosomes, SNPs and other variants</h2>
<p>DNA is the molecule where genetic information is stored. DNA is empaquetted in chromosomes and stored in the nucleus. Every cell in our body has an exact copy of these chromosomes with the same information. Most of our cells have two alleles for each chromosome. And the exception are the sperm and egg that only have one.</p>
<p>Biologists have measured the DNA of many individuals and have noticed that, although 99.9% of their genome is identical. There are places where they see differences. And they’ll see a proportion of the population with one variant and the rest with another. These sites, their current estimates are in the millions, are called SNPs, and you will many times see two versions of the two possible bases at the site. This is why there’s so much variabilty between humans.</p>
<p>These variants are common in humans. And biologists today are trying to find if some of these SNPs can be used to explain phenotypic variation, like, for example, disease. There are some technologies to measure these SNPs for millions of sites at the same time.</p>
</div>
<div id="gene-expression" class="section level2">
<h2>Gene expression</h2>
<p>Different cell types have the same DNA so how are they so different? How is a liver cell so different from a neuron? How is a neuron so different from a colon cell?</p>
<p>One of the reasons this happens is because different genes are expressed. Genes are a small segments of DNA coding for protein.</p>
</div>
<div id="licence" class="section level2">
<h2>Licence</h2>
<p><a href="https://www.sthda.com/english/english/wiki/mit-licence">Licence</a></p>
</div>
<div id="references" class="section level2">
<h2>References</h2>
<p><a href="https://github.com/genomicsclass">https://github.com/genomicsclass</a></p>
</div>

<script>jQuery(document).ready(function () {jQuery('h1,h2,h3,h4').addClass('formatter-title');});//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->

<!-- END HTML -->]]></description>
			<pubDate>Wed, 24 Sep 2014 16:37:12 +0200</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[R programming skills]]></title>
			<link>https://www.sthda.com/english/wiki/r-programming-skills</link>
			<guid>https://www.sthda.com/english/wiki/r-programming-skills</guid>
			<description><![CDATA[<!-- START HTML -->

 <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">

<div id="TOC">
<ul>
<li><a href="#resources">Resources</a></li>
<li><a href="#r-refresher">R refresher</a><ul>
<li><a href="#data-summaries-summary-str">Data summaries: summary, str</a></li>
<li><a href="#aligning-two-objects-match-merge">Aligning two objects: match, merge</a></li>
<li><a href="#analysis-over-groups-split-tapply-and-dplyr-libary">Analysis over groups: split, tapply, and dplyr libary</a></li>
</ul></li>
<li><a href="#installing-bioconductor-and-finding-help">Installing Bioconductor and finding help</a><ul>
<li><a href="#installing-bioconductor">Installing Bioconductor</a></li>
<li><a href="#finding-help">Finding help</a></li>
<li><a href="#vignettes">Vignettes</a></li>
</ul></li>
<li><a href="#licence">Licence</a></li>
<li><a href="#references">References</a></li>
</ul>
</div>

<hr />
<pre class="warning"><code>This analysis was performed using R (ver. 3.1.0).</code></pre>
<div id="resources" class="section level2">
<h2>Resources</h2>
<ol style="list-style-type: decimal">
<li><a href="http://cran.r-project.org/doc/contrib/Short-refcard.pdf">R Ref card</a></li>
<li><a href="http://cran.r-project.org/other-docs.html">http://cran.r-project.org/other-docs.html</a></li>
<li><a href="http://www.statmethods.net/">Quick-R</a></li>
<li><a href="http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual">R and bioconductor Manual</a></li>
</ol>
</div>
<div id="r-refresher" class="section level2">
<h2>R refresher</h2>
<p>Summarizing, matching and grouping dataset.</p>
<pre class="r"><code>#load a data.frame
rats <- data.frame(id = paste0("rat",1:10),  
                   sex = factor(rep(c("female","male"),each=5)),
                   weight = c(2,4,1,11,18,12,7,12,19,20),
                   length = c(100,105,115,130,95,150,165,180,190,175))
rats</code></pre>
<pre><code>##       id    sex weight length
## 1   rat1 female      2    100
## 2   rat2 female      4    105
## 3   rat3 female      1    115
## 4   rat4 female     11    130
## 5   rat5 female     18     95
## 6   rat6   male     12    150
## 7   rat7   male      7    165
## 8   rat8   male     12    180
## 9   rat9   male     19    190
## 10 rat10   male     20    175</code></pre>
<div id="data-summaries-summary-str" class="section level3">
<h3>Data summaries: summary, str</h3>
<p>The <code>summary</code> and <code>str</code> functions are two helpful functions for getting a sense of data. <code>summary</code> works on vectors or matrix-like objects (including data.frames). <code>str</code> works on an arbitrary R object and will compactly display the structure.</p>
<pre class="r"><code>summary(rats) #Data frame summaries</code></pre>
<pre><code>##        id        sex        weight          length   
##  rat1   :1   female:5   Min.   : 1.00   Min.   : 95  
##  rat10  :1   male  :5   1st Qu.: 4.75   1st Qu.:108  
##  rat2   :1              Median :11.50   Median :140  
##  rat3   :1              Mean   :10.60   Mean   :140  
##  rat4   :1              3rd Qu.:16.50   3rd Qu.:172  
##  rat5   :1              Max.   :20.00   Max.   :190  
##  (Other):4</code></pre>
<pre class="r"><code>summary(rats$weight) #Numeric vector summaries</code></pre>
<pre><code>##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.75   11.50   10.60   16.50   20.00</code></pre>
<pre class="r"><code>str(rats) #Display data structure</code></pre>
<pre><code>## &amp;#39;data.frame&amp;#39;:    10 obs. of  4 variables:
##  $ id    : Factor w/ 10 levels "rat1","rat10",..: 1 3 4 5 6 7 8 9 10 2
##  $ sex   : Factor w/ 2 levels "female","male": 1 1 1 1 1 2 2 2 2 2
##  $ weight: num  2 4 1 11 18 12 7 12 19 20
##  $ length: num  100 105 115 130 95 150 165 180 190 175</code></pre>
</div>
<div id="aligning-two-objects-match-merge" class="section level3">
<h3>Aligning two objects: match, merge</h3>
<p>We load another example data frame, with the original ID and another secretID. Suppose we want to sort the original data frame by the secretID.</p>
<pre class="r"><code>ratsTable <- data.frame(id = paste0("rat",c(6,9,7,3,5,1,10,4,8,2)),
                        secretID = 1:10)
ratsTable</code></pre>
<pre><code>##       id secretID
## 1   rat6        1
## 2   rat9        2
## 3   rat7        3
## 4   rat3        4
## 5   rat5        5
## 6   rat1        6
## 7  rat10        7
## 8   rat4        8
## 9   rat8        9
## 10  rat2       10</code></pre>
<p><code>match</code> gives you, for each element in the first vector, the index of the first match in the second vector.</p>
<pre class="r"><code>match(ratsTable$id, rats$id)</code></pre>
<pre><code>##  [1]  6  9  7  3  5  1 10  4  8  2</code></pre>
<pre class="r"><code>rats[match(ratsTable$id, rats$id),] </code></pre>
<pre><code>##       id    sex weight length
## 6   rat6   male     12    150
## 9   rat9   male     19    190
## 7   rat7   male      7    165
## 3   rat3 female      1    115
## 5   rat5 female     18     95
## 1   rat1 female      2    100
## 10 rat10   male     20    175
## 4   rat4 female     11    130
## 8   rat8   male     12    180
## 2   rat2 female      4    105</code></pre>
<pre class="r"><code>cbind(rats[match(ratsTable$id, rats$id),], ratsTable)</code></pre>
<pre><code>##       id    sex weight length    id secretID
## 6   rat6   male     12    150  rat6        1
## 9   rat9   male     19    190  rat9        2
## 7   rat7   male      7    165  rat7        3
## 3   rat3 female      1    115  rat3        4
## 5   rat5 female     18     95  rat5        5
## 1   rat1 female      2    100  rat1        6
## 10 rat10   male     20    175 rat10        7
## 4   rat4 female     11    130  rat4        8
## 8   rat8   male     12    180  rat8        9
## 2   rat2 female      4    105  rat2       10</code></pre>
<p>Or you can use the <code>merge</code> function which will handle everything for you.</p>
<pre class="r"><code>ratsMerged <- merge(rats, ratsTable, by.x="id", by.y="id")
ratsMerged[order(ratsMerged$secretID),]</code></pre>
<pre><code>##       id    sex weight length secretID
## 7   rat6   male     12    150        1
## 10  rat9   male     19    190        2
## 8   rat7   male      7    165        3
## 4   rat3 female      1    115        4
## 6   rat5 female     18     95        5
## 1   rat1 female      2    100        6
## 2  rat10   male     20    175        7
## 5   rat4 female     11    130        8
## 9   rat8   male     12    180        9
## 3   rat2 female      4    105       10</code></pre>
</div>
<div id="analysis-over-groups-split-tapply-and-dplyr-libary" class="section level3">
<h3>Analysis over groups: split, tapply, and dplyr libary</h3>
<p>Suppose we need to calculate the average rat weight for each sex. We could start by splitting the weight vector into a list of weight vectors divided by sex. <code>split</code> is a useful function for breaking up a vector into groups defined by a second vector, typically a factor. We can then use the <code>lapply</code> function to calculate the average of each element of the list, which are vectors of weights.</p>
<pre class="r"><code>sp <- split(rats$weight, rats$sex)
sp</code></pre>
<pre><code>## $female
## [1]  2  4  1 11 18
## 
## $male
## [1] 12  7 12 19 20</code></pre>
<pre class="r"><code>lapply(sp, mean)</code></pre>
<pre><code>## $female
## [1] 7.2
## 
## $male
## [1] 14</code></pre>
<p>A shortcut for this is to use <code>tapply</code> and give the function which should run on each element of the list as a third argument:</p>
<pre class="r"><code>tapply(rats$weight, rats$sex, mean)</code></pre>
<pre><code>## female   male 
##    7.2   14.0</code></pre>
<p>R library “dplyr” can easily accomplish the same task as above. The “d” in the name is for data.frame, and the “ply” is because the library attempts to simplify tasks typically used by the set of functions: <code>sapply</code>, <code>lapply</code>, <code>tapply</code>, etc. Here is the same task as before done with the dplyr functions <code>group_by</code> and <code>summarise</code>:</p>
<pre class="r"><code>library(dplyr)
sexes <- group_by(rats, sex)
summarise(sexes, ave=mean(weight))</code></pre>
<pre><code>## Source: local data frame [2 x 2]
## 
##      sex  ave
## 1 female  7.2
## 2   male 14.0</code></pre>
<p>With dplyr, you can chain operations using the <code>%.%</code> operator:</p>
<pre class="r"><code>rats %.% group_by(sex) %.% summarise(ave=mean(weight))</code></pre>
<pre><code>## Source: local data frame [2 x 2]
## 
##      sex  ave
## 1 female  7.2
## 2   male 14.0</code></pre>
</div>
</div>
<div id="installing-bioconductor-and-finding-help" class="section level2">
<h2>Installing Bioconductor and finding help</h2>
<div id="installing-bioconductor" class="section level3">
<h3>Installing Bioconductor</h3>
<p>In order to install Bioconductor, copy the following two lines into your R console. This installation will take several minutes.</p>
<pre class="r"><code>source("http://bioconductor.org/biocLite.R")
biocLite()</code></pre>
<p>To install a specific package from bioconductor, use the following code:</p>
<pre class="r"><code>source("http://bioconductor.org/biocLite.R")
biocLite(c("genefilter","geneplotter"))</code></pre>
</div>
<div id="finding-help" class="section level3">
<h3>Finding help</h3>
<p>Type a question mark ?, followed by the function name and hitting return.</p>
<pre class="r"><code>?mean
example(mean)
mean #get the source code
class(6) #get the class of an R object</code></pre>
</div>
<div id="vignettes" class="section level3">
<h3>Vignettes</h3>
<p><strong>Vignettes</strong> are documents which accompany R packages and are required for every Bioconductor package. To browse the various vignettes of a package, use the following code :</p>
<pre class="r"><code>browseVignettes(package="Biobase")</code></pre>
</div>
</div>
<div id="licence" class="section level2">
<h2>Licence</h2>
<p><a href="https://www.sthda.com/english/english/wiki/mit-licence">Licence</a></p>
</div>
<div id="references" class="section level2">
<h2>References</h2>
<p><a href="https://github.com/genomicsclass">https://github.com/genomicsclass</a></p>
</div>

<script>jQuery(document).ready(function () {jQuery('h1,h2,h3,h4').addClass('formatter-title');});//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->

<!-- END HTML -->]]></description>
			<pubDate>Wed, 24 Sep 2014 16:36:51 +0200</pubDate>
			
		</item>
		
	</channel>
</rss>
