Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x).
Briefly, the goal of a regression model is to build a mathematical equation that defines y as a function of the x variables. This equation can then be used to predict the outcome (y) from new values of the predictor variables (x).
Linear regression is the simplest and most popular technique for predicting a continuous variable. It assumes a linear relationship between the outcome and the predictor variables.
The linear regression equation can be written as y = b0 + b*x + e, where:
Technically, the linear regression coefficients are determined so that the error in predicting the outcome value is minimized. This way of computing the beta coefficients is called the Ordinary Least Squares (OLS) method.
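As a minimal sketch in base R (using the built-in mtcars data, chosen here purely for illustration), the function lm() estimates the beta coefficients by ordinary least squares:

```r
# Simple linear regression: mpg as a function of wt (built-in mtcars data)
model <- lm(mpg ~ wt, data = mtcars)
# The fitted coefficients b0 (intercept) and b (slope), estimated by
# minimizing the sum of squared prediction errors
coef(model)
```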
When you have multiple predictor variables, say x1 and x2, the regression equation can be written as y = b0 + b1*x1 + b2*x2 + e. In some situations, there might be an interaction effect between some predictors; for example, increasing the value of a predictor variable x1 may increase the effectiveness of the predictor x2 in explaining the variation in the outcome variable.
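In R, an interaction term can be added with the `*` (or `:`) operator in the model formula. A small sketch, again using mtcars as illustrative data:

```r
# Model with an interaction: the effect of wt on mpg may depend on hp
model_int <- lm(mpg ~ wt * hp, data = mtcars)  # expands to wt + hp + wt:hp
# The wt:hp row holds the estimated interaction coefficient
summary(model_int)$coefficients
```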
Note also that linear regression models can incorporate both continuous and categorical predictor variables.
When you build a linear regression model, you need to diagnose whether the linear model is suitable for your data.
In some cases, the relationship between the outcome and the predictor variables is not linear. In these situations, you need to build a non-linear regression model, such as a polynomial or spline regression.
When you have multiple predictors in the regression model, you might want to select the best combination of predictor variables to build an optimal predictive model. This process, called model selection, consists of comparing multiple models containing different sets of predictors in order to select the best-performing model, that is, the one that minimizes the prediction error. Linear model selection approaches include best subsets regression and stepwise regression.
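As one illustration of stepwise regression, base R's step() compares nested models by AIC (a sketch on the mtcars data; best subsets regression would require an additional package such as leaps):

```r
# Stepwise selection starting from the full model, guided by AIC
full_model <- lm(mpg ~ ., data = mtcars)
best_model <- step(full_model, direction = "both", trace = 0)
# The predictors retained in the selected model
attr(terms(best_model), "term.labels")
```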
In some situations, such as in genomic fields, you might have a large multivariate data set containing some correlated predictors. In this case, the information in the original data set can be summarized into a few new variables (called principal components) that are linear combinations of the original variables. These few principal components can be used to build a linear model, which might perform better on your data. This approach is known as principal component-based methods, which include principal component regression and partial least squares regression.
An alternative method to simplify a large multivariate model is to use penalized regression, which penalizes the model for having too many variables. The best-known penalized regression methods include ridge regression and lasso regression.
You can apply all these different regression models to your data, compare the models, and finally select the approach that best explains your data. To do so, you need statistical metrics to compare the performance of the different models in explaining your data and in predicting the outcome for new test data.
The best model is defined as the model that has the lowest prediction error. The most popular metrics for comparing regression models, include:
RMSE = sqrt(mean((observeds - predicteds)^2))
The lower the RMSE, the better the model. Note that the above-mentioned metrics should be computed on new test data that has not been used to train (i.e., build) the model. If you have a large data set, with many records, you can randomly split the data into a training set (80%, for building the predictive model) and a test set or validation set (20%, for evaluating model performance).
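A minimal sketch of this workflow in base R, using the built-in mtcars data and an 80/20 split (the model formula is illustrative):

```r
set.seed(123)  # for a reproducible split
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.8 * n))
train_set <- mtcars[train_idx, ]
test_set  <- mtcars[-train_idx, ]
# Build the model on the training set only
model <- lm(mpg ~ wt + hp, data = train_set)
# Evaluate on the held-out test set
predicted <- predict(model, newdata = test_set)
rmse <- sqrt(mean((test_set$mpg - predicted)^2))
rmse
```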
One of the most robust and popular approaches for estimating model performance is k-fold cross-validation. It can be applied even on a small data set. k-fold cross-validation works as follows:
Taken together, the best model is the one that has the lowest cross-validation error (RMSE).
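The procedure above can be sketched by hand in base R (5 folds on mtcars; the model formula is illustrative):

```r
set.seed(123)
k <- 5
# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]   # train on k - 1 folds
  test  <- mtcars[folds == i, ]   # test on the held-out fold
  fit <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})
mean(cv_rmse)  # the cross-validation error (RMSE)
```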
In this part, you will learn different methods for regression analysis, and we’ll provide practical examples in R.
The content is organized as follows:
This article describes solutions for preserving semi-transparency when saving ggplot2-based graphs to a high-quality PostScript (.eps) file format.
Contents:
To illustrate this, we start by creating ggplot2-based survival curves using the function ggsurvplot() in the survminer package. The ggsurvplot() function creates survival curves with the 95% confidence bands in a semi-transparent color.
First, install survminer (if needed) as follows:
install.packages("survminer")
Then type this:
# Fit survival curves
require("survival")
fit<- survfit(Surv(time, status) ~ sex, data = lung)
# Visualize
library("survminer")
p <- ggsurvplot(fit, data = lung,
surv.median.line = "hv", # Add median survival lines
pval = TRUE, # Add p-value
conf.int = TRUE, # Add the 95% confidence band
risk.table = TRUE, # Add risk table
tables.height = 0.2,
tables.theme = theme_cleantable(),
palette = "jco",
ggtheme = theme_bw()
)
print(p)
In the plot above, the confidence band is semi-transparent. The plot can be saved to a PDF file without losing the semi-transparent color.
If you try to export the picture as a vector file (EPS), the 95% confidence interval disappears and the saved plot looks as follows:
The problem is that EPS in R does not support transparency.
A simple alternative is to export the plot into SVG file format. In the following sections, we’ll describe convenient solutions to save high-quality ggplots by preserving semi-transparency.
You can use the ggsave() function in ggplot2 as follows:
ggsave(filename = "survival-curves.eps",
plot = print(p),
device = cairo_eps)
Or use this:
cairo_ps(filename = "survival-curves.eps",
width = 7, height = 7, pointsize = 12,
fallback_resolution = 300)
print(p)
dev.off()
Note that the argument fallback_resolution controls the resolution, in dpi, at which semi-transparent areas are rasterized (the rest stays in vector format).
You can export the plot to PowerPoint using the ReporteRs package. ReporteRs gives you a fully editable vector format with full support for transparency as well.
We previously described how to create and format PowerPoint documents from R using the ReporteRs package. We also described how to export an editable ggplot from R to PowerPoint.
Briefly, to export our survival curves from R to PowerPoint, the script looks like this:
library('ReporteRs')
# Create a new powerpoint document
doc <- pptx()
# Add a new slide into the ppt document
doc <- addSlide(doc, slide.layout = "Two Content" )
# Add a slide title
doc <- addTitle(doc, "Survival Curves: Editable Vector Graphics" )
# Print the survival curves in the powerpoint
doc <- addPlot(doc, function() print(p, newpage = FALSE),
vector.graphic = TRUE # Make it editable
)
# write the document to a file
writeDoc(doc, file = "editable-survival-curves.pptx")
The output looks like this:
You can edit the plot in PowerPoint. See the video below: Editing ggplots Exported with ReporteRs into PowerPoint.
The ggpubr R package facilitates the creation of beautiful ggplot2-based graphs for researchers with non-advanced programming backgrounds.
The current material presents a collection of articles for simply creating and customizing publication-ready plots using ggpubr. To see some examples of plots created with ggpubr click the following link: ggpubr examples.
ggpubr Key features:
Official online documentation: http://www.sthda.com/english/rpkgs/ggpubr.
In this first volume of simplyR, we are excited to share our Practical Guides to Partitioning Clustering.
The course materials contain 3 chapters, organized as follows:
Contents:
K-Medoids Essentials: PAM clustering
Contents:
CLARA - Clustering Large Applications
Contents:
Example of plots:
Although there are several good books on principal component methods (PCMs) and related topics, we felt that many of them are either too theoretical or too advanced.
This book provides solid practical guidance for summarizing, visualizing and interpreting the most important information in large multivariate data sets, using principal component methods in R.
Where to find the book:
The following figure illustrates the type of analysis to be performed depending on the type of variables contained in the data set.
There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.
However, the results are presented differently depending on the package used.
To help in the interpretation and in the visualization of multivariate analysis - such as cluster analysis and principal component methods - we developed an easy-to-use R package named factoextra (official online documentation: http://www.sthda.com/english/rpkgs/factoextra).
No matter which package you decide to use for computing principal component methods, the factoextra R package can help you easily extract the analysis results from the different packages mentioned above, in a human-readable data format. factoextra also provides convenient solutions for creating beautiful ggplot2-based graphs.
The methods whose outputs can be visualized using the factoextra package are shown in the figure below:
In this book, we’ll use mainly:
The other packages - ade4, ExPosition, etc - will be also presented briefly.
This book contains 4 parts.
Part I provides a quick introduction to R and presents the key features of FactoMineR and factoextra.
Part II describes classical principal component methods to analyze data sets containing, predominantly, either continuous or categorical variables. These methods include:
In Part III, you’ll learn advanced methods for analyzing a data set containing a mix of variables (continuous and categorical) structured or not into groups:
Part IV covers hierarchical clustering on principal components (HCPC), which is useful for performing clustering with a data set containing only categorical variables, or with a mix of categorical and continuous variables.
This book presents the basic principles of the different methods and provides many examples in R, offering solid guidance in data mining for students and researchers.
Key features:
At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter. Additionally, we provide links to other resources and to our hand-curated list of videos on principal component methods for further learning.
Some examples of plots generated in this book are shown hereafter. You’ll learn how to create, customize and interpret these plots.
Download the preview of the book at: Principal Component Methods in R (Book preview)
simplyR is a web space where we’ll be posting practical and easy guides for solving real important problems using R programming language.
As we aren’t fans of unnecessary complications, we’ll keep the content of our tutorials / R codes as simple as possible.
Many tutorials are coming soon.
Topics we love include:
Samples of our recent publications, on R & Data Science, are:
If you want to contribute, read this: http://www.sthda.com/english/pages/contribute-to-sthda
Correlation matrix analysis is an important method for finding dependence between variables. Computing a correlation matrix and drawing a correlogram are explained here. The aim of this article is to show you how to get the lower and the upper triangular parts of a correlation matrix. We will also use the xtable R package to display a nice correlation table in HTML or LaTeX format.
Note that online software is also available here to compute correlation matrix and to plot a correlogram without any installation.
Contents:
The following R code computes a correlation matrix using mtcars data. Click here to read more.
mcor<-round(cor(mtcars),2)
mcor
mpg cyl disp hp drat wt qsec vs am gear carb
mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43
qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 0.74 -0.23 -0.21 -0.66
vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57
am 0.60 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1.00 0.79 0.06
gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 0.21 0.79 1.00 0.27
carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1.00
The result is a table of correlation coefficients between all possible pairs of variables.
To get the lower or the upper part of a correlation matrix, the R functions lower.tri() and upper.tri() can be used. The formats of the functions are:
lower.tri(x, diag = FALSE)
upper.tri(x, diag = FALSE)
- x: the correlation matrix
- diag: if TRUE, the diagonal is included in the result
The two functions above return a matrix of logicals with the same size as the correlation matrix. The entries are TRUE in the lower or upper triangle:
upper.tri(mcor)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[2,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[3,] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[4,] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[5,] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
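The effect of the diag argument can be seen on a small example matrix:

```r
m <- matrix(1:9, nrow = 3)
# By default the diagonal is excluded (FALSE on the diagonal)
upper.tri(m)
# With diag = TRUE, the diagonal is included (TRUE on the diagonal)
upper.tri(m, diag = TRUE)
```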
# Hide upper triangle
upper<-mcor
upper[upper.tri(mcor)]<-""
upper<-as.data.frame(upper)
upper
mpg cyl disp hp drat wt qsec vs am gear carb
mpg 1
cyl -0.85 1
disp -0.85 0.9 1
hp -0.78 0.83 0.79 1
drat 0.68 -0.7 -0.71 -0.45 1
wt -0.87 0.78 0.89 0.66 -0.71 1
qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1
vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1
am 0.6 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1
gear 0.48 -0.49 -0.56 -0.13 0.7 -0.58 -0.21 0.21 0.79 1
carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1
#Hide lower triangle
lower<-mcor
lower[lower.tri(mcor, diag=TRUE)]<-""
lower<-as.data.frame(lower)
lower
mpg cyl disp hp drat wt qsec vs am gear carb
mpg -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.6 0.48 -0.55
cyl 0.9 0.83 -0.7 0.78 -0.59 -0.81 -0.52 -0.49 0.53
disp 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
hp -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
drat -0.71 0.09 0.44 0.71 0.7 -0.09
wt -0.17 -0.55 -0.69 -0.58 0.43
qsec 0.74 -0.23 -0.21 -0.66
vs 0.17 0.21 -0.57
am 0.79 0.06
gear 0.27
carb
library(xtable)
print(xtable(upper), type="html")
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
mpg | 1 | ||||||||||
cyl | -0.85 | 1 | |||||||||
disp | -0.85 | 0.9 | 1 | ||||||||
hp | -0.78 | 0.83 | 0.79 | 1 | |||||||
drat | 0.68 | -0.7 | -0.71 | -0.45 | 1 | ||||||
wt | -0.87 | 0.78 | 0.89 | 0.66 | -0.71 | 1 | |||||
qsec | 0.42 | -0.59 | -0.43 | -0.71 | 0.09 | -0.17 | 1 | ||||
vs | 0.66 | -0.81 | -0.71 | -0.72 | 0.44 | -0.55 | 0.74 | 1 | |||
am | 0.6 | -0.52 | -0.59 | -0.24 | 0.71 | -0.69 | -0.23 | 0.17 | 1 | ||
gear | 0.48 | -0.49 | -0.56 | -0.13 | 0.7 | -0.58 | -0.21 | 0.21 | 0.79 | 1 | |
carb | -0.55 | 0.53 | 0.39 | 0.75 | -0.09 | 0.43 | -0.66 | -0.57 | 0.06 | 0.27 | 1 |
The custom function corstars() is used to combine the correlation coefficients and the significance levels. The R code for the function is provided at the end of this article. It requires 2 packages:
Before continuing with the following exercises, you should first copy and paste the source code of the corstars() function, which you can find at the bottom of this article.
corstars(mtcars[,1:7], result="html")
mpg | cyl | disp | hp | drat | wt | |
---|---|---|---|---|---|---|
mpg | ||||||
cyl | -0.85**** | |||||
disp | -0.85**** | 0.90**** | ||||
hp | -0.78**** | 0.83**** | 0.79**** | |||
drat | 0.68**** | -0.70**** | -0.71**** | -0.45** | ||
wt | -0.87**** | 0.78**** | 0.89**** | 0.66**** | -0.71**** | |
qsec | 0.42* | -0.59*** | -0.43* | -0.71**** | 0.09 | -0.17 |
p < .0001 ‘****’; p < .001 ‘***’; p < .01 ‘**’; p < .05 ‘*’
The code of the corstars() function (adapted from code posted on this forum and on this blog):
# x is a matrix containing the data
# method : correlation method. "pearson" or "spearman" is supported
# removeTriangle : remove upper or lower triangle
# results : if "html" or "latex"
# the results will be displayed in html or latex format
corstars <-function(x, method=c("pearson", "spearman"), removeTriangle=c("upper", "lower"),
result=c("none", "html", "latex")){
#Compute correlation matrix
require(Hmisc)
x <- as.matrix(x)
correlation_matrix<-rcorr(x, type=method[1])
R <- correlation_matrix$r # Matrix of correlation coefficients
p <- correlation_matrix$P # Matrix of p-values
## Define notation for significance levels; spacing is important.
mystars <- ifelse(p < .0001, "****", ifelse(p < .001, "*** ", ifelse(p < .01, "** ", ifelse(p < .05, "* ", " "))))
## truncate the correlation matrix to two decimals
R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
## build a new matrix that includes the correlations with their appropriate stars
Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
diag(Rnew) <- paste(diag(R), " ", sep="")
rownames(Rnew) <- colnames(x)
colnames(Rnew) <- paste(colnames(x), "", sep="")
## remove upper triangle of correlation matrix
if(removeTriangle[1]=="upper"){
Rnew <- as.matrix(Rnew)
Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
}
## remove lower triangle of correlation matrix
else if(removeTriangle[1]=="lower"){
Rnew <- as.matrix(Rnew)
Rnew[lower.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
}
## remove last column and return the correlation matrix
Rnew <- cbind(Rnew[1:length(Rnew)-1])
if (result[1]=="none") return(Rnew)
else{
if(result[1]=="html") print(xtable(Rnew), type="html")
else print(xtable(Rnew), type="latex")
}
}
This analysis was performed using R (ver. 3.3.2).
Contents
Comparing two variances is useful in several cases, including:
When you want to perform a two-sample t-test: the equality of the variances of the two samples has to be checked first.
When you want to compare the variability of a new measurement method with that of an old one. Does the new method reduce the variability of the measure?
Typical research questions are:
In statistics, we can define the corresponding null hypothesis (\(H_0\)) as follows:
The corresponding alternative hypotheses (\(H_a\)) are as follows:
Note that:
The test statistic can be obtained by computing the ratio of the two variances \(S_A^2\) and \(S_B^2\).
\[F = \frac{S_A^2}{S_B^2}\]
The degrees of freedom are \(n_A - 1\) (for the numerator) and \(n_B - 1\) (for the denominator).
Note that the more this ratio deviates from 1, the stronger the evidence for unequal population variances.
Note also that the F-test requires the two samples to be normally distributed.
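The statistic can be computed by hand with base R (simulated data, used here purely for illustration); var.test(), described below, returns the same values:

```r
set.seed(123)
a <- rnorm(30, sd = 1)    # sample A
b <- rnorm(30, sd = 1.5)  # sample B
f <- var(a) / var(b)      # F statistic: ratio of the sample variances
# Two-sided p-value from the F distribution with (nA - 1, nB - 1) df
p_left  <- pf(f, df1 = length(a) - 1, df2 = length(b) - 1)
p_value <- 2 * min(p_left, 1 - p_left)
c(F = f, p.value = p_value)
```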
The R function var.test() can be used to compare two variances as follows:
# Method 1
var.test(values ~ groups, data,
alternative = "two.sided")
# or Method 2
var.test(x, y, alternative = "two.sided")
To import your data, use the following R code:
# If .txt tab file, use this
my_data <- read.delim(file.choose())
# Or, if .csv file, use this
my_data <- read.csv(file.choose())
Here, we’ll use the built-in R data set named ToothGrowth:
# Store the data in the variable my_data
my_data <- ToothGrowth
To get an idea of what the data look like, we start by displaying a random sample of 10 rows using the function sample_n() [in the dplyr package]:
library("dplyr")
sample_n(my_data, 10)
len supp dose
43 23.6 OJ 1.0
28 21.5 VC 2.0
25 26.4 VC 2.0
56 30.9 OJ 2.0
46 25.2 OJ 1.0
7 11.2 VC 0.5
16 17.3 VC 1.0
4 5.8 VC 0.5
48 21.2 OJ 1.0
37 8.2 OJ 0.5
We want to test the equality of variances between the two groups OJ and VC in the column “supp”.
The F-test is very sensitive to departures from the normality assumption. You need to check whether the data are normally distributed before using the F-test.
The Shapiro-Wilk test can be used to check whether the normality assumption holds. It’s also possible to use a Q-Q plot (quantile-quantile plot) to graphically evaluate the normality of a variable. A Q-Q plot draws the correlation between a given sample and the normal distribution.
If there is doubt about normality, the better choice is Levene’s test or the Fligner-Killeen test, which are less sensitive to departures from normality.
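On the ToothGrowth data used here, these checks can be run with base R functions (fligner.test() needs no extra package; Levene's test would require the car package):

```r
my_data <- ToothGrowth
# Shapiro-Wilk normality test for each group
shapiro.test(my_data$len[my_data$supp == "OJ"])
shapiro.test(my_data$len[my_data$supp == "VC"])
# Fligner-Killeen test: a robust alternative for comparing variances
fligner.test(len ~ supp, data = my_data)
```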
# F-test
res.ftest <- var.test(len ~ supp, data = my_data)
res.ftest
F test to compare two variances
data: len by supp
F = 0.6386, num df = 29, denom df = 29, p-value = 0.2331
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.3039488 1.3416857
sample estimates:
ratio of variances
0.6385951
The function var.test() returns a list containing the following components:
The format of the R code for extracting these values is as follows:
# ratio of variances
res.ftest$estimate
ratio of variances
0.6385951
# p-value of the test
res.ftest$p.value
[1] 0.2331433
This analysis has been performed using R software (ver. 3.3.2).
To arrange multiple ggplot2 graphs on the same page, the standard R functions - par() and layout() - cannot be used.
The basic solution is to use the gridExtra R package, which comes with the following functions:
However, these functions make no attempt to align the plot panels; instead, the plots are simply placed into the grid as they are, so the axes are not aligned.
If axis alignment is required, you can switch to the cowplot package, which includes the function plot_grid() with the argument align. However, the cowplot package doesn’t provide a solution for multi-page layouts. Therefore, we provide the function ggarrange() [in ggpubr], a wrapper around the plot_grid() function, to arrange multiple ggplots over multiple pages. It can also create a common legend for multiple plots.
This article will show you, step by step, how to combine multiple ggplots on the same page, as well as over multiple pages, using helper functions available in the following R packages: ggpubr, cowplot and gridExtra. We’ll also describe how to export the arranged plots to a file.
Related articles:
Contents:
You need to install the ggpubr R package (version >= 0.1.3) to easily create ggplot2-based publication-ready plots.
We recommend installing the latest development version from GitHub as follows:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
If the installation from GitHub fails, try to install from CRAN as follows:
install.packages("ggpubr")
Note that the installation of ggpubr will install the gridExtra and cowplot packages, so you don’t need to re-install them.
Load ggpubr:
library(ggpubr)
Data: ToothGrowth and mtcars data sets.
# ToothGrowth
data("ToothGrowth")
head(ToothGrowth)
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
# mtcars
data("mtcars")
mtcars$name <- rownames(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
head(mtcars[, c("name", "wt", "mpg", "cyl")])
name wt mpg cyl
Mazda RX4 Mazda RX4 2.620 21.0 6
Mazda RX4 Wag Mazda RX4 Wag 2.875 21.0 6
Datsun 710 Datsun 710 2.320 22.8 4
Hornet 4 Drive Hornet 4 Drive 3.215 21.4 6
Hornet Sportabout Hornet Sportabout 3.440 18.7 8
Valiant Valiant 3.460 18.1 6
Here, we’ll use ggplot2-based plotting functions available in ggpubr. You can use any ggplot2 functions to create the plots that you want for arranging them later.
We’ll start by creating 4 different plots:
You’ll learn how to combine these plots in the next sections using specific functions.
# Box plot (bp)
bxp <- ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco")
bxp
# Dot plot (dp)
dp <- ggdotplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco", binwidth = 1)
dp
Create ordered bar plots. Change the fill color by the grouping variable “cyl”. Sorting will be done inside each group.
# Bar plot (bp)
bp <- ggbarplot(mtcars, x = "name", y = "mpg",
fill = "cyl", # change fill color by cyl
color = "white", # Set bar border colors to white
palette = "jco", # jco journal color palette. see ?ggpar
sort.val = "asc", # Sort the value in ascending order
sort.by.groups = TRUE, # Sort inside each group
x.text.angle = 90 # Rotate x axis texts vertically
)
bp + font("x.text", size = 8)
# Scatter plots (sp)
sp <- ggscatter(mtcars, x = "wt", y = "mpg",
add = "reg.line", # Add regression line
conf.int = TRUE, # Add confidence interval
color = "cyl", palette = "jco", # Color by groups "cyl"
shape = "cyl" # Change point shape by groups "cyl"
)+
stat_cor(aes(color = cyl), label.x = 3) # Add correlation coefficient
sp
To arrange multiple ggplots on one single page, we’ll use the function ggarrange() [in ggpubr], which is a wrapper around the function plot_grid() [in the cowplot package]. Compared to the standard function plot_grid(), ggarrange() can arrange multiple ggplots over multiple pages.
ggarrange(bxp, dp, bp + rremove("x.text"),
labels = c("A", "B", "C"),
ncol = 2, nrow = 2)
Alternatively, you can also use the function plot_grid() [in cowplot]:
library("cowplot")
plot_grid(bxp, dp, bp + rremove("x.text"),
labels = c("A", "B", "C"),
ncol = 2, nrow = 2)
or, the function grid.arrange() [in gridExtra]:
library("gridExtra")
grid.arrange(bxp, dp, bp + rremove("x.text"),
ncol = 2, nrow = 2)
R function: annotate_figure() [in ggpubr].
figure <- ggarrange(sp, bp + font("x.text", size = 10),
ncol = 1, nrow = 2)
annotate_figure(figure,
top = text_grob("Visualizing mpg", color = "red", face = "bold", size = 14),
bottom = text_grob("Data source: \n mtcars data set", color = "blue",
hjust = 1, x = 1, face = "italic", size = 10),
left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90),
right = "I'm done, thanks :-)!",
fig.lab = "Figure 1", fig.lab.face = "bold"
)
Note that, the function annotate_figure() supports any ggplots.
A real use case is, for example, when plotting survival curves with the risk table placed under the main plot.
To illustrate this case, we’ll use the survminer package. First, install it using install.packages("survminer"), then type this:
# Fit survival curves
library(survival)
fit <- survfit( Surv(time, status) ~ adhere, data = colon )
# Plot survival curves
library(survminer)
ggsurv <- ggsurvplot(fit, data = colon,
palette = "jco", # jco palette
pval = TRUE, pval.coord = c(500, 0.4), # Add p-value
risk.table = TRUE # Add risk table
)
names(ggsurv)
[1] "plot" "table" "data.survplot" "data.survtable"
ggsurv is a list including the components:
You can arrange the survival plot and the risk table as follow:
ggarrange(ggsurv$plot, ggsurv$table, heights = c(2, 0.7),
ncol = 1, nrow = 2)
It can be seen that the axes of the survival plot and the risk table are not aligned vertically. To align them, specify the argument align as follows:
ggarrange(ggsurv$plot, ggsurv$table, heights = c(2, 0.7),
ncol = 1, nrow = 2, align = "v")
We’ll use nested ggarrange() functions to change column/row span of plots.
For example, using the R code below:
ggarrange(sp, # First row with scatter plot
ggarrange(bxp, dp, ncol = 2, labels = c("B", "C")), # Second row with box and dot plots
nrow = 2,
labels = "A" # Labels of the scatter plot
)
The combination of the functions ggdraw() + draw_plot() + draw_plot_label() [in cowplot] can be used to place graphs at particular locations with a particular size.
ggdraw(). Initialize an empty drawing canvas:
ggdraw()
Note that, by default, coordinates run from 0 to 1, and the point (0, 0) is in the lower left corner of the canvas (see the figure below).
draw_plot(). Places a plot somewhere onto the drawing canvas:
draw_plot(plot, x = 0, y = 0, width = 1, height = 1)
draw_plot_label(). Adds a plot label to the upper left corner of a graph. It can handle vectors of labels with associated coordinates.
draw_plot_label(label, x = 0, y = 1, size = 16, ...)
For example, you can combine multiple plots, with particular locations and different sizes, as follow:
library("cowplot")
ggdraw() +
draw_plot(bxp, x = 0, y = .5, width = .5, height = .5) +
draw_plot(dp, x = .5, y = .5, width = .5, height = .5) +
draw_plot(bp, x = 0, y = 0, width = 1, height = 0.5) +
draw_plot_label(label = c("A", "B", "C"), size = 15,
x = c(0, 0.5, 0), y = c(1, 1, 0.5))
The function arrangeGrob() [in gridExtra] helps to change the row/column span of a plot.
For example, using the R code below:
library("gridExtra")
grid.arrange(sp, # First row with one plot spanning over 2 columns
arrangeGrob(bxp, dp, ncol = 2), # Second row with 2 plots in 2 different columns
nrow = 2) # Number of rows
It’s also possible to use the argument layout_matrix in the grid.arrange() function, to create a complex layout.
In the R code below layout_matrix is a 2x2 matrix (2 columns and 2 rows). The first row is all 1s, that’s where the first plot lives, spanning the two columns; the second row contains plots 2 and 3 each occupying one column.
grid.arrange(bp, # bar plot spanning two columns
bxp, sp, # box plot and scatter plot
ncol = 2, nrow = 2,
layout_matrix = rbind(c(1,1), c(2,3)))
Note that, it’s also possible to annotate the output of the grid.arrange() function using the helper function draw_plot_label() [in cowplot].
To easily annotate the grid.arrange() / arrangeGrob() output (a gtable), you should first transform it to a ggplot using the function as_ggplot() [in ggpubr ]. Next you can annotate it using the function draw_plot_label() [in cowplot].
library("gridExtra")
library("cowplot")
# Arrange plots using arrangeGrob
# returns a gtable (gt)
gt <- arrangeGrob(bp, # bar plot spanning two columns
bxp, sp, # box plot and scatter plot
ncol = 2, nrow = 2,
layout_matrix = rbind(c(1,1), c(2,3)))
# Add labels to the arranged plots
p <- as_ggplot(gt) + # transform to a ggplot
draw_plot_label(label = c("A", "B", "C"), size = 15,
x = c(0, 0, 0.5), y = c(1, 0.5, 0.5)) # Add labels
p
In the above R code, we used arrangeGrob() instead of grid.arrange().
Note that the main difference between these two functions is that grid.arrange() automatically draws the arranged plots.
As we want to annotate the arranged plots before drawing them, the function arrangeGrob() is preferred in this case.
The grid R package can be used to create complex layouts with the help of the function grid.layout(). It also provides the helper function viewport() to define a region (viewport) on the layout. The function print() is used to place plots in a specified region.
The different steps can be summarized as follows:
library(grid)
# Move to a new page
grid.newpage()
# Create layout : nrow = 3, ncol = 2
pushViewport(viewport(layout = grid.layout(nrow = 3, ncol = 2)))
# A helper function to define a region on the layout
define_region <- function(row, col){
viewport(layout.pos.row = row, layout.pos.col = col)
}
# Arrange the plots
print(sp, vp = define_region(row = 1, col = 1:2)) # Span over two columns
print(bxp, vp = define_region(row = 2, col = 1))
print(dp, vp = define_region(row = 2, col = 2))
print(bp + rremove("x.text"), vp = define_region(row = 3, col = 1:2))
To place a common unique legend in the margin of the arranged plots, the function ggarrange() [in ggpubr] can be used with the following arguments:
ggarrange(bxp, dp, labels = c("A", "B"),
common.legend = TRUE, legend = "bottom")
# Scatter plot colored by groups ("Species")
sp <- ggscatter(iris, x = "Sepal.Length", y = "Sepal.Width",
color = "Species", palette = "jco",
size = 3, alpha = 0.6)+
border()
# Marginal density plot of x (top panel) and y (right panel)
xplot <- ggdensity(iris, "Sepal.Length", fill = "Species",
palette = "jco")
yplot <- ggdensity(iris, "Sepal.Width", fill = "Species",
palette = "jco")+
rotate()
# Cleaning the plots
yplot <- yplot + clean_theme()
xplot <- xplot + clean_theme()
# Arranging the plot
ggarrange(xplot, NULL, sp, yplot,
ncol = 2, nrow = 2, align = "hv",
widths = c(2, 1), heights = c(1, 2),
common.legend = TRUE)
In this section, we’ll show how to plot a table and text alongside a chart. The iris data set will be used.
We start by creating the following plots:
We finish by arranging/combining the three plots using the function ggarrange() [in ggpubr].
# Density plot of "Sepal.Length"
#::::::::::::::::::::::::::::::::::::::
density.p <- ggdensity(iris, x = "Sepal.Length",
fill = "Species", palette = "jco")
# Draw the summary table of Sepal.Length
#::::::::::::::::::::::::::::::::::::::
# Compute descriptive statistics by groups
stable <- desc_statby(iris, measure.var = "Sepal.Length",
grps = "Species")
stable <- stable[, c("Species", "length", "mean", "sd")]
# Summary table plot, medium orange theme
stable.p <- ggtexttable(stable, rows = NULL,
theme = ttheme("mOrange"))
# Draw text
#::::::::::::::::::::::::::::::::::::::
text <- paste("iris data set gives the measurements in cm",
"of the variables sepal length and width",
"and petal length and width, respectively,",
"for 50 flowers from each of 3 species of iris.",
"The species are Iris setosa, versicolor, and virginica.", sep = " ")
text.p <- ggparagraph(text = text, face = "italic", size = 11, color = "black")
# Arrange the plots on the same page
ggarrange(density.p, stable.p, text.p,
ncol = 1, nrow = 3,
heights = c(1, 0.5, 0.3))
The function annotation_custom() [in ggplot2] can be used to add tables, plots or other grid-based elements within the plotting area of a ggplot. The xmin, xmax, ymin and ymax arguments are specified in data coordinates. The simplified format is:
annotation_custom(grob, xmin, xmax, ymin, ymax)
We’ll use the plots - density.p and stable.p - created in the previous section (@ref(mix-table-text-and-ggplot)).
density.p + annotation_custom(ggplotGrob(stable.p),
xmin = 5.5, ymin = 0.7,
xmax = 8)
As the inset box plot overlaps with some points, a transparent background is used for the box plots.
# Scatter plot colored by groups ("Species")
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
sp <- ggscatter(iris, x = "Sepal.Length", y = "Sepal.Width",
color = "Species", palette = "jco",
size = 3, alpha = 0.6)
# Create box plots of x/y variables
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# Box plot of the x variable
xbp <- ggboxplot(iris$Sepal.Length, width = 0.3, fill = "lightgray") +
rotate() +
theme_transparent()
# Box plot of the y variable
ybp <- ggboxplot(iris$Sepal.Width, width = 0.3, fill = "lightgray") +
theme_transparent()
# Create the external graphical objects
# called a "grop" in Grid terminology
xbp_grob <- ggplotGrob(xbp)
ybp_grob <- ggplotGrob(ybp)
# Place box plots inside the scatter plot
#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
xmin <- min(iris$Sepal.Length); xmax <- max(iris$Sepal.Length)
ymin <- min(iris$Sepal.Width); ymax <- max(iris$Sepal.Width)
yoffset <- (1/15)*ymax; xoffset <- (1/15)*xmax
# Insert xbp_grob inside the scatter plot
sp + annotation_custom(grob = xbp_grob, xmin = xmin, xmax = xmax,
ymin = ymin-yoffset, ymax = ymin+yoffset) +
# Insert ybp_grob inside the scatter plot
annotation_custom(grob = ybp_grob,
xmin = xmin-xoffset, xmax = xmin+xoffset,
ymin = ymin, ymax = ymax)
Import the background image. Use either the function readJPEG() [in the jpeg package] or the function readPNG() [in the png package], depending on the format of the background image.
To test the example below, make sure that the png package is installed. You can install it using the install.packages("png") R command.
# Import the image
img.file <- system.file(file.path("images", "background-image.png"),
package = "ggpubr")
img <- png::readPNG(img.file)
Combine a ggplot with the background image. R function: background_image() [in ggpubr].
library(ggplot2)
library(ggpubr)
ggplot(iris, aes(Species, Sepal.Length))+
background_image(img)+
geom_boxplot(aes(fill = Species), color = "white")+
fill_palette("jco")
Change box plot fill color transparency by specifying the argument alpha. Value should be in [0, 1], where 0 is full transparency and 1 is no transparency.
library(ggplot2)
library(ggpubr)
ggplot(iris, aes(Species, Sepal.Length))+
background_image(img)+
geom_boxplot(aes(fill = Species), color = "white", alpha = 0.5)+
fill_palette("jco")
Another example overlays a ggplot2 scatter plot on the France flag map:
mypngfile <- download.file("https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/France_Flag_Map.svg/612px-France_Flag_Map.svg.png",
destfile = "france.png", mode = 'wb')
img <- png::readPNG('france.png')
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
background_image(img)+
geom_point(aes(color = Species), alpha = 0.6, size = 5)+
color_palette("jco")+
theme(legend.position = "top")
If you have a long list of ggplots, say n = 20 plots, you may want to arrange the plots and place them on multiple pages. With 4 plots per page, you need 5 pages to hold the 20 plots.
The function ggarrange() [in ggpubr] provides a convenient solution to arrange multiple ggplots over multiple pages. After you specify the arguments nrow and ncol, ggarrange() automatically computes the number of pages required to hold the list of plots. It returns a list of arranged ggplots.
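The page count follows directly from the arithmetic; a small helper (hypothetical, not part of ggpubr) makes this explicit:

```r
# Number of pages needed for n plots at nrow x ncol plots per page
pages_needed <- function(n, nrow, ncol) {
  ceiling(n / (nrow * ncol))
}

pages_needed(20, 2, 2) # 20 plots, 4 per page -> 5 pages
pages_needed(4, 1, 2)  # 4 plots, 2 per page -> 2 pages
```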
For example, the following R code,
multi.page <- ggarrange(bxp, dp, bp, sp,
nrow = 1, ncol = 2)
returns a list of two pages with two plots per page. You can visualize each page as follows:
multi.page[[1]] # Visualize page 1
multi.page[[2]] # Visualize page 2
You can also export the arranged plots to a pdf file using the function ggexport() [in ggpubr]:
ggexport(multi.page, filename = "multi.page.ggplot2.pdf")
PDF file: Multi.page.ggplot2
Note that it’s also possible to use the function marrangeGrob() [in gridExtra] to create a multi-page output.
library(gridExtra)
res <- marrangeGrob(list(bxp, dp, bp, sp), nrow = 1, ncol = 2)
# Export to a pdf file
ggexport(res, filename = "multi.page.ggplot2.pdf")
# Visualize interactively
res
We’ll arrange the plots created in sections (@ref(mix-table-text-and-ggplot)) and (@ref(create-some-plots)).
p1 <- ggarrange(sp, bp + font("x.text", size = 9),
ncol = 1, nrow = 2)
p2 <- ggarrange(density.p, stable.p, text.p,
ncol = 1, nrow = 3,
heights = c(1, 0.5, 0.3))
ggarrange(p1, p2, ncol = 2, nrow = 1)
R function: ggexport() [in ggpubr].
First, create a list of 4 ggplots corresponding to the variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width in the iris data set.
plots <- ggboxplot(iris, x = "Species",
y = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
color = "Species", palette = "jco"
)
plots[[1]] # Print the first plot
plots[[2]] # Print the second plot, and so on...
Next, you can export individual plots to a file (pdf, eps or png), with one plot per page. It’s also possible to arrange the plots (two plots per page) when exporting them.
Export individual plots to a pdf file (one plot per page):
ggexport(plotlist = plots, filename = "test.pdf")
Arrange and export. Specify nrow and ncol to display multiple plots on the same page:
ggexport(plotlist = plots, filename = "test.pdf",
nrow = 2, ncol = 1)
We sincerely thank all developers for their efforts behind the packages that ggpubr depends on, namely:
This analysis has been performed using R software (ver. 3.3.2) and ggpubr (ver. 0.1.4.999).
This article describes how to easily create basic and ordered bar plots using ggplot2-based helper functions available in the ggpubr R package. We’ll also present some modern alternatives to bar plots, including lollipop charts and Cleveland’s dot plots.
Note that the approach to building a bar plot using standard ggplot2 verbs is described in our previous article: ggplot2 barplots : Quick start guide.
You might also be interested in the following articles:
Contents:
You need to install the R package ggpubr (version >= 0.1.3) to easily create ggplot2-based publication-ready plots.
Install from CRAN:
install.packages("ggpubr")
Or, install the latest development version from GitHub as follows:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
Load ggpubr:
library(ggpubr)
Create a demo data set:
df <- data.frame(dose=c("D0.5", "D1", "D2"),
len=c(4.2, 10, 29.5))
print(df)
dose len
1 D0.5 4.2
2 D1 10.0
3 D2 29.5
Basic bar plots:
# Basic bar plots with label
p <- ggbarplot(df, x = "dose", y = "len",
color = "black", fill = "lightgray")
p
# Rotate to create horizontal bar plots
p + rotate()
Change fill and outline colors by groups:
ggbarplot(df, x = "dose", y = "len",
fill = "dose", color = "dose", palette = "jco")
Create a demo data set:
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
dose=rep(c("D0.5", "D1", "D2"),2),
len=c(6.8, 15, 33, 4.2, 10, 29.5))
print(df2)
supp dose len
1 VC D0.5 6.8
2 VC D1 15.0
3 VC D2 33.0
4 OJ D0.5 4.2
5 OJ D1 10.0
6 OJ D2 29.5
Plot y = “len” by x = “dose” and change color by a second group: “supp”
# Stacked bar plots, add labels inside bars
ggbarplot(df2, x = "dose", y = "len",
fill = "supp", color = "supp",
palette = c("gray", "black"),
label = TRUE, lab.col = "white", lab.pos = "in")
# Change position: Interleaved (dodged) bar plot
ggbarplot(df2, x = "dose", y = "len",
fill = "supp", color = "supp",
palette = c("gray", "black"),
position = position_dodge(0.9))
Load and prepare data:
# Load data
data("mtcars")
dfm <- mtcars
# Convert the cyl variable to a factor
dfm$cyl <- as.factor(dfm$cyl)
# Add a name column
dfm$name <- rownames(dfm)
# Inspect the data
head(dfm[, c("name", "wt", "mpg", "cyl")])
name wt mpg cyl
Mazda RX4 Mazda RX4 2.620 21.0 6
Mazda RX4 Wag Mazda RX4 Wag 2.875 21.0 6
Datsun 710 Datsun 710 2.320 22.8 4
Hornet 4 Drive Hornet 4 Drive 3.215 21.4 6
Hornet Sportabout Hornet Sportabout 3.440 18.7 8
Valiant Valiant 3.460 18.1 6
Create ordered bar plots. Change the fill color by the grouping variable “cyl”. Sorting will be done globally, but not by groups.
ggbarplot(dfm, x = "name", y = "mpg",
fill = "cyl", # change fill color by cyl
color = "white", # Set bar border colors to white
palette = "jco", # jco journal color palett. see ?ggpar
sort.val = "desc", # Sort the value in dscending order
sort.by.groups = FALSE, # Don't sort inside each group
x.text.angle = 90 # Rotate vertically x axis texts
)
Sort bars inside each group. Use the argument sort.by.groups = TRUE.
ggbarplot(dfm, x = "name", y = "mpg",
fill = "cyl", # change fill color by cyl
color = "white", # Set bar border colors to white
palette = "jco", # jco journal color palett. see ?ggpar
sort.val = "asc", # Sort the value in dscending order
sort.by.groups = TRUE, # Sort inside each group
x.text.angle = 90 # Rotate vertically x axis texts
)
The deviation graph shows the deviation of quantitative values from a reference value. In the R code below, we’ll plot the mpg z-score from the mtcars data set.
Calculate the z-score of the mpg data:
# Calculate the z-score of the mpg data
dfm$mpg_z <- (dfm$mpg - mean(dfm$mpg))/sd(dfm$mpg)
dfm$mpg_grp <- factor(ifelse(dfm$mpg_z < 0, "low", "high"),
levels = c("low", "high"))
# Inspect the data
head(dfm[, c("name", "wt", "mpg", "mpg_z", "mpg_grp", "cyl")])
name wt mpg mpg_z mpg_grp cyl
Mazda RX4 Mazda RX4 2.620 21.0 0.1508848 high 6
Mazda RX4 Wag Mazda RX4 Wag 2.875 21.0 0.1508848 high 6
Datsun 710 Datsun 710 2.320 22.8 0.4495434 high 4
Hornet 4 Drive Hornet 4 Drive 3.215 21.4 0.2172534 high 6
Hornet Sportabout Hornet Sportabout 3.440 18.7 -0.2307345 low 8
Valiant Valiant 3.460 18.1 -0.3302874 low 6
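The manual z-score above is equivalent to base R’s scale() function, which centers and divides by the standard deviation by default; a quick check on a few of the mpg values:

```r
x <- c(21.0, 22.8, 18.7, 18.1)

# Manual z-score, as in the code above
z_manual <- (x - mean(x)) / sd(x)

# Base R equivalent
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale) # TRUE
```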
Create an ordered bar plot, colored according to the level of mpg:
ggbarplot(dfm, x = "name", y = "mpg_z",
fill = "mpg_grp", # change fill color by mpg_level
color = "white", # Set bar border colors to white
palette = "jco", # jco journal color palett. see ?ggpar
sort.val = "asc", # Sort the value in ascending order
sort.by.groups = FALSE, # Don't sort inside each group
x.text.angle = 90, # Rotate vertically x axis texts
ylab = "MPG z-score",
xlab = FALSE,
legend.title = "MPG Group"
)
Rotate the plot: use rotate = TRUE and sort.val = "desc".
ggbarplot(dfm, x = "name", y = "mpg_z",
fill = "mpg_grp", # change fill color by mpg_level
color = "white", # Set bar border colors to white
palette = "jco", # jco journal color palett. see ?ggpar
sort.val = "desc", # Sort the value in descending order
sort.by.groups = FALSE, # Don't sort inside each group
x.text.angle = 90, # Rotate vertically x axis texts
ylab = "MPG z-score",
legend.title = "MPG Group",
rotate = TRUE,
ggtheme = theme_minimal()
)
The lollipop chart is an alternative to the bar plot when you have a large number of values to visualize.
Lollipop chart colored by the grouping variable “cyl”:
ggdotchart(dfm, x = "name", y = "mpg",
color = "cyl", # Color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
sorting = "ascending", # Sort value in descending order
add = "segments", # Add segments from y = 0 to dots
ggtheme = theme_pubr() # ggplot2 theme
)
ggdotchart(dfm, x = "name", y = "mpg",
color = "cyl", # Color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
sorting = "descending", # Sort value in descending order
add = "segments", # Add segments from y = 0 to dots
rotate = TRUE, # Rotate vertically
group = "cyl", # Order by groups
dot.size = 6, # Large dot size
label = round(dfm$mpg), # Add mpg values as dot labels
font.label = list(color = "white", size = 9,
vjust = 0.5), # Adjust label parameters
ggtheme = theme_pubr() # ggplot2 theme
)
Deviation graph:
ggdotchart(dfm, x = "name", y = "mpg_z",
color = "cyl", # Color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
sorting = "descending", # Sort value in descending order
add = "segments", # Add segments from y = 0 to dots
add.params = list(color = "lightgray", size = 2), # Change segment color and size
group = "cyl", # Order by groups
dot.size = 6, # Large dot size
label = round(dfm$mpg_z, 1), # Add mpg z-scores as dot labels
font.label = list(color = "white", size = 9,
vjust = 0.5), # Adjust label parameters
ggtheme = theme_pubr() # ggplot2 theme
)+
geom_hline(yintercept = 0, linetype = 2, color = "lightgray")
Color y text by groups. Use y.text.col = TRUE.
ggdotchart(dfm, x = "name", y = "mpg",
color = "cyl", # Color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
sorting = "descending", # Sort value in descending order
rotate = TRUE, # Rotate vertically
dot.size = 2, # Small dot size
y.text.col = TRUE, # Color y text by groups
ggtheme = theme_pubr() # ggplot2 theme
)+
theme_cleveland() # Add dashed grids
This analysis has been performed using R software (ver. 3.3.2) and ggpubr (ver. 0.1.4).