This chapter describes two key R packages for creating interactive network graphs. These packages include:
You’ll learn how to:
Contents:
We’ll use the phone.call2
data [in the navdata
R package], which is a list containing the nodes and the edges list prepared in the chapter @ref(network-visualization-essentials) from the phone.call
data.
Start by loading the tidyverse R package and the phone.call2 demo data sets:
library(tidyverse)
library("navdata")
data("phone.call2")
nodes <- phone.call2$nodes
edges <- phone.call2$edges
Can be used to easily create an interactive sankey diagram, as well as, other network layout such as dendrogram, radial and diagnonal networks.
Key R functions:
forceNetwork()
. Creates a D3 JavaScript force directed network graph
forceNetwork(Links, Nodes, Source, Target,
Value, NodeID, Nodesize, Group)
Key Arguments:
Links
: edges list. Edge IDs should start with 0Nodes
: Nodes list. Node IDs should start with 0Source
, Target
: the names of the column, in the edges data, containing the network source and target variables, respectively.Value
: the name of the column, in the edge data, containing the weight values for edges. Used to indicate how wide the links are.NodeID
: the name of the column, in the nodes data, containing the node IDs. Used for labeling the nodes.Nodesize
: the name of the column, in the nodes data, with some value to vary the node radius’s with.Group
: the name of the column, in the nodes data, specifying the group of each node.As specified above, the IDs in nodes and edges lists should be numeric values starting with 0. This can be easily done by substracting 1 from the existing IDs in the two data frames.
nodes_d3 <- mutate(nodes, id = id - 1)
edges_d3 <- mutate(edges, from = from - 1, to = to - 1)
library(networkD3)
forceNetwork(
Links = edges_d3, Nodes = nodes_d3,
Source = "from", Target = "to", # so the network is directed.
NodeID = "label", Group = "id", Value = "weight",
opacity = 1, fontSize = 16, zoom = TRUE
)
Note that, a color is attributed to each group. Here, as we specified the column “id” as the node Group value, we have different colors for each individual nodes.
You can create a d3-styled sankey diagram. A Sankey diagram is a good fit for the phone call data. There are not too many nodes in the data, making it easier to visualize the flow of phone calls.
Create a sankey diagram:
sankeyNetwork(
Links = edges_d3, Nodes = nodes_d3,
Source = "from", Target = "to",
NodeID = "label", Value = "weight",
fontSize = 16, unit = "Letter(s)")
Other hierarchical layouts exist in the network3D package to visualize tree-like graphs. In the example below, we start by computing hierarchical clustering using a sample of the USArrests data set:
set.seed(123)
hc <- USArrests %>% sample_n(15) %>%
scale() %>% dist() %>%
hclust(method = "complete")
dendroNetwork(hc, fontSize = 15)
Other alternatives are:
radialNetwork(
as.radialNetwork(hc), fontSize = 15
)
diagonalNetwork(
as.radialNetwork(hc), fontSize = 15
)
igraph
package.rpart
package.igraph
layoutsKey R function:
visNetwork(
nodes = NULL, edges = NULL,
width = NULL, height = NULL,
main = NULL, submain = NULL, footer = NULL
)
Key Arguments:
nodes
: nodes list information. Should contain at least the column “id”. See visNodes()
for more options to control nodes. Other colums can be included in the data, such as:
visGroups()
.edges
: edges list information. Required at least columns “from” and “to”. See visEdges()
for more options to control edges.
Note that, the function plots the labels for the nodes, using the “label” column in the node list.
You can move the nodes and the graph will use an algorithm to keep the nodes properly spaced. You can also zoom in and out on the plot and move it around to re-center it.
To have always the same network, you can use the function visLayout(randomSeed = 12)
:
library("visNetwork")
visNetwork(nodes, edges) %>%
visLayout(randomSeed = 12)
Note that,
visNetwork
can use igraph
layouts, which include a large variety of possible layouts.
visIgraph()
to directly visualize an igraph network object.
If you want to control the width of edges according to a variable, you should include the column “width” in the edge list data. You should manually calculate and scale the edge width.
In the following R code, we’ll customize the visNetwork()
output by using an igraph
layout and changing the edges width.
First add the column width in the edges list data frame. Set the minimum width to 1:
edges <- mutate(edges, width = 1 + weight/5)
Create the network graph with the variable edge widths and the igraph layout = “layout_with_fr”.
visNetwork(nodes, edges) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visEdges(arrows = "middle") %>%
visLayout(randomSeed = 1234)
As mentioned above, you can visualize classification and regression trees generated using the rpart
package.
Key function: visTree()
[in visNetwork version >= 2.0.0
].
For example, to visualize a classification tree, type the following R code:
# Compute
library(rpart)
res <- rpart(Species~., data=iris)
# Visualize
visTree(res, main = "Iris classification Tree",
width = "80%", height = "400px")
Allaire, J.J., Christopher Gandrud, Kenton Russell, and CJ Yetman. 2017. NetworkD3: D3 Javascript Network Graphs from R. https://CRAN.R-project.org/package=networkD3.
Almende B.V., Benoit Thieurmel, and Titouan Robert. 2017. VisNetwork: Network Visualization Using ’Vis.js’ Library. https://CRAN.R-project.org/package=visNetwork.
This chapter describes how to manipulate and analyze a network graph in R using the tidygraph
package.
The tidygraph
package provides a tidy framework to easily manipulate different types of relational data, including: graph, network and trees.
In the tidygraph framework, network data are considered as two tidy data tables, one describing the node data and the other is for edge data. The package provides a simple solution to switch between the two tables and provides dplyr
verbs for manipulating them.
You will learn methods for detecting important or central entities in a network graph. We’ll also introduce how to detect community (or cluster) in a network.
Contents:
tidyverse
for general data manipulation and visualization.tidygraph
for manipulating and analyzing network graphs.ggraph
for visualizing network objects created using the tidygraph package.library(tidyverse)
library(tidygraph)
library(ggraph)
Key R functions:
tbl_graph()
. Creates a network object from nodes and edges dataas_tbl_graph()
. Converts network data and objects to a tbl_graph network.Demo data set: phone.call2
data [in the navdata
R package], which is a list containing the nodes and the edges list prepared in the chapter @ref(network-visualization-essentials).
library("navdata")
data("phone.call2")
phone.net <- tbl_graph(
nodes = phone.call2$nodes,
edges = phone.call2$edges,
directed = TRUE
)
ggraph(phone.net, layout = "graphopt") +
geom_edge_link(width = 1, colour = "lightgray") +
geom_node_point(size = 4, colour = "#00AFBB") +
geom_node_text(aes(label = label), repel = TRUE)+
theme_graph()
One can also use the as_tbl_graph()
function to converts the following data structure and network objects:
data.frame
, list
and matrix
data [R base]igraph
network objects [igraph package]network
network objects [network pakage]dendrogram
and hclust
[stats package]Node
[data.tree package]phylo
and evonet
[ape package]graphNEL
, graphAM
, graphBAM
[graph package (in Bioconductor)]In the following example, we’ll create a correlation matrix network graph. The mtcars
data set will be used.
Compute the correlation matrix between cars using the corrr
package:
correlate()
shave()
library(corrr)
res.cor <- mtcars [, c(1, 3:6)] %>% # (1)
t() %>% correlate() %>% # (2)
shave(upper = TRUE) %>% # (3)
stretch(na.rm = TRUE) %>% # (4)
filter(r >= 0.998) # (5)
res.cor
## # A tibble: 59 x 3
## x y r
##
## 1 Mazda RX4 Mazda RX4 Wag 1.000
## 2 Mazda RX4 Merc 230 1.000
## 3 Mazda RX4 Merc 280 0.999
## 4 Mazda RX4 Merc 280C 0.999
## 5 Mazda RX4 Merc 450SL 0.998
## 6 Mazda RX4 Wag Merc 230 1.000
## # ... with 53 more rows
Create the correlation network graph:
set.seed(1)
cor.graph <- as_tbl_graph(res.cor, directed = FALSE)
ggraph(cor.graph) +
geom_edge_link() +
geom_node_point() +
geom_node_text(
aes(label = name), size = 3, repel = TRUE
) +
theme_graph()
cor.graph
## # A tbl_graph: 24 nodes and 59 edges
## #
## # An undirected simple graph with 3 components
## #
## # Node Data: 24 x 1 (active)
## name
##
## 1 Mazda RX4
## 2 Mazda RX4 Wag
## 3 Datsun 710
## 4 Hornet 4 Drive
## 5 Hornet Sportabout
## 6 Valiant
## # ... with 18 more rows
## #
## # Edge Data: 59 x 3
## from to r
##
## 1 1 2 1.000
## 2 1 20 1.000
## 3 1 8 0.999
## # ... with 56 more rows
The output shows:
The notion of an active tibble within a tbl_graph object makes it possible to manipulate the data in one tibble at a time. The nodes tibble is activated by default, but you can change which tibble is active with the activate()
function.
If you want to rearrange the rows in the edges tibble to list those with the highest “r” first, you could use activate()
and then arrange()
. For example, type the following R code:
cor.graph %>%
activate(edges) %>%
arrange(desc(r))
Note that, to extract the current active data as a tibble, you can use the function as_tibble(cor.graph)
.
With the tidygraph
package, you can easily manipulate the nodes and the edges data in the network graph object using dplyr
verbs. For example, you can add new columns or rename columns in the nodes/edges data.
You can also filter and arrange the data. Note that, applying filter()/slice()
on node data will remove the edges terminating at the removed nodes.
In this section we’ll manipulate the correlation network graph.
You can use the dplyr
verbs as follow:
# Car groups info
cars.group <- data_frame(
name = rownames(mtcars),
cyl = as.factor(mtcars$cyl)
)
# Modify the nodes data
cor.graph <- cor.graph %>%
activate(nodes) %>%
left_join(cars.group, by = "name") %>%
rename(label = name)
cor.graph <- cor.graph %>%
activate(edges) %>%
rename(weight = r)
cor.graph
## # A tbl_graph: 24 nodes and 59 edges
## #
## # An undirected simple graph with 3 components
## #
## # Edge Data: 59 x 3 (active)
## from to weight
##
## 1 1 2 1.000
## 2 1 20 1.000
## 3 1 8 0.999
## 4 1 9 0.999
## 5 1 11 0.998
## 6 2 20 1.000
## # ... with 53 more rows
## #
## # Node Data: 24 x 2
## label cyl
##
## 1 Mazda RX4 6
## 2 Mazda RX4 Wag 6
## 3 Datsun 710 4
## # ... with 21 more rows
set.seed(1)
ggraph(cor.graph) +
geom_edge_link(aes(width = weight), alpha = 0.2) +
scale_edge_width(range = c(0.2, 1)) +
geom_node_point(aes(color = cyl), size = 2) +
geom_node_text(aes(label = label), size = 3, repel = TRUE) +
theme_graph()
In this sections, we described methods for detecting important or central entities in a network graph. We’ll also introduce how to detect community (or cluster) in a network.
Centrality is an important concept when analyzing network graph. The centrality of a node / edge measures how central (or important) is a node or edge in the network.
We consider an entity important, if he has connections to many other entities. Centrality describes the number of edges that are connected to nodes.
There many types of scores that determine centrality. One of the famous ones is the pagerank algorithm that was powering Google Search in the beginning.
Examples of common approaches of measuring centrality include:
betweenness centrality. The betweenness centrality for each nodes is the number of the shortest paths that pass through the nodes.
closeness centrality. Closeness centrality measures how many steps is required to access every other nodes from a given nodes. It describes the distance of a node to all other nodes. The more central a node is, the closer it is to all other nodes.
eigenvector centrality. A node is important if it is linked to by other important nodes. The centrality of each node is proportional to the sum of the centralities of those nodes to which it is connected. In general, nodes with high eigenvector centralities are those which are linked to many other nodes which are, in turn, connected to many others (and so on).
Hub and authority centarlities are generalization of eigenvector centrality. A high hub node points to many good authorities and a high authority node receives from many good hubs.
The tidygraph
package contains more than 10 centrality measures, prefixed with the term centrality_
. These measures include:
centrality_authority()
centrality_betweenness()
centrality_closeness()
centrality_hub()
centrality_pagerank()
centrality_eigen()
centrality_edge_betweenness()
All of these centrality functions returns a numeric vector matching the nodes (or edges in the case of `centrality_edge_betweenness()).
In the following examples, we’ll use the phone call network graph. We’ll change the color and the size of nodes according to their values of centrality.
set.seed(123)
phone.net %>%
activate(nodes) %>%
mutate(centrality = centrality_authority()) %>%
ggraph(layout = "graphopt") +
geom_edge_link(width = 1, colour = "lightgray") +
geom_node_point(aes(size = centrality, colour = centrality)) +
geom_node_text(aes(label = label), repel = TRUE)+
scale_color_gradient(low = "yellow", high = "red")+
theme_graph()
For a given problem at hand, you can test the different centrality score to decide which centrality measure makes most sense for your specific question.
Clustering is a common operation in network analysis and it consists of grouping nodes based on the graph topology.
It’s sometimes referred to as community detection based on its commonality in social network analysis.
Many clustering algorithms from are available in the tidygraph
package and prefixed with the term group_
. These include:
group_infomap()
group_edge_betweenness()
.In the following example, we’ll use the correlation network graphs to detect clusters or communities:
set.seed(123)
cor.graph %>%
activate(nodes) %>%
mutate(community = as.factor(group_infomap())) %>%
ggraph(layout = "graphopt") +
geom_edge_link(width = 1, colour = "lightgray") +
geom_node_point(aes(colour = community), size = 4) +
geom_node_text(aes(label = label), repel = TRUE)+
theme_graph()
Three communities are detected.
Network Analysis is used to investigate and visualize the inter-relationship between entities (individuals, things).
Examples of network structures, include: social media networks, friendship networks, collaboration networks and disease transmission.
Network and graph theory are extensively used across different fields, such as in biology (pathway analysis and protein-protein interaction visualization), finance, social sciences, economics, communication, history, computer science, etc.
In this chapter, you’ll learn:
Contents:
Network graphs are characterized by two key terms: nodes and edges
nodes: The entities (individual actors, people, or things) to be connected in the network. Synonyms: vertices of a graph.
edges: The connections (interactions or relationships) between the entities. Synonyms: links, ties.
adjacency matrix: a square matrix in which the column and row names are the nodes of the network. This is a standard data format accepted by many network analysis packages in R. Synonyms: sociomatrices. Within the matrix a 1 specifies that there is a link between the nodes, and a 0 indicates no link.
edge list: a data frame containing at least two columns: one column of nodes corresponding to the source of a connection and another column of nodes that contains the target of the connection. The nodes in the data are identified by unique IDs.
Node list: a data frame with a single column listing the node IDs found in the edge list. You can also add attribute columns to the data frame such as the names of the nodes or grouping variables.
Weighted network graph: An edge list can also contain additional columns describing attributes of the edges such as a magnitude aspect for an edge. If the edges have a magnitude attribute the graph is considered weighted.
Directed and undirected network graph:
If the distinction between source and target is meaningful, the network is directed. Directed edges represent an ordering of nodes, like a relationship extending from one nodes to another, where switching the direction would change the structure of the network. The World Wide Web is an example of a directed network because hyperlinks connect one Web page to another, but not necessarily the other way around (Tyner, Briatte, and Hofmann 2017).
If the distinction is not meaningful, the network is undirected. Undirected edges are simply links between nodes where order does not matter. Co-authorship networks represent examples of undirected networks, where nodes are authors and they are connected by an edge if they have written a publication together (Tyner, Briatte, and Hofmann 2017).
Another example: When people send e-mail to each other, the distinction between the sender (source) and the recipient (target) is clearly meaningful, therefore the network is directed.
navdata
: contains data sets required for this booktidyverse
: for general data manipulationigraph
, tidygraph
and ggraph
: for network visualizationnavdata
R package:if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/navdata")
install.packages(
c("tidyverse", "igraph", "tidygraph", "ggraph")
)
We’ll use a fake demo data set containing the number of phone calls between the president of some EU countries.
library("navdata")
data("phone.call")
head(phone.call, 3)
## # A tibble: 3 x 3
## source destination n.call
##
## 1 France Germany 9
## 2 Belgium France 4
## 3 France Spain 3
Nodes are countries in the source and destination columns. The values, in the column n.call
, will be used as edges weight.
To visualize the network graph, we need to create two data frames from the demo data sets:
In the following sections, we start by creating nodes and edges lists. Next, we’ll use the different packages to create network graphs.
First, load the tidyverse
R package for data manipulation:
library(tidyverse)
Then, compute the following key steps to create nodes list:
label
# Get distinct source names
sources <- phone.call %>%
distinct(source) %>%
rename(label = source)
# Get distinct destination names
destinations <- phone.call %>%
distinct(destination) %>%
rename(label = destination)
# Join the two data to create node
# Add unique ID for each country
nodes <- full_join(sources, destinations, by = "label")
nodes <- nodes %>%
mutate(id = 1:nrow(nodes)) %>%
select(id, everything())
head(nodes, 3)
## # A tibble: 3 x 2
## id label
##
## 1 1 France
## 2 2 Belgium
## 3 3 Germany
Key steps:
# Rename the n.call column to weight
phone.call <- phone.call %>%
rename(weight = n.call)
# (a) Join nodes id for source column
edges <- phone.call %>%
left_join(nodes, by = c("source" = "label")) %>%
rename(from = id)
# (b) Join nodes id for destination column
edges <- edges %>%
left_join(nodes, by = c("destination" = "label")) %>%
rename(to = id)
# (c) Select/keep only the columns from and to
edges <- select(edges, from, to, weight)
head(edges, 3)
## # A tibble: 3 x 3
## from to weight
##
## 1 1 3 9
## 2 2 1 4
## 3 1 8 3
There are many tools and software to analyse and visualize network graphs. However, for a reproducible and automatized research you need a programming environment such as in R software.
In this section, we review major R packages for reproducible network analysis and visualization.
We’ll introduce how to create static network graphs using igraph
(file. 2017) and tidygraph
(Pedersen 2017b) + ggraph
(Pedersen 2017a) packages.
Note that, igraph
packages uses the R base plotting system. The ggraph
package is based on ggplot2 plotting system, which is highly flexible.
If you are new to network analysis in R, we highly recommend to learn the tidygraph
and the ggraph
package for the analysis and the visualization, respectively.
Note that, each time that you create a network graph, you need to set the random generator to always have the same layout. For example, you can type type this: set.seed(123)
Key R function: graph_from_data_frame()
.
d
: edge listvertices
: node listdirected
: can be either TRUE or FALSE depending on whether the data is directed or undirected.library(igraph)
net.igraph <- graph_from_data_frame(
d = edges, vertices = nodes,
directed = TRUE
)
set.seed(123)
plot(net.igraph, edge.arrow.size = 0.2,
layout = layout_with_graphopt)
See the documentation by typing ?plot.igraph
, for more options to customize the plot.
tidygraph
and ggraph
are modern R packages for network data manipulation (tidygraph
) and visualization (ggraph
). They leverage the power of igraph.
tbl_graph()
.nodes
, edges
and directed
.library(tidygraph)
net.tidy <- tbl_graph(
nodes = nodes, edges = edges, directed = TRUE
)
Key functions:
geom_node_point()
: Draws node points.
geom_edge_link()
: Draws edge links. To control the width of edge line according to the weight variable, specify the option aes(width = weight)
, where, the weight specify the number of phone.call sent along each route. In this case, you can control the maximum and minimum width of the edges, by using the function scale_edge_width()
to set the range (minimum and maximum width value). For example: scale_edge_width(range = c(0.2, 2)).
geom_node_text()
: Adds text labels for nodes, by specifying the argument aes(label = label)
. To avoid text overlapping, indicate the option repel = TRUE
.
labs()
: Change main titles, axis labels and legend titles.
Create a classic node-edge diagrams. Possible values for the argument layout
include: 'star', 'circle', 'gem', 'dh', 'graphopt', 'grid', 'mds', 'randomly', 'fr', 'kk', 'drl', 'lgl'
.
library(ggraph)
ggraph(net.tidy, layout = "graphopt") +
geom_node_point() +
geom_edge_link(aes(width = weight), alpha = 0.8) +
scale_edge_width(range = c(0.2, 2)) +
geom_node_text(aes(label = label), repel = TRUE) +
labs(edge_width = "phone.call") +
theme_graph()
Layout defines the placement of nodes and edges in a given graph structure. There are different types of possible layouts (https://www.data-imaginist.com/2017/ggraph-introduction-layouts/). You should use the layout that suit the best your graph data structure.
In this section, we’ll describe some of the layouts, including:
linear
: Arranges the nodes linearly or circularly in order to make an arc diagram.treemap
: Creates a treemap from the graph, that is, a space-filing subdivision of rectangles showing a weighted hierarchy.In the following example, we’ll:
layout = "linear"
.In the following arc diagram, the edges above the horizontal line move from left to right, while the edges below the line move from right to left.
# Arc diagram
ggraph(net.tidy, layout = "linear") +
geom_edge_arc(aes(width = weight), alpha = 0.8) +
scale_edge_width(range = c(0.2, 2)) +
geom_node_text(aes(label = label), repel = TRUE) +
labs(edge_width = "Number of calls") +
theme_graph()+
theme(legend.position = "top")
# Coord diagram, circular
ggraph(net.tidy, layout = "linear", circular = TRUE) +
geom_edge_arc(aes(width = weight), alpha = 0.8) +
scale_edge_width(range = c(0.2, 2)) +
geom_node_text(aes(label = label), repel = TRUE) +
labs(edge_width = "Number of calls") +
theme_graph()+
theme(legend.position = "top")
A treemap is a visual method for displaying hierarchical data that uses nested rectangles to represent the branches of a tree diagram. Each rectangles has an area proportional to the amount of data it represents.
To illustrate this layout, we’ll use the france.trade
demo data set [ in navdata
package]. It contains the trading percentage between France and different countries.
data("france.trade")
france.trade
## # A tibble: 251 x 3
## source destination trade.percentage
##
## 1 France Aruba 0.035
## 2 France Afghanistan 0.035
## 3 France Angola 0.035
## 4 France Anguilla 0.035
## 5 France Albania 0.035
## 6 France Finland 0.035
## # ... with 245 more rows
Nodes are the distinct countries in the source and the destination columns.
# Distinct countries
countries <- c(
france.trade$source, france.trade$destination
) %>%
unique()
# Create nodes list
nodes <- data_frame(
id = 1:length(countries),
label = countries
)
nodes <- nodes %>%
left_join(
france.trade[, c("destination", "trade.percentage")],
by = c("label" = "destination" )
) %>%
mutate(
trade.percentage = ifelse(
is.na(trade.percentage), 0, trade.percentage
)
)
head(nodes, 3)
## # A tibble: 3 x 3
## id label trade.percentage
##
## 1 1 France 0.000
## 2 2 Aruba 0.035
## 3 3 Afghanistan 0.035
per_route <- france.trade %>%
select(source, destination)
# (a) Join nodes id for source column
edges <- per_route %>%
left_join(nodes, by = c("source" = "label")) %>%
rename(from = id)
# (b) Join nodes id for destination column
edges <- edges %>%
left_join(nodes, by = c("destination" = "label")) %>%
rename(to = id)
# (c) Select/keep only the columns from and to
edges <- select(edges, from, to)
head(edges, 3)
## # A tibble: 3 x 2
## from to
##
## 1 1 2
## 2 1 3
## 3 1 4
trade.graph <- tbl_graph(
nodes = nodes, edges = edges, directed = TRUE
)
# Generate colors for each country
require(ggpubr)
cols <- get_palette("Dark2", nrow(france.trade)+1)
# Visualize
set.seed(123)
ggraph(trade.graph, 'treemap', weight = "trade.percentage") +
geom_node_tile(aes(fill = label), size = 0.25, color = "white")+
geom_node_text(
aes(label = paste(label, trade.percentage, sep = "\n"),
size = trade.percentage), color = "white"
)+
scale_fill_manual(values = cols)+
scale_size(range = c(0, 6) )+
theme_void()+
theme(legend.position = "none")
require("map")
# (1)
world_map <- map_data("world")
france.trade <- france.trade %>%
left_join(world_map, by = c("destination" = "region"))
# (2)
ggplot(france.trade, aes(long, lat, group = group))+
geom_polygon(aes(fill = trade.percentage ), color = "white")+
geom_polygon(data = subset(world_map, region == "France"),
fill = "red")+ # draw france in red
scale_fill_gradientn(colours =c("lightgray", "yellow", "green"))+
theme_void()
Dendrogram layout are suited for hierarchical graph visualization, that is graph structures including trees and hierarchies.
In this section, we’ll compute hierarchical clustering using the USArrests data set. The output is visualized as a dendrogram tree.
USArrests
data set;res.hclust <- scale(USArrests) %>%
dist() %>% hclust()
res.tree <- as.dendrogram(res.hclust)
geom_edge_diagonal()
and geom_edge_elbow()
# Diagonal layout
ggraph(res.tree, layout = "dendrogram") +
geom_edge_diagonal() +
geom_node_text(
aes(label = label), angle = 90, hjust = 1,
size = 3
)+
ylim(-1.5, NA)+
theme_minimal()
# Elbow layout
ggraph(res.tree, layout = "dendrogram") +
geom_edge_elbow() +
geom_node_text(
aes(label = label), angle = 90, hjust = 1,
size = 3
)+
ylim(-1.5, NA)+theme_minimal()
file., See AUTHORS. 2017. Igraph: Network Analysis and Visualization. https://CRAN.R-project.org/package=igraph.
Pedersen, Thomas Lin. 2017a. Ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. https://CRAN.R-project.org/package=ggraph.
———. 2017b. Tidygraph: A Tidy Api for Graph Manipulation. https://CRAN.R-project.org/package=tidygraph.
Tyner, Sam, François Briatte, and Heike Hofmann. 2017. “Network Visualization with ggplot2.” The R Journal 9 (1): 27–59. https://journal.r-project.org/archive/2017/RJ-2017-023/index.html.
R is a free and powerful statistical software for analyzing and visualizing data.
In this chapter, you’ll learn:
Contents:
RStudio is an integrated development environment for R that makes using R easier. R and RStudio can be installed on Windows, MAC OSX and Linux platforms.
An R package is a collection of functionalities that extends the capabilities of base R. To use the R code provide in this book, you should install the following R packages:
tidyverse
packages, which are a collection of R packages that share the same programming philosophy. These packages include:
readr
: for importing data into Rdplyr
: for data manipulationggplot2
and ggpubr
for data visualization.ggpubr
package, which makes it easy, for beginner, to create publication ready plots.install.packages("tidyverse")
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
install.packages("ggpubr")
library()
is used for this task. An alternative function is require()
. For example, to load ggplot2 and ggpubr packages, type this:library("ggplot2")
library("ggpubr")
Now, we can use R functions, such as ggscatter() [in the ggpubr package] for creating a scatter plot.
If you want to learn more about a given function, say ggscatter(), type this in R console: ?ggscatter
.
Your data should be in rectangular format, where columns are variables and rows are observations (individuals or samples).
Column names should be compatible with R naming conventions. Avoid column with blank space and special characters. Good column names: long_jump
or long.jump
. Bad column name: long jump
.
Avoid beginning column names with a number. Use letter instead. Good column names: sport_100m
or x100m
. Bad column name: 100m
.
Replace missing values by NA
(for not available)
For example, your data should look like this:
manufacturer model displ year cyl trans drv
1 audi a4 1.8 1999 4 auto(l5) f
2 audi a4 1.8 1999 4 manual(m5) f
3 audi a4 2.0 2008 4 manual(m6) f
4 audi a4 2.0 2008 4 auto(av) f
Read more at: Best Practices in Preparing Data Files for Importing into R
First, save your data into txt or csv file formats and import it as follow (you will be asked to choose the file):
library("readr")
# Reads tab delimited files (.txt tab)
my_data <- read_tsv(file.choose())
# Reads comma (,) delimited files (.csv)
my_data <- read_csv(file.choose())
# Reads semicolon(;) separated files(.csv)
my_data <- read_csv2(file.choose())
Read more about how to import data into R at this link: http://www.sthda.com/english/wiki/importing-data-into-r
R comes with several demo data sets for playing with R functions. The most used R demo data sets include: USArrests, iris and mtcars. To load a demo data set, use the function data() as follow. The function head()
is used to inspect the data.
data("iris") # Loading
head(iris, n = 3) # Print the first n = 3 rows
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
To learn more about iris data sets, type this:
?iris
After typing the above R code, you will see the description of iris
data set: this iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
After importing your data in R, you can easily manipulate it using the dplyr
package (Wickham et al. 2017), which can be installed using the R code: install.packages("dplyr")
.
After loading dplyr, you can use the following R functions:
filter()
: Pick rows (observations/samples) based on their values.distinct()
: Remove duplicate rows.arrange()
: Reorder the rows.select()
: Select columns (variables) by their names.rename()
: Rename columns.mutate()
: Add/create new variables.summarise()
: Compute statistical summaries (e.g., computing the mean or the sum)group_by()
: Operate on subsets of the data set.Note that, dplyr package allows to use the forward-pipe chaining operator (%>%) for combining multiple operations. For example, x %>% f is equivalent to f(x). Using the pipe (%>%), the output of each operation is passed to the next operation. This makes R programming easy.
We’ll show you how these functions work in the different chapters of this book.
There are different graphic packages available in R for visualizing your data: 1) R base graphs, 2) Lattice Graphs (Sarkar 2016) and 3) ggplot2 (Wickham and Chang 2017).
In this section, we start by providing a quick overview of R base and lattice plots, and then we move to ggplot2 graphic system. The vast majority of plots generated in this book is based on the modern and flexible ggplot2 R package.
R comes with simple functions to create many types of graphs. For example:
Plot Types | R base function |
---|---|
Scatter plot | plot() |
Scatter plot matrix | pairs() |
Box plot | boxplot() |
Strip chart | stripchart() |
Histogram plot | hist() |
density plot | density() |
Bar plot | barplot() |
Line plot | plot() and line() |
Pie charts | pie() |
Dot charts | dotchart() |
Add text to a plot | text() |
In the most cases, you can use the following arguments to customize the plot:
pch
: change point shapes. Allowed values comprise number from 1 to 25.cex
: change point size. Example: cex = 0.8
.col
: change point color. Example: col = “blue”.frame
: logical value. frame = FALSE
removes the plot panel border frame.main
, xlab
, ylab
. Specify the main title and the x/y axis labels -, respectivelylas
: For a vertical x axis text, use las = 2
.In the following R code, we’ll use the iris data set to create a:
# (1) Create a scatter lot
plot(
x = iris$Sepal.Length, y = iris$Sepal.Width,
pch = 19, cex = 0.8, frame = FALSE,
xlab = "Sepal Length",ylab = "Sepal Width"
)
# (2) Create a box plot
boxplot(Sepal.Length ~ Species, data = iris,
ylab = "Sepal.Length",
frame = FALSE, col = "lightgray")
Read more examples at: R base Graphics on STHDA, http://www.sthda.com/english/wiki/r-base-graphs
The lattice R package provides a plotting system that aims to improve on R base graphs. After installing the package, whith the R command install.packages("lattice")
, you can test the following functions.
Plot types | Lattice functions |
---|---|
Scatter plot | xyplot() |
Scatter plot matrix | splom() |
3D scatter plot | cloud() |
Box plot | bwplot() |
strip plots (1-D scatter plots) | stripplot() |
Dot plot | dotplot() |
Bar chart | barchart() |
Histogram | histogram() |
Density plot | densityplot() |
Theoretical quantile plot | qqmath() |
Two-sample quantile plot | qq() |
3D contour plot of surfaces | contourplot() |
False color level plot of surfaces | levelplot() |
Parallel coordinates plot | parallel() |
3D wireframe graph | wireframe() |
The lattice package uses formula interface. For example, in lattice terminology, the formula y ~ x | group, means that we want to plot the y variable according to the x variable, splitting the plot into multiple panels by the variable group.
y ~ x
. Change the color by groups and use auto.key = TRUE
to show legends:library("lattice")
xyplot(
Sepal.Length ~ Petal.Length, group = Species,
data = iris, auto.key = TRUE, pch = 19, cex = 0.5
)
y ~ x | group
.xyplot(
Sepal.Length ~ Petal.Length | Species,
layout = c(3, 1), # panel with ncol = 3 and nrow = 1
group = Species, data = iris,
type = c("p", "smooth"), # Show points and smoothed line
scales = "free" # Make panels axis scales independent
)
Read more examples at: Lattice Graphics on STHDA
GGPlot2 is a powerful and a flexible R package, implemented by Hadley Wickham, for producing elegant graphics piece by piece. The gg in ggplot2 means Grammar of Graphics, a graphic concept which describes plots by using a “grammar”. According to the ggplot2 concept, a plot can be divided into different fundamental parts: Plot = data + Aesthetics + Geometry
The ggplot2 syntax might seem opaque for beginners, but once you understand the basics, you can create and customize any kind of plots you want.
Note that, to reduce this opacity, we recently created an R package, named ggpubr (ggplot2 Based Publication Ready Plots), for making ggplot simpler for students and researchers with non-advanced programming backgrounds. We’ll present ggpubr in the next section.
After installing and loading the ggplot2 package, you can use the following key functions:
Plot types | GGPlot2 functions |
---|---|
Initialize a ggplot | ggplot() |
Scatter plot | geom_point() |
Box plot | geom_boxplot() |
Violin plot | geom_violin() |
strip chart | geom_jitter() |
Dot plot | geom_dotplot() |
Bar chart | geom_bar() |
Line plot | geom_line() |
Histogram | geom_histogram() |
Density plot | geom_density() |
Error bars | geom_errorbar() |
QQ plot | stat_qq() |
ECDF plot | stat_ecdf() |
Title and axis labels | labs() |
The main function in the ggplot2 package is ggplot()
, which can be used to initialize the plotting system with data and x/y variables.
For example, the following R code takes the iris
data set to initialize the ggplot and then a layer (geom_point()
) is added onto the ggplot to create a scatter plot of x = Sepal.Length
by y = Sepal.Width
:
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))+
geom_point()
# Change point size, color and shape
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))+
geom_point(size = 1.2, color = "steelblue", shape = 21)
Note that, in the code above, the shape of points is specified as number. To display the different point shape available in R, type this:
ggpubr::show_point_shapes()
It’s also possible to control points shape and color by a grouping variable (here, Species
). For example, in the code below, we map points color and shape to the Species
grouping variable.
# Control points color by groups
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))+
geom_point(aes(color = Species, shape = Species))
# Change the default color manually.
# Use the scale_color_manual() function
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))+
geom_point(aes(color = Species, shape = Species))+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
You can also split the plot into multiple panels according to a grouping variable. R function: facet_wrap()
. Another interesting feature of ggplot2, is the possibility to combine multiple layers on the same plot. For example, with the following R code, we’ll:
geom_point()
, colored by groups.geom_smooth()
. By default the function geom_smooth()
add the regression line and the confidence area. You can control the line color and confidence area fill color by groups.scale_color_manual()
and scale_fill_manual()
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))+
geom_point(aes(color = Species))+
geom_smooth(aes(color = Species, fill = Species))+
facet_wrap(~Species, ncol = 3, nrow = 1)+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
Note that, the default theme of ggplots is theme_gray()
(or theme_grey()
), which is theme with grey background and white grid lines. More themes are available for professional presentations or publications. These include: theme_bw()
, theme_classic()
and theme_minimal()
.
To change the theme of a given ggplot (p), use this: p + theme_classic()
. To change the default theme to theme_classic()
for all the future ggplots during your entire R session, type the following R code:
theme_set(
theme_classic()
)
Now you can create ggplots with theme_classic()
as default theme:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))+
geom_point()
The ggpubr R package facilitates the creation of beautiful ggplot2-based graphs for researcher with non-advanced programming backgrounds (Kassambara 2017).
For example, to create the density distribution of “Sepal.Length”, colored by groups (“Species”), type this:
library(ggpubr)
# Density plot with mean lines and marginal rug
ggdensity(iris, x = "Sepal.Length",
add = "mean", rug = TRUE, # Add mean line and marginal rugs
color = "Species", fill = "Species", # Color by groups
palette = "jco") # use jco journal color palette
Note that the argument palette
can take also a custom color palette. For example palette= c(“#00AFBB”, “#E7B800”, “#FC4E07”)
.
# Groups that we want to compare
my_comparisons <- list(
c("setosa", "versicolor"), c("versicolor", "virginica"),
c("setosa", "virginica")
)
# Create the box plot. Change colors by groups: Species
# Add jitter points and change the shape by groups
ggboxplot(
iris, x = "Species", y = "Sepal.Length",
color = "Species", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = "jitter"
)+
stat_compare_means(comparisons = my_comparisons, method = "t.test")
Learn more on STHDA at: ggpubr: Publication Ready Plots
You can export R graphics to many file formats, including: PDF, PostScript, SVG vector files, Windows MetaFile (WMF), PNG, TIFF, JPEG, etc.
The standard procedure to save any graphics from R is as follow:
Additional arguments indicating the width and the height (in inches) of the graphics region can be also specified in the mentioned function.
Create a plot
Close the graphic device using the function dev.off()
For example, you can export R base plots to a pdf file as follow:
pdf("r-base-plot.pdf")
# Plot 1 --> in the first page of PDF
plot(x = iris$Sepal.Length, y = iris$Sepal.Width)
# Plot 2 ---> in the second page of the PDF
hist(iris$Sepal.Length)
dev.off()
To export ggplot2 graphs, the R code looks like this:
# Create some plots
library(ggplot2)
myplot1 <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point()
myplot2 <- ggplot(iris, aes(Species, Sepal.Length)) +
geom_boxplot()
# Print plots to a pdf file
pdf("ggplot.pdf")
print(myplot1) # Plot 1 --> in the first page of PDF
print(myplot2) # Plot 2 ---> in the second page of the PDF
dev.off()
Note that for a ggplot, you can also use the following functions to export the graphic:
ggsave()
[in ggplot2]. Makes it easy to save a ggplot. It guesses the type of graphics device from the file extension.ggexport()
[in ggpubr]. Makes it easy to arrange and export multiple ggplots at once.See also the following blog post to save high-resolution ggplots
Kassambara, Alboukadel. 2017. Ggpubr: ’Ggplot2’ Based Publication Ready Plots. http://www.sthda.com/english/rpkgs/ggpubr.
Sarkar, Deepayan. 2016. Lattice: Trellis Graphics for R. https://CRAN.R-project.org/package=lattice.
Wickham, Hadley, and Winston Chang. 2017. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics.
Wickham, Hadley, Romain Francois, Lionel Henry, and Kirill Müller. 2017. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
To visualize one variable, the type of graphs to use depends on the type of the variable:
In this R graphics tutorial, you’ll learn how to:
Contents:
Load required packages and set the theme function theme_pubr()
[in ggpubr] as the default theme:
library(ggplot2)
library(ggpubr)
theme_set(theme_pubr())
geom_bar()
alpha
, color
, fill
, linetype
and size
Demo data set: diamonds
[in ggplot2]. Contains the prices and other attributes of almost 54000 diamonds. The column cut
contains the quality of the diamonds cut (Fair, Good, Very Good, Premium, Ideal).
The R code below creates a bar plot visualizing the number of elements in each category of diamonds cut.
ggplot(diamonds, aes(cut)) +
geom_bar(fill = "#0073C2FF") +
theme_pubclean()
Compute the frequency of each category and add the labels on the bar plot:
dplyr
package used to summarise the datageom_bar()
with option stat = "identity"
is used to create the bar plot of the summary output as it is.geom_text()
used to add text labels. Adjust the position of the labels by using hjust
(horizontal justification) and vjust
(vertical justification). Values should be in [0, 1].# Compute the frequency
library(dplyr)
df <- diamonds %>%
group_by(cut) %>%
summarise(counts = n())
df
## # A tibble: 5 x 2
## cut counts
##
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
# Create the bar plot. Use theme_pubclean() [in ggpubr]
ggplot(df, aes(x = cut, y = counts)) +
geom_bar(fill = "#0073C2FF", stat = "identity") +
geom_text(aes(label = counts), vjust = -0.3) +
theme_pubclean()
Pie chart is just a stacked bar chart in polar coordinates.
First,
cut
) in descending order. This important to compute the y coordinates of labels.cumsum(prop) - 0.5*prop
as label position.df <- df %>%
arrange(desc(cut)) %>%
mutate(prop = round(counts*100/sum(counts), 1),
lab.ypos = cumsum(prop) - 0.5*prop)
head(df, 4)
## # A tibble: 4 x 4
## cut counts prop lab.ypos
##
## 1 Ideal 21551 40.0 20.0
## 2 Premium 13791 25.6 52.8
## 3 Very Good 12082 22.4 76.8
## 4 Good 4906 9.1 92.5
coord_polar()
.ggplot(df, aes(x = "", y = prop, fill = cut)) +
geom_bar(width = 1, stat = "identity", color = "white") +
geom_text(aes(y = lab.ypos, label = prop), color = "white")+
coord_polar("y", start = 0)+
ggpubr::fill_palette("jco")+
theme_void()
ggpie()
[in ggpubr]:ggpie(
df, x = "prop", label = "prop",
lab.pos = "in", lab.font = list(color = "white"),
fill = "cut", color = "white",
palette = "jco"
)
Dot chart is an alternative to bar plots. Key functions:
geom_linerange()
:Creates line segments from x to ymaxgeom_point()
: adds dotsggpubr::color_palette()
: changes color palette.ggplot(df, aes(cut, prop)) +
geom_linerange(
aes(x = cut, ymin = 0, ymax = prop),
color = "lightgray", size = 1.5
)+
geom_point(aes(color = cut), size = 2)+
ggpubr::color_palette("jco")+
theme_pubclean()
Easy alternative to create a dot chart. Use ggdotchart()
[ggpubr]:
ggdotchart(
df, x = "cut", y = "prop",
color = "cut", size = 3, # Points color and size
add = "segment", # Add line segments
add.params = list(size = 2),
palette = "jco",
ggtheme = theme_pubclean()
)
Different types of graphs can be used to visualize the distribution of a continuous variable, including: density and histogram plots.
Create some data (wdata
) containing the weights by sex (M for male; F for female):
set.seed(1234)
wdata = data.frame(
sex = factor(rep(c("F", "M"), each=200)),
weight = c(rnorm(200, 55), rnorm(200, 58))
)
head(wdata, 4)
## sex weight
## 1 F 53.8
## 2 F 55.3
## 3 F 56.1
## 4 F 52.7
Compute the mean weight by sex using the dplyr
package. First, the data is grouped by sex and then summarized by computing the mean weight by groups. The operator %>%
is used to combine multiple operations:
library("dplyr")
mu <- wdata %>%
group_by(sex) %>%
summarise(grp.mean = mean(weight))
mu
## # A tibble: 2 x 2
## sex grp.mean
##
## 1 F 54.9
## 2 M 58.1
We start by creating a plot, named a
, that we’ll finish in the next section by adding a layer.
a <- ggplot(wdata, aes(x = weight))
Possible layers include: geom_density()
(for density plots) and geom_histogram()
(for histogram plots).
Key arguments to customize the plots:
color, size, linetype
: change the line color, size and type, respectivelyfill
: change the areas fill color (for bar plots, histograms and density plots)alpha
: create a semi-transparent color.Key function: geom_density()
.
geom_vline()
):# y axis scale = ..density.. (default behaviour)
a + geom_density() +
geom_vline(aes(xintercept = mean(weight)),
linetype = "dashed", size = 0.6)
# Change y axis to count instead of density
a + geom_density(aes(y = ..count..), fill = "lightgray") +
geom_vline(aes(xintercept = mean(weight)),
linetype = "dashed", size = 0.6,
color = "#FC4E07")
geom_vline()
. Data: mu
, which contains the mean values of weights by sex (computed in the previous section).scale_color_manual()
or scale_colour_manual()
for changing line colorscale_fill_manual()
for changing area fill colors.# Change line color by sex
a + geom_density(aes(color = sex)) +
scale_color_manual(values = c("#868686FF", "#EFC000FF"))
# Change fill color by sex and add mean line
# Use semi-transparent fill: alpha = 0.4
a + geom_density(aes(fill = sex), alpha = 0.4) +
geom_vline(aes(xintercept = grp.mean, color = sex),
data = mu, linetype = "dashed") +
scale_color_manual(values = c("#868686FF", "#EFC000FF"))+
scale_fill_manual(values = c("#868686FF", "#EFC000FF"))
ggboxplot()
[in ggpubr].library(ggpubr)
# Basic density plot with mean line and marginal rug
ggdensity(wdata, x = "weight",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
# Change outline and fill colors by groups ("sex")
# Use a custom palette
ggdensity(wdata, x = "weight",
add = "mean", rug = TRUE,
color = "sex", fill = "sex",
palette = c("#0073C2FF", "#FC4E07"))
An alternative to density plots is histograms, which represents the distribution of a continuous variable by dividing into bins and counting the number of observations in each bin.
Key function: geom_histogram()
. The basic usage is quite similar to geom_density()
.
a + geom_histogram(bins = 30, color = "black", fill = "gray") +
geom_vline(aes(xintercept = mean(weight)),
linetype = "dashed", size = 0.6)
Note that, by default:
geom_histogram()
uses 30 bins - this might not be good default. You can change the number of bins (e.g.: bins = 50) or the bin width (e.g.: binwidth = 0.5)
y = ..density..
in aes()
.
geom_vline()
. Data: mu
, which contains the mean values of weights by sex.scale_color_manual()
or scale_colour_manual()
for changing line colorscale_fill_manual()
for changing area fill colors.position
. Allowed values: “identity”, “stack”, “dodge”. Default value is “stack”.# Change line color by sex
a + geom_histogram(aes(color = sex), fill = "white",
position = "identity") +
scale_color_manual(values = c("#00AFBB", "#E7B800"))
# change fill and outline color manually
a + geom_histogram(aes(color = sex, fill = sex),
alpha = 0.4, position = "identity") +
scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
scale_color_manual(values = c("#00AFBB", "#E7B800"))
# Histogram with density plot
a + geom_histogram(aes(y = ..density..),
colour="black", fill="white") +
geom_density(alpha = 0.2, fill = "#FF6666")
# Color by groups
a + geom_histogram(aes(y = ..density.., color = sex),
fill = "white",
position = "identity")+
geom_density(aes(color = sex), size = 1) +
scale_color_manual(values = c("#868686FF", "#EFC000FF"))
gghistogram()
[in ggpubr].library(ggpubr)
# Basic histogram plot with mean line and marginal rug
gghistogram(wdata, x = "weight", bins = 30,
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
# Change outline and fill colors by groups ("sex")
# Use a custom palette
gghistogram(wdata, x = "weight", bins = 30,
add = "mean", rug = TRUE,
color = "sex", fill = "sex",
palette = c("#0073C2FF", "#FC4E07"))
geom_freqpoly()
.color
, size
, linetype
: change, respectively, line color, size and type.geom_area()
.color
, size
, linetype
: change, respectively, line color, size and type.fill
: change area fill color.In this section, we’ll use the theme theme_pubclean()
[in ggpubr]. This is a theme without axis lines, to direct more attention to the data. Type this to use the theme:
theme_set(theme_pubclean())
# Basic frequency polygon
a + geom_freqpoly(bins = 30)
# Basic area plots, which can be filled by color
a + geom_area( stat = "bin", bins = 30,
color = "black", fill = "#00AFBB")
# Frequency polygon:
# Change line colors and types by groups
a + geom_freqpoly( aes(color = sex, linetype = sex),
bins = 30, size = 1.5) +
scale_color_manual(values = c("#00AFBB", "#E7B800"))
# Area plots: change fill colors by sex
# Create a stacked area plots
a + geom_area(aes(fill = sex), color = "white",
stat ="bin", bins = 30) +
scale_fill_manual(values = c("#00AFBB", "#E7B800"))
As in histogram plots, the default y values is count. To have density values on y axis, specify y = ..density..
in aes()
.
geom_dotplot()
.alpha
, color
, fill
and dotsize
.Create a dot plot colored by groups (sex):
a + geom_dotplot(aes(fill = sex), binwidth = 1/4) +
scale_fill_manual(values = c("#00AFBB", "#E7B800"))
geom_boxplot()
geom_jitter()
. Change the color and the shape of points by groups (sex)ggplot(wdata, aes(x = factor(1), y = weight)) +
geom_boxplot(width = 0.4, fill = "white") +
geom_jitter(aes(color = sex, shape = sex),
width = 0.1, size = 1) +
scale_color_manual(values = c("#00AFBB", "#E7B800")) +
labs(x = NULL) # Remove x axis label
For example, in the following plots, you can see that:
# Another option for geom = "point"
a + stat_ecdf(aes(color = sex,linetype = sex),
geom = "step", size = 1.5) +
scale_color_manual(values = c("#00AFBB", "#E7B800"))+
labs(y = "f(weight)")
stat_qq()
.color
, shape
and size
to change point color, shape and size.Create a qq-plot of weight. Change color by groups (sex)
# Change point shapes by groups
ggplot(wdata, aes(sample = weight)) +
stat_qq(aes(color = sex)) +
scale_color_manual(values = c("#00AFBB", "#E7B800"))+
labs(y = "Weight")
Alternative plot using the function ggqqplot()
[in ggpubr]. The 95% confidence band is shown by default.
library(ggpubr)
ggqqplot(wdata, x = "weight",
color = "sex",
palette = c("#0073C2FF", "#FC4E07"),
ggtheme = theme_pubclean())
The density ridgeline plot is an alternative to the standard geom_density()
function that can be useful for visualizing changes in distributions, of a continuous variable, over time or space. Ridgeline plots are partially overlapping line plots that create the impression of a mountain range.
This functionality is provided in the R package ggridges
(Wilke 2017).
install.packages("ggridges")
theme_ridges()
[in ggridges]:library(ggplot2)
library(ggridges)
theme_set(theme_ridges())
iris
data set. The grouping variable Species will be mapped to the y-axis:ggplot(iris, aes(x = Sepal.Length, y = Species)) +
geom_density_ridges(aes(fill = Species)) +
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
You can control the overlap between the different densities using the scale
option. Default value is 1. Smaller values create a separation between the curves, and larger values create more overlap.
ggplot(iris, aes(x = Sepal.Length, y = Species)) +
geom_density_ridges(scale = 0.9)
Data set: lincoln_weather
[in ggridges]. Weather in Lincoln, Nebraska in 2016.
Create the density ridge plots of the Mean Temperature
by Month
and change the fill color according to the temperature value (on x axis). A gradient color is created using the function geom_density_ridges_gradient()
ggplot(
lincoln_weather,
aes(x = `Mean Temperature [F]`, y = `Month`)
) +
geom_density_ridges_gradient(
aes(fill = ..x..), scale = 3, size = 0.3
) +
scale_fill_gradientn(
colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
name = "Temp. [F]"
)+
labs(title = 'Temperatures in Lincoln NE')
For more examples, type the following R code:
browseVignettes("ggridges")
In this section, we’ll describe how to create easily basic and ordered bar plots using ggplot2 based helper functions available in the ggpubr R package. We’ll also present some modern alternatives to bar plots, including lollipop charts and cleveland’s dot plots.
library(ggpubr)
# Load data
dfm <- mtcars
# Convert the cyl variable to a factor
dfm$cyl <- as.factor(dfm$cyl)
# Add the name colums
dfm$name <- rownames(dfm)
# Inspect the data
head(dfm[, c("name", "wt", "mpg", "cyl")])
## name wt mpg cyl
## Mazda RX4 Mazda RX4 2.62 21.0 6
## Mazda RX4 Wag Mazda RX4 Wag 2.88 21.0 6
## Datsun 710 Datsun 710 2.32 22.8 4
## Hornet 4 Drive Hornet 4 Drive 3.21 21.4 6
## Hornet Sportabout Hornet Sportabout 3.44 18.7 8
## Valiant Valiant 3.46 18.1 6
mpg
variable. Change the fill color by the grouping variable “cyl”. Sorting will be done globally, but not by groups.ggbarplot(dfm, x = "name", y = "mpg",
fill = "cyl", # change fill color by cyl
color = "white", # Set bar border colors to white
palette = "jco", # jco journal color palett. see ?ggpar
sort.val = "asc", # Sort the value in dscending order
sort.by.groups = TRUE, # Don't sort inside each group
x.text.angle = 90, # Rotate vertically x axis texts
ggtheme = theme_pubclean()
)+
font("x.text", size = 8, vjust = 0.5)
To sort bars inside each group, use the argument sort.by.groups = TRUE
ggdotchart(dfm, x = "name", y = "mpg",
color = "cyl",
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
sorting = "asc", sort.by.groups = TRUE,
add = "segments",
add.params = list(color = "lightgray", size = 2),
group = "cyl",
dot.size = 4,
ggtheme = theme_pubclean()
)+
font("x.text", size = 8, vjust = 0.5)
Read more: Bar Plots and Modern Alternatives
ggplot(diamonds, aes(cut)) +
geom_bar(fill = "#0073C2FF") +
theme_minimal()
Start by creating a plot, named a
, that we’ll be finished by adding a layer.
a <- ggplot(wdata, aes(x = weight))
Possible layers include:
Key arguments to customize the plots:
color, size, linetype
: change the line color, size and type, respectivelyfill
: change the areas fill color (for bar plots, histograms and density plots)alpha
: create a semi-transparent color.Wilke, Claus O. 2017. Ggridges: Ridgeline Plots in ’Ggplot2’. https://CRAN.R-project.org/package=ggridges.
In this chapter, we’ll show how to plot data grouped by the levels of a categorical variable.
We start by describing how to plot grouped or stacked frequencies of two categorical variables. This can be done using bar plots and dot charts. You’ll also learn how to add labels to dodged and stacked bar plots.
Next we’ll show how to display a continuous variable with multiple groups. In this situation, the grouping variable is used as the x-axis and the continuous variable as the y-axis. You’ll learn, how to:
Contents:
Load required packages and set the theme function theme_pubclean()
[in ggpubr] as the default theme:
library(dplyr)
library(ggplot2)
library(ggpubr)
theme_set(theme_pubclean())
geom_bar()
.diamonds
[in ggplot2]. The categorical variables to be used in the demo example are:
cut
: quality of the diamonds cut (Fair, Good, Very Good, Premium, Ideal)color
: diamond colour, from J (worst) to D (best).In our demo example, we’ll plot only a subset of the data (color J and D). The different steps are as follow:
df <- diamonds %>%
filter(color %in% c("J", "D")) %>%
group_by(cut, color) %>%
summarise(counts = n())
head(df, 4)
## # A tibble: 4 x 3
## # Groups: cut [2]
## cut color counts
##
## 1 Fair D 163
## 2 Fair J 119
## 3 Good D 662
## 4 Good J 307
geom_bar()
. Key argument: stat = "identity"
to plot the data as it is.scale_color_manual()
and scale_fill_manual()
to set manually the bars border line colors and area fill colors.# Stacked bar plots of y = counts by x = cut,
# colored by the variable color
ggplot(df, aes(x = cut, y = counts)) +
geom_bar(
aes(color = color, fill = color),
stat = "identity", position = position_stack()
) +
scale_color_manual(values = c("#0073C2FF", "#EFC000FF"))+
scale_fill_manual(values = c("#0073C2FF", "#EFC000FF"))
# Use position = position_dodge()
p <- ggplot(df, aes(x = cut, y = counts)) +
geom_bar(
aes(color = color, fill = color),
stat = "identity", position = position_dodge(0.8),
width = 0.7
) +
scale_color_manual(values = c("#0073C2FF", "#EFC000FF"))+
scale_fill_manual(values = c("#0073C2FF", "#EFC000FF"))
p
Note that, position_stack()
automatically stack values in reverse order of the group aesthetic. This default ensures that bar colors align with the default legend. You can change this behavior by using position = position_stack(reverse = TRUE)
.
Alternatively, you can easily create a dot chart with the ggpubr
package:
ggdotchart(df, x = "cut", y ="counts",
color = "color", palette = "jco", size = 3,
add = "segment",
add.params = list(color = "lightgray", size = 1.5),
position = position_dodge(0.3),
ggtheme = theme_pubclean()
)
Or, if you prefer the ggplot2 verbs, type this:
ggplot(df, aes(cut, counts)) +
geom_linerange(
aes(x = cut, ymin = 0, ymax = counts, group = color),
color = "lightgray", size = 1.5,
position = position_dodge(0.3)
)+
geom_point(
aes(color = color),
position = position_dodge(0.3), size = 3
)+
scale_color_manual(values = c("#0073C2FF", "#EFC000FF"))+
theme_pubclean()
p + geom_text(
aes(label = counts, group = color),
position = position_dodge(0.8),
vjust = -0.3, size = 3.5
)
position_stack()
reverse the group order, color
column should be sorted in descending order.cumsum(counts) - 0.5 * counts
.# Arrange/sort and compute cumulative summs
df <- df %>%
arrange(cut, desc(color)) %>%
mutate(lab_ypos = cumsum(counts) - 0.5 * counts)
head(df, 4)
## # A tibble: 4 x 4
## # Groups: cut [2]
## cut color counts lab_ypos
##
## 1 Fair J 119 59.5
## 2 Fair D 163 200.5
## 3 Good J 307 153.5
## 4 Good D 662 638.0
# Create stacked bar graphs with labels
ggplot(df, aes(x = cut, y = counts)) +
geom_bar(aes(color = color, fill = color), stat = "identity") +
geom_text(
aes(y = lab_ypos, label = counts, group = color),
color = "white"
) +
scale_color_manual(values = c("#0073C2FF", "#EFC000FF"))+
scale_fill_manual(values = c("#0073C2FF", "#EFC000FF"))
Alternatively, you can easily create the above plot using the function ggbarplot()
[in ggpubr]:
ggbarplot(df, x = "cut", y = "counts",
color = "color", fill = "color",
palette = c("#0073C2FF", "#EFC000FF"),
label = TRUE, lab.pos = "in", lab.col = "white",
ggtheme = theme_pubclean()
)
Key function: geom_jitter()
. Arguments: alpha, color, fill, shape and size.
In the example below, we’ll plot a small fraction (1/5) of the diamonds dataset.
diamonds.frac <- dplyr::sample_frac(diamonds, 1/5)
ggplot(diamonds.frac, aes(cut, color)) +
geom_jitter(aes(color = cut), size = 0.3)+
ggpubr::color_palette("jco")+
ggpubr::theme_pubclean()
In this section, we’ll show to plot a grouped continuous variable using box plot, violin plot, strip chart and alternatives.
We’ll also describe how to add automatically p-values comparing groups.
In this section, we’ll set the theme theme_bw()
as the default ggplot theme:
theme_set(
theme_bw()
)
ToothGrowth
len
(tooth length). Used on y-axisdose
(dose levels of vitamin C: 0.5, 1, and 2 mg/day). Used on x-axis.First, convert the variable dose
from a numeric to a discrete factor variable:
data("ToothGrowth")
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
geom_boxplot()
width
: the width of the box plotnotch
: logical. If TRUE, creates a notched box plot. The notch displays a confidence interval around the median which is normally based on the median +/- 1.58*IQR/sqrt(n)
. Notches are used to compare groups; if the notches of two boxes do not overlap, this is a strong evidence that the medians differ.color
, size
, linetype
: Border line color, size and typefill
: box plot areas fill coloroutlier.colour
, outlier.shape
, outlier.size
: The color, the shape and the size for outlying points.# Default plot
e <- ggplot(ToothGrowth, aes(x = dose, y = len))
e + geom_boxplot()
# Notched box plot with mean points
e + geom_boxplot(notch = TRUE, fill = "lightgray")+
stat_summary(fun.y = mean, geom = "point",
shape = 18, size = 2.5, color = "#FC4E07")
# Color by group (dose)
e + geom_boxplot(aes(color = dose))+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
# Change fill color by group (dose)
e + geom_boxplot(aes(fill = dose)) +
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
Note that, it’s possible to use the function scale_x_discrete()
for:
For example, type this:
# Choose which items to display: group "0.5" and "2"
e + geom_boxplot() +
scale_x_discrete(limits=c("0.5", "2"))
# Change the default order of items
e + geom_boxplot() +
scale_x_discrete(limits=c("2", "0.5", "1"))
Two different grouping variables are used: dose
on x-axis and supp
as fill color (legend variable).
The space between the grouped box plots is adjusted using the function position_dodge()
.
e2 <- e + geom_boxplot(
aes(fill = supp),
position = position_dodge(0.9)
) +
scale_fill_manual(values = c("#999999", "#E69F00"))
e2
Split the plot into multiple panel. Use the function facet_wrap()
:
e2 + facet_wrap(~supp)
Violin plots are similar to box plots, except that they also show the kernel probability density of the data at different values. Typically, violin plots will include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots.
Key function:
geom_violin()
: Creates violin plots. Key arguments:
color
, size
, linetype
: Border line color, size and typefill
: Areas fill colortrim
: logical value. If TRUE (default), trim the tails of the violins to the range of the data. If FALSE, don’t trim the tails.stat_summary()
: Adds summary statistics (mean, median, …) on the violin plots.# Add mean points +/- SD
# Use geom = "pointrange" or geom = "crossbar"
e + geom_violin(trim = FALSE) +
stat_summary(
fun.data = "mean_sdl", fun.args = list(mult = 1),
geom = "pointrange", color = "black"
)
# Combine with box plot to add median and quartiles
# Change color by groups
e + geom_violin(aes(fill = dose), trim = FALSE) +
geom_boxplot(width = 0.2)+
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
theme(legend.position = "none")
The function mean_sdl
is used for adding mean and standard deviation. It computes the mean plus or minus a constant times the standard deviation. In the R code above, the constant is specified using the argument mult
(mult = 1). By default mult = 2. The mean +/- SD can be added as a crossbar or a pointrange.
e + geom_violin(
aes(color = supp), trim = FALSE,
position = position_dodge(0.9)
) +
geom_boxplot(
aes(color = supp), width = 0.15,
position = position_dodge(0.9)
) +
scale_color_manual(values = c("#00AFBB", "#E7B800"))
geom_dotplot()
. Creates stacked dots, with each dot representing one observation.stackdir
: which direction to stack the dots. “up” (default), “down”, “center”, “centerwhole” (centered, but with dots aligned).stackratio
: how close to stack the dots. Default is 1, where dots just just touch. Use smaller values for closer, overlapping dots.color
, fill
: Dot border color and area filldotsize
: The diameter of the dots relative to binwidth, default 1.As for violin plots, summary statistics are usually added to dot plots.
# Violin plots with mean points +/- SD
e + geom_dotplot(
binaxis = "y", stackdir = "center",
fill = "lightgray"
) +
stat_summary(
fun.data = "mean_sdl", fun.args = list(mult=1),
geom = "pointrange", color = "red"
)
# Combine with box plots
e + geom_boxplot(width = 0.5) +
geom_dotplot(
binaxis = "y", stackdir = "center",
fill = "white"
)
# Dot plot + violin plot + stat summary
e + geom_violin(trim = FALSE) +
geom_dotplot(
binaxis='y', stackdir='center',
color = "black", fill = "#999999"
) +
stat_summary(
fun.data="mean_sdl", fun.args = list(mult=1),
geom = "pointrange", color = "#FC4E07", size = 0.4
)
# Color dots by groups
e + geom_boxplot(width = 0.5, size = 0.4) +
geom_dotplot(
aes(fill = supp), trim = FALSE,
binaxis='y', stackdir='center'
)+
scale_fill_manual(values = c("#00AFBB", "#E7B800"))
# Change the position : interval between dot plot of the same group
e + geom_boxplot(
aes(color = supp), width = 0.5, size = 0.4,
position = position_dodge(0.8)
) +
geom_dotplot(
aes(fill = supp, color = supp), trim = FALSE,
binaxis='y', stackdir='center', dotsize = 0.8,
position = position_dodge(0.8)
)+
scale_fill_manual(values = c("#00AFBB", "#E7B800"))+
scale_color_manual(values = c("#00AFBB", "#E7B800"))
Stripcharts are also known as one dimensional scatter plots. These plots are suitable compared to box plots when sample sizes are small.
geom_jitter()
color
, fill
, size
, shape
. Changes points color, fill, size and shapeposition_jitter(0.2)
e + geom_jitter(
aes(shape = dose, color = dose),
position = position_jitter(0.2),
size = 1.2
) +
stat_summary(
aes(color = dose),
fun.data="mean_sdl", fun.args = list(mult=1),
geom = "pointrange", size = 0.4
)+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
position_jitterdodge()
instead of position_dodge()
.e + geom_jitter(
aes(shape = supp, color = supp),
position = position_jitterdodge(jitter.width = 0.2, dodge.width = 0.8),
size = 1.2
) +
stat_summary(
aes(color = supp),
fun.data="mean_sdl", fun.args = list(mult=1),
geom = "pointrange", size = 0.4,
position = position_dodge(0.8)
)+
scale_color_manual(values = c("#00AFBB", "#E7B800"))
sinaplot is inspired by the strip chart and the violin plot. By letting the normalized density of points restrict the jitter along the x-axis, the plot displays the same contour as a violin plot, but resemble a simple strip chart for small number of data points (Sidiropoulos et al. 2015).
In this way the plot conveys information of both the number of data points, the density distribution, outliers and spread in a very simple, comprehensible and condensed format.
Key function: geom_sina()
[ggforce]:
library(ggforce)
# Create some data
d1 <- data.frame(
y = c(rnorm(200, 4, 1), rnorm(200, 5, 2), rnorm(400, 6, 1.5)),
group = rep(c("Grp1", "Grp2", "Grp3"), c(200, 200, 400))
)
# Sinaplot
ggplot(d1, aes(group, y)) +
geom_sina(aes(color = group), size = 0.7)+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
In this section, we’ll show how to plot summary statistics of a continuous variable organized into groups by one or multiple grouping variables.
Note that, an easy way, with less typing, to create mean/median plots, is provided in the ggpubr package. See the associated article at: ggpubr-Plot Means/Medians and Error Bars
Set the default theme to theme_pubr()
[in ggpubr]:
theme_set(ggpubr::theme_pubr())
ToothGrowth
data set.df <- ToothGrowth
df$dose <- as.factor(df$dose)
head(df, 3)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
len
organized into groups by the variable dose
:library(dplyr)
df.summary <- df %>%
group_by(dose) %>%
summarise(
sd = sd(len, na.rm = TRUE),
len = mean(len)
)
df.summary
## # A tibble: 3 x 3
## dose sd len
##
## 1 0.5 4.50 10.6
## 2 1 4.42 19.7
## 3 2 3.77 26.1
geom_crossbar()
for hollow bar with middle indicated by horizontal linegeom_errorbar()
for error barsgeom_errorbarh()
for horizontal error barsgeom_linerange()
for drawing an interval represented by a vertical linegeom_pointrange()
for creating an interval represented by a vertical line, with a point in the middle.Start by initializing ggplot with the summary statistics data:
- Specify x and y as usually - Specify ymin = len-sd
and ymax = len+sd
to add lower and upper error bars. If you want only to add upper error bars but not the lower ones, use ymin = len
(instead of len-sd
) and ymax = len+sd
.
# Initialize ggplot with data
f <- ggplot(
df.summary,
aes(x = dose, y = len, ymin = len-sd, ymax = len+sd)
)
Possible error plots:
Create simple error plots:
# Vertical line with point in the middle
f + geom_pointrange()
# Standard error bars
f + geom_errorbar(width = 0.2) +
geom_point(size = 1.5)
Create horizontal error bars. Put dose
on y axis and len
on x-axis. Specify xmin
and xmax
.
# Horizontal error bars with mean points
# Change the color by groups
ggplot(
df.summary,
aes(x = len, y = dose, xmin = len-sd, xmax = len+sd)
) +
geom_point(aes(color = dose)) +
geom_errorbarh(aes(color = dose), height=.2)+
theme_light()
df
) and specify the df.summary
data in the error plot function, here geom_pointrange()
.# Combine with jitter points
ggplot(df, aes(dose, len)) +
geom_jitter(
position = position_jitter(0.2), color = "darkgray"
) +
geom_pointrange(
aes(ymin = len-sd, ymax = len+sd),
data = df.summary
)
# Combine with violin plots
ggplot(df, aes(dose, len)) +
geom_violin(color = "darkgray", trim = FALSE) +
geom_pointrange(
aes(ymin = len-sd, ymax = len+sd),
data = df.summary
)
df.summary
data.
ymin = len-sd
and ymax = len+sd
.ymin = len
(instead of len-sd
) and ymax = len+sd
.
Note that, for line plot, you should always specify group = 1
in the aes()
, when you have one group of line.
# (1) Line plot
ggplot(df.summary, aes(dose, len)) +
geom_line(aes(group = 1)) +
geom_errorbar( aes(ymin = len-sd, ymax = len+sd),width = 0.2) +
geom_point(size = 2)
# (2) Bar plot
ggplot(df.summary, aes(dose, len)) +
geom_bar(stat = "identity", fill = "lightgray",
color = "black") +
geom_errorbar(aes(ymin = len, ymax = len+sd), width = 0.2)
For line plot, you might want to treat x-axis as numeric:
df.sum2 <- df.summary
df.sum2$dose <- as.numeric(df.sum2$dose)
ggplot(df.sum2, aes(dose, len)) +
geom_line() +
geom_errorbar( aes(ymin = len-sd, ymax = len+sd),width = 0.2) +
geom_point(size = 2)
df
data for the jitter points and the df.summary
data for the other geom
layers.
# (1) Create a line plot of means +
# individual jitter points + error bars
ggplot(df, aes(dose, len)) +
geom_jitter( position = position_jitter(0.2),
color = "darkgray") +
geom_line(aes(group = 1), data = df.summary) +
geom_errorbar(
aes(ymin = len-sd, ymax = len+sd),
data = df.summary, width = 0.2) +
geom_point(data = df.summary, size = 2)
# (2) Bar plots of means + individual jitter points + errors
ggplot(df, aes(dose, len)) +
geom_bar(stat = "identity", data = df.summary,
fill = NA, color = "black") +
geom_jitter( position = position_jitter(0.2),
color = "black") +
geom_errorbar(
aes(ymin = len-sd, ymax = len+sd),
data = df.summary, width = 0.2)
len
) and two grouping variables (dose
, supp
).len
grouped by dose
and supp
:library(dplyr)
df.summary2 <- df %>%
group_by(dose, supp) %>%
summarise(
sd = sd(len),
len = mean(len)
)
df.summary2
## # A tibble: 6 x 4
## # Groups: dose [?]
## dose supp sd len
##
## 1 0.5 OJ 4.46 13.23
## 2 0.5 VC 2.75 7.98
## 3 1 OJ 3.91 22.70
## 4 1 VC 2.52 16.77
## 5 2 OJ 2.66 26.06
## 6 2 VC 4.80 26.14
# (1) Pointrange: Vertical line with point in the middle
ggplot(df.summary2, aes(dose, len)) +
geom_pointrange(
aes(ymin = len-sd, ymax = len+sd, color = supp),
position = position_dodge(0.3)
)+
scale_color_manual(values = c("#00AFBB", "#E7B800"))
# (2) Standard error bars
ggplot(df.summary2, aes(dose, len)) +
geom_errorbar(
aes(ymin = len-sd, ymax = len+sd, color = supp),
position = position_dodge(0.3), width = 0.2
)+
geom_point(aes(color = supp), position = position_dodge(0.3)) +
scale_color_manual(values = c("#00AFBB", "#E7B800"))
supp
)supp
)# (1) Line plot + error bars
ggplot(df.summary2, aes(dose, len)) +
geom_line(aes(linetype = supp, group = supp))+
geom_point()+
geom_errorbar(
aes(ymin = len-sd, ymax = len+sd, group = supp),
width = 0.2
)
# (2) Bar plots + upper error bars.
ggplot(df.summary2, aes(dose, len)) +
geom_bar(aes(fill = supp), stat = "identity",
position = position_dodge(0.8), width = 0.7)+
geom_errorbar(
aes(ymin = len, ymax = len+sd, group = supp),
width = 0.2, position = position_dodge(0.8)
)+
scale_fill_manual(values = c("grey80", "grey30"))
library(ggpubr)
# Create line plots of means
ggline(ToothGrowth, x = "dose", y = "len",
add = c("mean_sd", "jitter"),
color = "supp", palette = c("#00AFBB", "#E7B800"))
# Create bar plots of means
ggbarplot(ToothGrowth, x = "dose", y = "len",
add = c("mean_se", "jitter"),
color = "supp", palette = c("#00AFBB", "#E7B800"),
position = position_dodge(0.8))
# Create line plots
ggplot(df, aes(dose, len)) +
geom_jitter(
aes(color = supp),
position = position_jitter(0.2)
) +
geom_line(
aes(group = supp, color = supp),
data = df.summary2
) +
geom_errorbar(
aes(ymin = len-sd, ymax = len+sd, color = supp),
data = df.summary2, width = 0.2
)+
scale_color_manual(values = c("#00AFBB", "#E7B800"))
In this section, we’ll describe how to easily i) compare means of two or multiple groups; ii) and to automatically add p-values and significance levels to a ggplot (such as box plots, dot plots, bar plots and line plots, …).
Key functions:
compare_means()
[ggpubr package]: easy to use solution to performs one and multiple mean comparisons.stat_compare_means()
[ggpubr package]: easy to use solution to automatically add p-values and significance levels to a ggplot.The most common methods for comparing means include:
Methods | R function | Description |
---|---|---|
T-test | t.test() | Compare two groups (parametric) |
Wilcoxon test | wilcox.test() | Compare two groups (non-parametric) |
ANOVA | aov() or anova() | Compare multiple groups (parametric) |
Kruskal-Wallis | kruskal.test() | Compare multiple groups (non-parametric) |
library(ggpubr)
compare_means(len ~ supp, data = ToothGrowth,
method = "t.test")
## # A tibble: 1 x 8
## .y. group1 group2 p p.adj p.format p.signif method
##
## 1 len OJ VC 0.0606 0.0606 0.061 ns T-test
method = "t.test"
or method = "wilcox.test"
. Default is wilcoxon test.# Create a simple box plot and add p-values
p <- ggplot(ToothGrowth, aes(supp, len)) +
geom_boxplot(aes(color = supp)) +
scale_color_manual(values = c("#00AFBB", "#E7B800"))
p + stat_compare_means(method = "t.test")
# Display the significance level instead of the p-value
# Adjust label position
p + stat_compare_means(
aes(label = ..p.signif..), label.x = 1.5, label.y = 40
)
ggpaired()
[ggpubr] to create the paired box plot.ggpaired(ToothGrowth, x = "supp", y = "len",
color = "supp", line.color = "gray", line.size = 0.4,
palette = "jco")+
stat_compare_means(paired = TRUE)
# Perorm pairwise comparisons
compare_means(len ~ dose, data = ToothGrowth)
## # A tibble: 3 x 8
## .y. group1 group2 p p.adj p.format p.signif method
##
## 1 len 0.5 1 7.02e-06 1.40e-05 7.0e-06 **** Wilcoxon
## 2 len 0.5 2 8.41e-08 2.52e-07 8.4e-08 **** Wilcoxon
## 3 len 1 2 1.77e-04 1.77e-04 0.00018 *** Wilcoxon
# Visualize: Specify the comparisons you want
my_comparisons <- list( c("0.5", "1"), c("1", "2"), c("0.5", "2") )
ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means(comparisons = my_comparisons)+
stat_compare_means(label.y = 50)
# Use only p.format as label. Remove method name.
ggplot(ToothGrowth, aes(supp, len)) +
geom_boxplot(aes(color = supp))+
facet_wrap(~dose) +
scale_color_manual(values = c("#00AFBB", "#E7B800")) +
stat_compare_means(label = "p.format")
group
in stat_compare_means()
:ggplot(ToothGrowth, aes(dose, len)) +
geom_boxplot(aes(color = supp))+
scale_color_manual(values = c("#00AFBB", "#E7B800")) +
stat_compare_means(aes(group = supp), label = "p.signif")
# Box plot facetted by "dose"
p <- ggpaired(ToothGrowth, x = "supp", y = "len",
color = "supp", palette = "jco",
line.color = "gray", line.size = 0.4,
facet.by = "dose", short.panel.labs = FALSE)
# Use only p.format as label. Remove method name.
p + stat_compare_means(label = "p.format", paired = TRUE)
Read more at: Add P-values and Significance Levels to ggplots
The possible ggplot2 layers include:
geom_boxplot()
for box plotgeom_violin()
for violin plotgeom_dotplot()
for dot plotgeom_jitter()
for stripchartgeom_line()
for line plotgeom_bar()
for bar plotExamples of R code: start by creating a plot, named e
, and then finish it by adding a layer:
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
e <- ggplot(ToothGrowth, aes(x = dose, y = len))
# Summary statistics
library(dplyr)
df.summary <- ToothGrowth %>%
group_by(dose) %>%
summarise(
sd = sd(len, na.rm = TRUE),
len = mean(len)
)
# Initialize ggplot with data
f <- ggplot(
df.summary,
aes(x = dose, y = len, ymin = len-sd, ymax = len+sd)
)
# Combine with violin plots
ggplot(ToothGrowth, aes(dose, len))+
geom_violin(trim = FALSE) +
geom_pointrange(aes(ymin = len-sd, ymax = len + sd),
data = df.summary)
# Combine with dot plots
ggplot(ToothGrowth, aes(dose, len))+
geom_dotplot(stackdir = "center", binaxis = "y",
fill = "lightgray", dotsize = 1) +
geom_pointrange(aes(ymin = len-sd, ymax = len + sd),
data = df.summary)
# Combine with line plot
ggplot(df.summary, aes(dose, len))+
geom_line(aes(group = 1)) +
geom_pointrange(aes(ymin = len-sd, ymax = len + sd))
# Combine with bar plots
ggplot(df.summary, aes(dose, len))+
geom_bar(stat = "identity", fill = "lightgray") +
geom_pointrange(aes(ymin = len-sd, ymax = len + sd))
Sidiropoulos, Nikos, Sina Hadi Sohi, Nicolas Rapin, and Frederik Otzen Bagger. 2015. “SinaPlot: An Enhanced Chart for Simple and Truthful Representation of Single Observations over Multiple Classes.” bioRxiv. Cold Spring Harbor Laboratory. doi:10.1101/028191.
Scatter plots are used to display the relationship between two continuous variables x and y. In this article, we’ll start by showing how to create beautiful scatter plots in R.
We’ll use helper functions in the ggpubr R package to display automatically the correlation coefficient and the significance level on the plot.
We’ll also describe how to color points by groups and to add concentration ellipses around each group. Additionally, we’ll show how to create bubble charts, as well as, how to add marginal plots (histogram, density or box plot) to a scatter plot.
We continue by showing show some alternatives to the standard scatter plots, including rectangular binning, hexagonal binning and 2d density estimation. These plot types are useful in a situation where you have a large data set containing thousands of records.
R codes for zooming, in a scatter plot, are also provided. Finally, you’ll learn how to add fitted regression trend lines and equations to a scatter graph.
Contents:
devtools::install_github("wilkelab/cowplot")
install.packages("ggpmisc")
theme_minimal()
[in ggplot2]library(ggplot2)
library(ggpubr)
theme_set(
theme_minimal() +
theme(legend.position = "top")
)
Dataset: mtcars. The variable cyl
is used as grouping variable.
# Load data
data("mtcars")
df <- mtcars
# Convert cyl as a grouping variable
df$cyl <- as.factor(df$cyl)
# Inspect the data
head(df[, c("wt", "mpg", "cyl", "qsec")], 4)
## wt mpg cyl qsec
## Mazda RX4 2.62 21.0 6 16.5
## Mazda RX4 Wag 2.88 21.0 6 17.0
## Datsun 710 2.32 22.8 4 18.6
## Hornet 4 Drive 3.21 21.4 6 19.4
Key functions:
geom_point()
: Create scatter plots. Key arguments: color
, size
and shape
to change point color, size and shape.geom_smooth()
: Add smoothed conditional means / regression line. Key arguments:
color
, size
and linetype
: Change the line color, size and type.fill
: Change the fill color of the confidence region.b <- ggplot(df, aes(x = wt, y = mpg))
# Scatter plot with regression line
b + geom_point()+
geom_smooth(method = "lm")
# Add a loess smoothed fit curve
b + geom_point()+
geom_smooth(method = "loess")
To remove the confidence region around the regression line, specify the argument se = FALSE
in the function geom_smooth()
.
Change the point shape, by specifying the argument shape
, for example:
b + geom_point(shape = 18)
To see the different point shapes commonly used in R, type this:
ggpubr::show_point_shapes()
Create easily a scatter plot using ggscatter()
[in ggpubr]. Use stat_cor()
[ggpubr] to add the correlation coefficient and the significance level.
# Add regression line and confidence interval
# Add correlation coefficient: stat_cor()
ggscatter(df, x = "wt", y = "mpg",
add = "reg.line", conf.int = TRUE,
add.params = list(fill = "lightgray"),
ggtheme = theme_minimal()
)+
stat_cor(method = "pearson",
label.x = 3, label.y = 30)
geom_rug()
.# Change color and shape by groups (cyl)
b + geom_point(aes(color = cyl, shape = cyl))+
geom_smooth(aes(color = cyl, fill = cyl), method = "lm") +
geom_rug(aes(color =cyl)) +
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
# Remove confidence region (se = FALSE)
# Extend the regression lines: fullrange = TRUE
b + geom_point(aes(color = cyl, shape = cyl)) +
geom_rug(aes(color =cyl)) +
geom_smooth(aes(color = cyl), method = lm,
se = FALSE, fullrange = TRUE)+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
ggpubr::stat_cor(aes(color = cyl), label.x = 3)
facet_wrap()
:b + geom_point(aes(color = cyl, shape = cyl))+
geom_smooth(aes(color = cyl, fill = cyl),
method = "lm", fullrange = TRUE) +
facet_wrap(~cyl) +
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
theme_bw()
stat_ellipse()
. Key arguments:
type
: The type of ellipse. The default “t” assumes a multivariate t-distribution, and “norm” assumes a multivariate normal distribution. “euclid” draws a circle with the radius equal to level, representing the euclidean distance from the center.level
: The confidence level at which to draw an ellipse (default is 0.95), or, if type=“euclid”, the radius of the circle to be drawn.b + geom_point(aes(color = cyl, shape = cyl))+
stat_ellipse(aes(color = cyl), type = "t")+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
Instead of drawing the concentration ellipse, you can: i) plot a convex hull of a set of points; ii) add the mean points and the confidence ellipse of each group. Key R functions: stat_chull()
, stat_conf_ellipse()
and stat_mean()
[in ggpubr]:
# Convex hull of groups
b + geom_point(aes(color = cyl, shape = cyl)) +
stat_chull(aes(color = cyl, fill = cyl),
alpha = 0.1, geom = "polygon") +
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
# Add mean points and confidence ellipses
b + geom_point(aes(color = cyl, shape = cyl)) +
stat_conf_ellipse(aes(color = cyl, fill = cyl),
alpha = 0.1, geom = "polygon") +
stat_mean(aes(color = cyl, shape = cyl), size = 2) +
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
ggpubr
. See this article: Perfect Scatter Plots with Correlation and Marginal Histograms# Add group mean points and stars
ggscatter(df, x = "wt", y = "mpg",
color = "cyl", palette = "npg",
shape = "cyl", ellipse = TRUE,
mean.point = TRUE, star.plot = TRUE,
ggtheme = theme_minimal())
# Change the ellipse type to 'convex'
ggscatter(df, x = "wt", y = "mpg",
color = "cyl", palette = "npg",
shape = "cyl",
ellipse = TRUE, ellipse.type = "convex",
ggtheme = theme_minimal())
Key functions:
geom_text()
and geom_label()
: ggplot2 standard functions to add text to a plot.geom_text_repel()
and geom_label_repel()
[in ggrepel package]. Repulsive textual annotations. Avoid text overlapping.First install ggrepel
(ìnstall.packages("ggrepel")
), then type this:
library(ggrepel)
# Add text to the plot
.labs <- rownames(df)
b + geom_point(aes(color = cyl)) +
geom_text_repel(aes(label = .labs, color = cyl), size = 3)+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
# Draw a rectangle underneath the text, making it easier to read.
b + geom_point(aes(color = cyl)) +
geom_label_repel(aes(label = .labs, color = cyl), size = 3)+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
In a bubble chart, points size
is controlled by a continuous variable, here qsec
. In the R code below, the argument alpha is used to control color transparency. alpha should be between 0 and 1.
b + geom_point(aes(color = cyl, size = qsec), alpha = 0.5) +
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
scale_size(range = c(0.5, 12)) # Adjust the range of points size
scale_color_gradientn()
[in ggplot2], by specifying two or more colors.b + geom_point(aes(color = mpg), size = 3) +
scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07"))
The function ggMarginal()
[in ggExtra package] (Attali 2017), can be used to easily add a marginal histogram, density or box plot to a scatter plot.
First, install the ggExtra package as follow: install.packages("ggExtra")
; then type the following R code:
# Create a scatter plot
p <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point(aes(color = Species), size = 3, alpha = 0.6) +
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
# Add density distribution as marginal plot
library("ggExtra")
ggMarginal(p, type = "density")
# Change marginal plot type
ggMarginal(p, type = "boxplot")
One limitation of ggExtra is that it can’t cope with multiple groups in the scatter plot and the marginal plots.
A solution is provided in the function ggscatterhist()
[ggpubr]:
library(ggpubr)
# Grouped Scatter plot with marginal density plots
ggscatterhist(
iris, x = "Sepal.Length", y = "Sepal.Width",
color = "Species", size = 3, alpha = 0.6,
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
margin.params = list(fill = "Species", color = "black", size = 0.2)
)
# Use box plot as marginal plots
ggscatterhist(
iris, x = "Sepal.Length", y = "Sepal.Width",
color = "Species", size = 3, alpha = 0.6,
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
margin.plot = "boxplot",
ggtheme = theme_bw()
)
In this section, we’ll present some alternatives to the standard scatter plots. These include:
Rectangular binning is a very useful alternative to the standard scatter plot in a situation where you have a large data set containing thousands of records.
Rectangular binning helps to handle overplotting. Rather than plotting each point, which would appear highly dense, it divides the plane into rectangles, counts the number of cases in each rectangle, and then plots a heatmap of 2d bin counts. In this plot, many small hexagon are drawn with a color intensity corresponding to the number of cases in that bin.
Key function: geom_bin2d()
: Creates a heatmap of 2d bin counts. Key arguments: bins
, numeric vector giving number of bins in both vertical and horizontal directions. Set to 30 by default.
Key function: geom_hex()
Key function: geom_density_2d()
# Rectangular binning
ggplot(diamonds, aes(carat, price)) +
geom_bin2d(bins = 20, color ="white")+
scale_fill_gradient(low = "#00AFBB", high = "#FC4E07")+
theme_minimal()
# Hexagonal binning
ggplot(diamonds, aes(carat, price)) +
geom_hex(bins = 20, color = "white")+
scale_fill_gradient(low = "#00AFBB", high = "#FC4E07")+
theme_minimal()
# Add 2d density estimation
sp <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point(color = "lightgray")
sp + geom_density_2d()
# Use different geometry and change the gradient color
sp + stat_density_2d(aes(fill = ..level..), geom = "polygon") +
scale_fill_gradientn(colors = c("#FFEDA0", "#FEB24C", "#F03B20"))
facet_zomm()
[in ggforce] (Pedersen 2016).iris
. The R code below zoom the points where Species == "versicolor"
.library(ggforce)
ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
geom_point() +
ggpubr::color_palette("jco") +
facet_zoom(x = Species == "versicolor")+
theme_bw()
To zoom the points, where Petal.Length < 2.5
, type this:
ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
geom_point() +
ggpubr::color_palette("jco") +
facet_zoom(x = Petal.Length < 2.5)+
theme_bw()
In this section, we’ll describe how to add trend lines to a scatter plot and labels (equation, R2, BIC, AIC) for a fitted lineal model.
# Load packages and set theme
library(ggpubr)
library(ggpmisc)
theme_set(
theme_bw() +
theme(legend.position = "top")
)
# Scatter plot
p <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point(aes(color = Species), size = 3, alpha = 0.6) +
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
facet_wrap(~Species)
stat_smooth()
[ggplot2]stat_cor()
[ggpubr]stat_poly_eq()
[ggpmisc]formula <- y ~ x
p +
stat_smooth( aes(color = Species, fill = Species), method = "lm") +
stat_cor(aes(color = Species), label.y = 4.4)+
stat_poly_eq(
aes(color = Species, label = ..eq.label..),
formula = formula, label.y = 4.2, parse = TRUE)
set.seed(4321)
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
my.data <- data.frame(x, y, group = c("A", "B"),
y2 = y * c(0.5,2), block = c("a", "a", "b", "b"))
# Polynomial regression. Sow equation and adjusted R2
formula <- y ~ poly(x, 3, raw = TRUE)
p <- ggplot(my.data, aes(x, y2, color = group)) +
geom_point() +
geom_smooth(aes(fill = group), method = "lm", formula = formula) +
stat_poly_eq(
aes(label = paste(..eq.label.., ..adj.rr.label.., sep = "~~~~")),
formula = formula, parse = TRUE
)
ggpar(p, palette = "jco")
Note that, you can also display the AIC and the BIC values using ..AIC.label..
and ..BIC.label..
in the above equation.
Other arguments (label.x, label.y) are available in the function stat_poly_eq()
to adjust label positions.
For more examples, type this R code: browseVignettes(“ggpmisc”)
.
b <- ggplot(mtcars, aes(x = wt, y = mpg))
Possible layers, include:
geom_point()
for scatter plotgeom_smooth()
for adding smoothed line such as regression linegeom_rug()
for adding a marginal ruggeom_text()
for adding textual annotationsc <- ggplot(diamonds, aes(carat, price))
Possible layers include:
geom_bin2d()
: Rectangular binning.geom_hex()
: Hexagonal binning.geom_density_2d()
: Contours from a 2d density estimateAttali, Dean. 2017. GgExtra: Add Marginal Histograms to ’Ggplot2’, and More ’Ggplot2’ Enhancements. https://github.com/daattali/ggExtra.
Pedersen, Thomas Lin. 2016. Ggforce: Accelerating ’Ggplot2’. https://github.com/thomasp85/ggforce.
When you have a bivariate data, you can easily visualize the relationship between the two variables by plotting a simple scatter plot.
For a data set containing three continuous variables, you can create a 3d scatter plot.
For a small data set with more than three variables, it’s possible to visualize the relationship between each pairs of variables by creating a scatter plot matrix. You can also compute a correlation analysis between each pairs of variables.
For a large multivariate data set, it is more difficult to visualize their relationships. Discovering knowledge from these data requires specific statistical techniques. Multivariate analysis (MVA) refers to a set of approaches used for analyzing a data set containing multiple variables.
Among these techniques, there are:
In this chapter we provide an overview of methods for visualizing multivariate data sets containing only continuous variables.
Contents:
library("magrittr") # for piping %>%
head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
You can create a 3d scatter plot using the R package scatterplot3d (Ligges, Maechler, and Schnackenberg 2017), which contains a function of the same name.
Install: install.packages("scatterplot3d")
Create a basic 3d scatter plot:
library(scatterplot3d)
scatterplot3d(
iris[,1:3], pch = 19, color = "steelblue",
grid = TRUE, box = FALSE,
mar = c(3, 3, 0.5, 3)
)
To create a scatter plot of each possible pairs of variables, you can use the function ggpairs() [in GGally
package, an extension of ggplot2](Schloerke et al. 2016) . It produces a pairwise comparison of multivariate data.
Install: install.packages("GGally")
library(GGally)
library(ggplot2)
ggpairs(iris[,-5])+ theme_bw()
p <- ggpairs(iris, aes(color = Species))+ theme_bw()
# Change color manually.
# Loop through each plot changing relevant scales
for(i in 1:p$nrow) {
for(j in 1:p$ncol){
p[i,j] <- p[i,j] +
scale_fill_manual(values=c("#00AFBB", "#E7B800", "#FC4E07")) +
scale_color_manual(values=c("#00AFBB", "#E7B800", "#FC4E07"))
}
}
p
An alternative to the function ggpairs()
is provided by the R base plot function chart.Correlation()
[in PerformanceAnalytics packages]. It displays the correlation coefficient and the significance levels as stars.
For example, type the following R code, after installing the PerformanceAnalytics
package:
# install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")
my_data <- mtcars[, c(1,3,4,5,6,7)]
chart.Correlation(my_data, histogram=TRUE, pch=19)
Recall that, correlation analysis is used to investigate the association between two or more variables. Read more at: Correlation analyses in R.
cor()
ggcorrplot()
[in ggcorplot package]. Extension to the ggplot2 system. See more examples at: http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2.corrplot()
[in corrplot package]. R base plotting system. See examples at: http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram.Here, we’ll present only the ggcorrplot
package (Kassambara 2016), which can be installed as follow: install.packages("ggcorrplot")
.
library("ggcorrplot")
# Compute a correlation matrix
my_data <- mtcars[, c(1,3,4,5,6,7)]
corr <- round(cor(my_data), 1)
# Visualize
ggcorrplot(corr, p.mat = cor_pmat(my_data),
hc.order = TRUE, type = "lower",
color = c("#FC4E07", "white", "#00AFBB"),
outline.col = "white", lab = TRUE)
In the plot above:
Principal component analysis (PCA) is a multivariate data analysis approach that allows us to summarize and visualize the most important information contained in a multivariate data set.
PCA reduces the data into few new dimensions (or axes), which are a linear combination of the original variables. You can visualize a multivariate data by drawing a scatter plot of the first two dimensions, which contain the most important information in the data. Read more at: https://goo.gl/kabVHq
iris
prcomp()
factoextra
R package (an extension to ggplot2) (Kassambara and Mundt 2017)library("factoextra")
my_data <- iris[, -5] # Remove the grouping variable
res.pca <- prcomp(my_data, scale = TRUE)
fviz_pca_biplot(res.pca, col.ind = iris$Species,
palette = "jco", geom = "point")
In the plot above:
Cluster analysis is one of the important data mining methods for discovering knowledge in multidimensional data. The goal of clustering is to identify pattern or groups of similar objects within a data set of interest. Read more at: http://www.sthda.com/english/articles/25-cluster-analysis-in-r-practical-guide/.
This section describes how to compute and visualize hierarchical clustering, which output is a tree called dendrogram showing groups of similar individuals.
hclust()
. It takes a dissimilarity matrix as an input, which is calculated using the function dist()
.fviz_dend()
[in factoextra]USArrests
Before cluster analysis, it’s recommended to scale (or normalize) the data, to make the variables comparable. R function: scale()
, applies scaling on the column of the data (variables).
library(factoextra)
USArrests %>%
scale() %>% # Scale the data
dist() %>% # Compute distance matrix
hclust(method = "ward.D2") %>% # Hierarchical clustering
fviz_dend(cex = 0.5, k = 4, palette = "jco") # Visualize and cut
# into 4 groups
A heatmap is another way to visualize hierarchical clustering. It’s also called a false colored image, where data values are transformed to color scale. Heat maps allow us to simultaneously visualize groups of samples and features. You can easily create a pretty heatmap using the R package pheatmap
.
In heatmap, generally, columns are samples and rows are variables. Therefore we start by scaling and then transpose the data before creating the heatmap.
library(pheatmap)
USArrests %>%
scale() %>% # Scale variables
t() %>% # Transpose
pheatmap(cutree_cols = 4) # Create the heatmap
For a multivariate continuous data, you can perform the following analysis or visualization depending on the complexity of your data:
Kassambara, Alboukadel. 2016. Ggcorrplot: Visualization of a Correlation Matrix Using ’Ggplot2’. http://www.sthda.com/english/wiki/ggcorrplot.
Kassambara, Alboukadel, and Fabian Mundt. 2017. Factoextra: Extract and Visualize the Results of Multivariate Data Analyses. http://www.sthda.com/english/rpkgs/factoextra.
Ligges, Uwe, Martin Maechler, and Sarah Schnackenberg. 2017. Scatterplot3d: 3D Scatter Plot. https://CRAN.R-project.org/package=scatterplot3d.
Schloerke, Barret, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Joseph Larmarange. 2016. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.
To visualize a small data set containing multiple categorical (or qualitative) variables, you can create either a bar plot, a balloon plot or a mosaic plot.
For a large multivariate categorical data, you need specialized statistical techniques dedicated to categorical data analysis, such as simple and multiple correspondence analysis. These methods make it possible to analyze and visualize the association (i.e. correlation) between a large number of qualitative variables.
Here, you’ll learn some examples of graphs, in R programming language, for visualizing the frequency distribution of categorical variables contained in small contingency tables. We provide also the R code for computing the simple correspondence analysis.
Contents:
Load required R packages and set the default theme:
library(ggplot2)
library(ggpubr)
theme_set(theme_pubr())
Demo data set: HairEyeColor
(distribution of hair and eye color and sex in 592 statistics students)
data("HairEyeColor")
df <- as.data.frame(HairEyeColor)
head(df)
## Hair Eye Sex Freq
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
ggplot(df, aes(x = Hair, y = Freq))+
geom_bar(
aes(fill = Eye), stat = "identity", color = "white",
position = position_dodge(0.9)
)+
facet_wrap(~Sex) +
fill_palette("jco")
Balloon plot is an alternative to bar plot for visualizing a large categorical data. We’ll use the function ggballoonplot()
[in ggpubr], which draws a graphical matrix of a contingency table, where each cell contains a dot whose size reflects the relative magnitude of the corresponding component.
Demo data sets: Housetasks
(a contingency table containing the frequency of execution of 13 house tasks in the couple.)
housetasks <- read.delim(
system.file("demo-data/housetasks.txt", package = "ggpubr"),
row.names = 1
)
head(housetasks, 4)
## Wife Alternating Husband Jointly
## Laundry 156 14 2 4
## Main_meal 124 20 5 4
## Dinner 77 11 7 13
## Breakfeast 82 36 15 7
ggballoonplot(housetasks, fill = "value")+
scale_fill_viridis_c(option = "C")
HairEyeColor
. Create a multi-panel plot by Sexdf <- as.data.frame(HairEyeColor)
ggballoonplot(df, x = "Hair", y = "Eye", size = "Freq",
fill = "Freq", facet.by = "Sex",
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C")
A mosaic plot is basically an area-proportional visualization of observed frequencies, composed of tiles (corresponding to the cells) created by recursive vertical and horizontal splits of a rectangle. The area of each tile is proportional to the corresponding cell entry, given the dimensions of previous splits.
Mosaic graph can be created using either the function mosaicplot()
[in graphics] or the function mosaic()
[in vcd package]. Read more at: Visualizing Multi-way Contingency Tables with vcd.
Example of mosaic plot:
library(vcd)
mosaic(HairEyeColor, shade = TRUE, legend = TRUE)
Correspondence analysis can be used to summarize and visualize the information contained in a large contingency table formed by two categorical variables.
Required package: FactoMineR for the analysis and factoextra for the visualization
library(FactoMineR)
library(factoextra)
res.ca <- CA(housetasks, graph = FALSE)
fviz_ca_biplot(res.ca, repel = TRUE)
From the graphic above, it’s clear that:
Read more at: Correspondence analysis in R
In this chapter, we start by describing how to plot simple and multiple time series data using the R function geom_line()
[in ggplot2].
Next, we show how to set date axis limits and add trend smoothed line to a time series graphs. Finally, we introduce some extensions to the ggplot2 package for easily handling and analyzing time series objects.
Additionally, you’ll learn how to detect peaks (maxima) and valleys (minima) in time series data.
Contents:
economics
[ggplot2] time series data sets are used.In this section we’ll plot the variables psavert
(personal savings rate) and uempmed
(number of unemployed in thousands) by date
(x-axis).
library(ggplot2)
theme_set(theme_minimal())
# Demo dataset
head(economics)
## # A tibble: 6 x 6
## date pce pop psavert uempmed unemploy
##
## 1 1967-07-01 507 198712 12.5 4.5 2944
## 2 1967-08-01 510 198911 12.5 4.7 2945
## 3 1967-09-01 516 199113 11.7 4.6 2958
## 4 1967-10-01 513 199311 12.5 4.9 3143
## 5 1967-11-01 518 199498 12.5 4.7 3066
## 6 1967-12-01 526 199657 12.1 4.8 3018
# Basic line plot
ggplot(data = economics, aes(x = date, y = pop))+
geom_line(color = "#00AFBB", size = 2)
# Plot a subset of the data
ss <- subset(economics, date > as.Date("2006-1-1"))
ggplot(data = ss, aes(x = date, y = pop)) +
geom_line(color = "#FC4E07", size = 2)
ggplot(data = economics, aes(x = date, y = pop)) +
geom_line(aes(size = unemploy/pop), color = "#FC4E07")
Here, we’ll plot the variables psavert
and uempmed
by dates. You should first reshape the data using the tidyr
package: - Collapse psavert
and uempmed
values in the same column (new column). R function: gather()[tidyr]
- Create a grouping variable that with levels = psavert
and uempmed
library(tidyr)
library(dplyr)
df <- economics %>%
select(date, psavert, uempmed) %>%
gather(key = "variable", value = "value", -date)
head(df, 3)
## # A tibble: 3 x 3
## date variable value
##
## 1 1967-07-01 psavert 12.5
## 2 1967-08-01 psavert 12.5
## 3 1967-09-01 psavert 11.7
# Multiple line plot
ggplot(df, aes(x = date, y = value)) +
geom_line(aes(color = variable), size = 1) +
scale_color_manual(values = c("#00AFBB", "#E7B800")) +
theme_minimal()
# Area plot
ggplot(df, aes(x = date, y = value)) +
geom_area(aes(color = variable, fill = variable),
alpha = 0.5, position = position_dodge(0.8)) +
scale_color_manual(values = c("#00AFBB", "#E7B800")) +
scale_fill_manual(values = c("#00AFBB", "#E7B800"))
Key R function: scale_x_date()
# Base plot with date axis
p <- ggplot(data = economics, aes(x = date, y = psavert)) +
geom_line(color = "#00AFBB", size = 1)
p
# Set axis limits c(min, max)
min <- as.Date("2002-1-1")
max <- NA
p + scale_x_date(limits = c(min, max))
Key function: scale_x_date()
.
To format date axis labels, you can use different combinations of days, weeks, months and years:
%a
and %A
for abbreviated and full weekday name, respectively%b
and %B
for abbreviated and full month name, respectively%d
: day of the month as decimal number%Y
: Year with century.?strptime
# Format : month/year
p + scale_x_date(date_labels = "%b/%Y")
Key function: stat_smooth()
p + stat_smooth(
color = "#FC4E07", fill = "#FC4E07",
method = "loess"
)
The ggfortify
package is an extension to ggplot2 that makes it easy to plot time series objects (Horikoshi and Tang 2017). It can handle the output of many time series packages, including: zoo::zooreg(), xts::xts(), timeSeries::timSeries(), tseries::irts(), forecast::forecast(), vars:vars().
Another interesting package is the ggpmisc
package (Aphalo 2017), which provides two useful methods for time series object:
stat_peaks()
finds at which x positions local y maxima are located, andstat_valleys()
finds at which x positions local y minima are located.Here, we’ll show how to easily:
AirPassengers
(monthly airline passenger numbers 1949-1960).changepoint
package.strucchange
package and the data set Nile
(Measurements of the annual flow of the river Nile at Aswan).ggpmisc
package and the data set lynx
(Annual Canadian Lynx trappings 1821–1934).First, install required R packages:
install.packages(
c("ggfortify", "changepoint",
"strucchange", "ggpmisc")
)
Then use the autoplot.ts()
function to visualize time series objects, as follow:
library(ggfortify)
library(magrittr) # for piping %>%
# Plot ts objects
autoplot(AirPassengers)
# Identify change points in mean and variance
AirPassengers %>%
changepoint:: cpt.meanvar() %>% # Identify change points
autoplot()
# Detect jump in a data
strucchange::breakpoints(Nile ~ 1) %>%
autoplot()
Detect peaks and valleys:
library(ggpmisc)
ggplot(lynx, as.numeric = FALSE) + geom_line() +
stat_peaks(colour = "red") +
stat_peaks(geom = "text", colour = "red",
vjust = -0.5, x.label.fmt = "%Y") +
stat_valleys(colour = "blue") +
stat_valleys(geom = "text", colour = "blue", angle = 45,
vjust = 1.5, hjust = 1, x.label.fmt = "%Y")+
ylim(-500, 7300)
Aphalo, Pedro J. 2017. Ggpmisc: Miscellaneous Extensions to ’Ggplot2’. https://CRAN.R-project.org/package=ggpmisc.
Horikoshi, Masaaki, and Yuan Tang. 2017. Ggfortify: Data Visualization Tools for Statistical Analysis Results. https://CRAN.R-project.org/package=ggfortify.