Easy Guides

Preparing and Reshaping Data in R for Easier Analyses

Mon, 17 Oct 2016 03:43:56 +0200

Previously, we described the essentials of R programming and provided quick start guides for importing data into R. The next crucial step is to set your data into a consistent data structure for easier analyses. Here, you’ll learn modern conventions for preparing and reshaping data in order to facilitate analyses in R.

Tibble Data Format in R: Best and Modern Way to Work with your Data

Installing and loading tibble package: type install.packages("tibble") for installing and library("tibble") for loading.
Create a new tibble: data_frame(x = rnorm(100), y = rnorm(100)).
Convert your data to a tibble: as_data_frame(iris)
Advantages of tibbles compared to data frames: nice printing methods for large data sets, specification of column types.

Tidyr: Crucial Step Reshaping Data with R for Easier Analyses

What is a tidy data set?: a data structure convention where each column is a variable and each row an observation
Reshaping data using tidyr package
- Installing and loading tidyr: type install.packages("tidyr") for installing and library("tidyr") for loading.
- Example data sets: USArrests
- gather(): collapse columns into rows
- spread(): spread two columns into multiple columns
- unite(): Unite multiple columns into one
- separate(): separate one column into multiple
- %>%: Chaining multiple operations

Tidyr: Crucial Step Reshaping Data with R for Easier Analyses

Fri, 22 Apr 2016 07:37:31 +0200

What is a tidy data set?
Preliminary tasks
Reshaping data using tidyr package
Summary
Related articles
References
Infos

Previously, we described the essentials of R programming and provided quick start guides for importing data into R as well as converting your data into a tibble data format, which is the best and modern way to work with your data.

Here, you’ll learn how to organize (or reshape) your data in order to make the analysis easier. This process is called tidying your data.

[Figure adapted from RStudio data wrangling cheatsheet (see reference section)]

What is a tidy data set?

A data set is called tidy when:

each column represents a variable
and each row represents an observation

The opposite of tidy is messy data, which corresponds to any other arrangement of the data.
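As a small illustration of the definition above, the sketch below builds the same measurements twice with base R: once with one column per year (messy, because the variable "year" is hidden in the column names) and once in tidy form. The values and the names wide, tidy, country, year and cases are invented for illustration only.

```r
# Hypothetical counts for two countries and two years (invented values)
wide <- data.frame(
  country = c("A", "B"),
  y2015 = c(10, 20),
  y2016 = c(12, 25)
)

# Tidy layout: one column per variable, one row per observation
tidy <- data.frame(
  country = rep(c("A", "B"), times = 2),
  year = rep(c(2015, 2016), each = 2),
  cases = c(wide$y2015, wide$y2016)
)
tidy
```

Both objects hold the same four measurements. Only the tidy one exposes year as a variable that can be filtered, grouped or plotted directly.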

Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization.

The R package tidyr, developed by Hadley Wickham, provides functions to help you organize (or reshape) your data set into tidy format. It’s particularly designed to work in combination with magrittr and dplyr to build a solid data analysis pipeline.

Preliminary tasks

Launch RStudio as described here: Running RStudio and setting up your working directory
Import your data as described here: Importing data into R

Reshaping data using tidyr package

The tidyr package provides four functions to help you change the layout of your data set:

gather(): gather (collapse) columns into rows
spread(): spread rows into columns
separate(): separate one column into multiple
unite(): unite multiple columns into one

Installing and loading tidyr

# Installing
install.packages("tidyr")

# Loading
library("tidyr")

Example data sets

We’ll use the R built-in USArrests data set. We start by subsetting a small data set, which will be used in the next sections as an example data set:

my_data <- USArrests[c(1, 10, 20, 30), ]
my_data

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Georgia      17.4     211       60 25.8
Maryland     11.3     300       67 27.8
New Jersey    7.4     159       89 18.8

Row names are states, so let’s use the function cbind() to add a column named “state” in the data. This will make the data tidy and the analysis easier.

my_data <- cbind(state = rownames(my_data), my_data)
my_data

                state Murder Assault UrbanPop Rape
Alabama       Alabama   13.2     236       58 21.2
Georgia       Georgia   17.4     211       60 25.8
Maryland     Maryland   11.3     300       67 27.8
New Jersey New Jersey    7.4     159       89 18.8
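As an alternative to cbind(), the tibble package offers rownames_to_column(), which moves the row names into a regular column. The sketch below assumes the tibble package is installed. Unlike the cbind() approach, it drops the original row names from the result.

```r
library("tibble")

# Same subset as above, rebuilt so that the sketch is self-contained
my_data <- USArrests[c(1, 10, 20, 30), ]

# Move the row names into a regular column called "state"
my_data <- rownames_to_column(my_data, var = "state")
names(my_data)
```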

gather(): collapse columns into rows

The function gather() collapses multiple columns into key-value pairs. It produces a “long” data format from a “wide” one. It’s an alternative to the melt() function [in reshape2 package].

Simplified format:

gather(data, key, value, ...)

data: A data frame
key, value: Names of key and value columns to create in output
…: Specification of columns to gather. Allowed values are:
- variable names
- if you want to select all variables between a and e, use a:e
- if you want to exclude a column name y use -y
- for more options, see: dplyr::select()

Examples of usage:

Gather all columns except the column state

my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   -state)
my_data2

        state arrest_attribute arrest_estimate
1     Alabama           Murder            13.2
2     Georgia           Murder            17.4
3    Maryland           Murder            11.3
4  New Jersey           Murder             7.4
5     Alabama          Assault           236.0
6     Georgia          Assault           211.0
7    Maryland          Assault           300.0
8  New Jersey          Assault           159.0
9     Alabama         UrbanPop            58.0
10    Georgia         UrbanPop            60.0
11   Maryland         UrbanPop            67.0
12 New Jersey         UrbanPop            89.0
13    Alabama             Rape            21.2
14    Georgia             Rape            25.8
15   Maryland             Rape            27.8
16 New Jersey             Rape            18.8

Note that all column names (except state) have been collapsed into a single key column (here “arrest_attribute”). Their values have been put into a value column (here “arrest_estimate”).

Gather only Murder and Assault columns

my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder, Assault)
my_data2

       state UrbanPop Rape arrest_attribute arrest_estimate
1    Alabama       58 21.2           Murder            13.2
2    Georgia       60 25.8           Murder            17.4
3   Maryland       67 27.8           Murder            11.3
4 New Jersey       89 18.8           Murder             7.4
5    Alabama       58 21.2          Assault           236.0
6    Georgia       60 25.8          Assault           211.0
7   Maryland       67 27.8          Assault           300.0
8 New Jersey       89 18.8          Assault           159.0

Note that the two columns Murder and Assault have been collapsed and the remaining columns (state, UrbanPop and Rape) have been duplicated.

Gather all variables between Murder and UrbanPop

my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder:UrbanPop)
my_data2

        state Rape arrest_attribute arrest_estimate
1     Alabama 21.2           Murder            13.2
2     Georgia 25.8           Murder            17.4
3    Maryland 27.8           Murder            11.3
4  New Jersey 18.8           Murder             7.4
5     Alabama 21.2          Assault           236.0
6     Georgia 25.8          Assault           211.0
7    Maryland 27.8          Assault           300.0
8  New Jersey 18.8          Assault           159.0
9     Alabama 21.2         UrbanPop            58.0
10    Georgia 25.8         UrbanPop            60.0
11   Maryland 27.8         UrbanPop            67.0
12 New Jersey 18.8         UrbanPop            89.0

The remaining columns (state and Rape) are duplicated.
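If you work with a recent tidyr release (1.0.0 or later, an assumption about your installation), the same reshaping can be written with pivot_longer(), which supersedes gather(). The sketch below is a rough equivalent of the first example; the object name long is ours.

```r
library("tidyr")

# Rebuild the example data so that the sketch is self-contained
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data <- cbind(state = rownames(my_data), my_data)

# Collapse every column except state into name/value pairs
long <- pivot_longer(my_data,
                     cols = -state,
                     names_to = "arrest_attribute",
                     values_to = "arrest_estimate")
long
```

The result is a tibble with the same 16 key-value pairs, but the rows are ordered state by state rather than attribute by attribute.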

How to use gather() programmatically inside an R function?

You should use the function gather_(), which takes character vectors containing column names instead of unquoted column names.

The simplified syntax is as follows:

gather_(data, key_col, value_col, gather_cols)

data: a data frame
key_col, value_col: Strings specifying the names of key and value columns to create
gather_cols: Character vector specifying column names to be gathered together into pair of key-value columns.

As an example, type this:

gather_(my_data,
       key_col = "arrest_attribute",
       value_col = "arrest_estimate",
       gather_cols = c("Murder", "Assault"))
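In recent tidyr releases the underscore variants such as gather_() are marked as deprecated. A minimal sketch of the same idea without them is shown below. It assumes tidyr 1.0.0 or later and uses tidyselect::all_of() to select columns from a character vector. The helper name gather_by_name() and its arguments are hypothetical, not part of tidyr.

```r
library("tidyr")

# Hypothetical helper: gather the columns named in `cols` (a character vector)
gather_by_name <- function(data, cols, key_col, value_col) {
  pivot_longer(data,
               cols = tidyselect::all_of(cols),
               names_to = key_col,
               values_to = value_col)
}

# Rebuild the example data so that the sketch is self-contained
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data <- cbind(state = rownames(my_data), my_data)

res <- gather_by_name(my_data,
                      cols = c("Murder", "Assault"),
                      key_col = "arrest_attribute",
                      value_col = "arrest_estimate")
res
```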

spread(): spread two columns into multiple columns

The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads them into multiple columns. It produces a “wide” data format from a “long” one. It’s an alternative to the function dcast() [in reshape2 package].

Simplified format:

spread(data, key, value)

data: A data frame
key: The (unquoted) name of the column whose values will be used as column headings.
value: The (unquoted) name of the column whose values will populate the cells.

Examples of usage:

Spread “my_data2” back into the wide format (the spread columns come back in alphabetical order):

my_data3 <- spread(my_data2, 
                   key = "arrest_attribute",
                   value = "arrest_estimate"
                   )
my_data3

       state Rape Assault Murder UrbanPop
1    Alabama 21.2     236   13.2       58
2    Georgia 25.8     211   17.4       60
3   Maryland 27.8     300   11.3       67
4 New Jersey 18.8     159    7.4       89
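With a recent tidyr release (1.0.0 or later, an assumption about your installation), pivot_wider() supersedes spread(). The sketch below rebuilds my_data2 and widens it again; the object name wide is ours.

```r
library("tidyr")

# Rebuild the long data used above so that the sketch is self-contained
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data <- cbind(state = rownames(my_data), my_data)
my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder:UrbanPop)

# One new column per distinct value of arrest_attribute
wide <- pivot_wider(my_data2,
                    names_from = arrest_attribute,
                    values_from = arrest_estimate)
wide
```

Unlike spread(), pivot_wider() keeps the new columns in their order of first appearance instead of sorting them.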

How to use spread() programmatically inside an R function?

You should use the function spread_(), which takes strings specifying key and value columns instead of unquoted column names.

The simplified syntax is as follows:

spread_(data, key_col, value_col)

data: a data frame.
key_col, value_col: Strings specifying the names of key and value columns.

As an example, type this:

spread_(my_data2,
       key_col = "arrest_attribute",
       value_col = "arrest_estimate"
       )

unite(): Unite multiple columns into one

The function unite() takes multiple columns and pastes them together into one.

Simplified format:

unite(data, col, ..., sep = "_")

data: A data frame
col: The new (unquoted) name of column to add.
sep: Separator to use between values

Examples of usage:

The R code below uses the data set “my_data” and unites the columns Murder and Assault

my_data4 <- unite(my_data,
                  col = "Murder_Assault",
                  Murder, Assault,
                  sep = "_")
my_data4

                state Murder_Assault UrbanPop Rape
Alabama       Alabama       13.2_236       58 21.2
Georgia       Georgia       17.4_211       60 25.8
Maryland     Maryland       11.3_300       67 27.8
New Jersey New Jersey        7.4_159       89 18.8
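By default unite() drops the columns it combines. If you want to keep them next to the new column, the remove argument of unite() can be set to FALSE, as in this small sketch (the object name kept is ours):

```r
library("tidyr")

# Rebuild the example data so that the sketch is self-contained
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data <- cbind(state = rownames(my_data), my_data)

# Keep Murder and Assault alongside the united column
kept <- unite(my_data,
              col = "Murder_Assault",
              Murder, Assault,
              sep = "_",
              remove = FALSE)
names(kept)
```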

How to use unite() programmatically inside an R function?

You should use the function unite_() as follows:

unite_(data, col, from, sep = "_")

data: A data frame.
col: String giving the name of the new column to be added
from: Character vector specifying the names of existing columns to be united
sep: Separator to use between values.

As an example, type this:

unite_(my_data,
    col = "Murder_Assault",
    from = c("Murder", "Assault"),
    sep = "_")

separate(): separate one column into multiple

The function separate() is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns.

Simplified format:

separate(data, col, into, sep = "[^[:alnum:]]+")

data: A data frame
col: Unquoted column name
into: Character vector specifying the names of new variables to be created.
sep: Separator between columns:
- If character, is interpreted as a regular expression.
- If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative values start at -1 at the far-right of the string.

Examples of usage:

Separate the column “Murder_Assault” [in my_data4] into two columns Murder and Assault:

separate(my_data4,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")

                state Murder Assault UrbanPop Rape
Alabama       Alabama   13.2     236       58 21.2
Georgia       Georgia   17.4     211       60 25.8
Maryland     Maryland   11.3     300       67 27.8
New Jersey New Jersey    7.4     159       89 18.8
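One detail the printed output hides: separate() returns the new columns as character vectors unless you ask for conversion. The sketch below compares the default behavior with convert = TRUE, an argument of separate() that attempts type conversion on the new columns. The object names as_text and as_num are ours.

```r
library("tidyr")

# Rebuild my_data4 so that the sketch is self-contained
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data <- cbind(state = rownames(my_data), my_data)
my_data4 <- unite(my_data, col = "Murder_Assault", Murder, Assault, sep = "_")

# Default: the pieces stay as text
as_text <- separate(my_data4,
                    col = "Murder_Assault",
                    into = c("Murder", "Assault"),
                    sep = "_")
class(as_text$Murder)

# convert = TRUE: the pieces are converted back to numbers
as_num <- separate(my_data4,
                   col = "Murder_Assault",
                   into = c("Murder", "Assault"),
                   sep = "_",
                   convert = TRUE)
class(as_num$Murder)
```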

How to use separate() programmatically inside an R function?

You should use the function separate_() as follows:

separate_(data, col, into, sep = "[^[:alnum:]]+")

data: A data frame.
col: String giving the name of the column to split
into: Character vector specifying the names of new columns to create
sep: Separator between columns (as above).

As an example, type this:

separate_(my_data4,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")

Chaining multiple operations

It’s possible to combine multiple operations using the magrittr forward-pipe operator: %>%.

For example, x %>% f is equivalent to f(x).

In the following R code:

first, my_data is passed to gather() function
next, the output of gather() is passed to unite() function

my_data %>% gather(key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder:UrbanPop) %>%
            unite(col = "attribute_estimate",
                  arrest_attribute, arrest_estimate)

        state Rape attribute_estimate
1     Alabama 21.2        Murder_13.2
2     Georgia 25.8        Murder_17.4
3    Maryland 27.8        Murder_11.3
4  New Jersey 18.8         Murder_7.4
5     Alabama 21.2        Assault_236
6     Georgia 25.8        Assault_211
7    Maryland 27.8        Assault_300
8  New Jersey 18.8        Assault_159
9     Alabama 21.2        UrbanPop_58
10    Georgia 25.8        UrbanPop_60
11   Maryland 27.8        UrbanPop_67
12 New Jersey 18.8        UrbanPop_89
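To make the equivalence between x %>% f and f(x) concrete, here is a minimal sketch on a plain numeric vector. It assumes the magrittr package is installed; the names x, piped and nested are ours.

```r
library("magrittr")

x <- c(1, 4, 9)

# Piped form: read left to right
piped <- x %>% sqrt() %>% sum()

# Nested form: read inside out
nested <- sum(sqrt(x))

identical(piped, nested)
```

Both forms compute the same value. The piped version simply lists the steps in the order they are applied.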

Summary

You should tidy your data for easier data analysis using the R package tidyr, which provides the following functions.

Collapse multiple columns together into key-value pairs (long data format): gather(data, key, value, …)
Spread key-value pairs into multiple columns (wide data format): spread(data, key, value)
Unite multiple columns into one: unite(data, col, …)
Separate one column into multiple: separate(data, col, into)

References

The figures illustrating tidyr functions have been adapted from RStudio data wrangling cheatsheet
Learn more about tidy data: Hadley Wickham. Tidy Data. Journal of Statistical Software, August 2014, Volume 59, Issue 10.

Infos

This analysis has been performed using R (ver. 3.2.3).

Tibble Data Format in R: Best and Modern Way to Work with Your Data

Thu, 14 Apr 2016 18:44:02 +0200

Preliminary tasks
Installing and loading tibble package
Create a new tibble
Convert your data as a tibble
Advantages of tibbles compared to data frames
Summary
Related articles
Infos

Previously, we described the essentials of R programming and provided quick start guides for importing data into R. The traditional R base functions read.table(), read.delim() and read.csv() import data into R as a data frame. However, the more modern R package readr provides several functions (read_delim(), read_tsv() and read_csv()), which are faster than R base functions and import data into R as a tbl_df (pronounced as “tibble diff”).

A tbl_df object is a data frame providing a nicer printing method, useful when working with large data sets.

In this article, we’ll present the tibble R package, developed by Hadley Wickham. The tibble R package provides easy to use functions for creating tibbles, which are a modern rethinking of data frames.

Preliminary tasks

Launch RStudio as described here: Running RStudio and setting up your working directory

Installing and loading tibble package

# Installing
install.packages("tibble")

# Loading
library("tibble")

Create a new tibble

To create a new tibble by combining multiple vectors, use the function data_frame():

# Create
friends_data <- data_frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE)
)

# Print
friends_data

Source: local data frame [4 x 4]

     name   age height married
    <chr> <dbl>  <dbl>   <lgl>
1 Nicolas    27    180    TRUE
2 Thierry    25    170   FALSE
3 Bernard    29    185    TRUE
4  Jerome    26    169    TRUE

Compared to the traditional data.frame(), the modern data_frame():

never converts strings to factors
never changes the names of variables
never creates row names
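The sketch below illustrates the naming difference on a tiny example. It uses tibble(), the name under which recent tibble releases provide this constructor (data_frame() is deprecated there). The column names and values are invented for illustration.

```r
library("tibble")

# A non-syntactic column name ("my var") plus a character column
df <- data.frame(`my var` = 1:2, txt = c("a", "b"))
tb <- tibble(`my var` = 1:2, txt = c("a", "b"))

names(df)      # data.frame() rewrites the name to "my.var"
names(tb)      # tibble() keeps "my var" as written
class(tb$txt)  # the strings stay character
```

Note that since R 4.0.0 data.frame() no longer converts strings to factors by default either, so that particular difference mainly matters on older R versions.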

Convert your data as a tibble

Note that if you use the readr package to import your data into R, you don’t need this step: readr already imports data as tbl_df.

To convert traditional data to a tibble, use the function as_data_frame() [in tibble package], which works on data frames, lists, matrices and tables:

library("tibble")

# Loading data
data("iris")
# Class of iris
class(iris)

[1] "data.frame"

# Print the first 6 rows
head(iris, 6)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

# Convert iris data to a tibble
my_data <- as_data_frame(iris)
class(my_data)

[1] "tbl_df"     "tbl"        "data.frame"

# Print my data
my_data

Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...

Note that only the first 10 rows are displayed.

In the situation where you want to turn a tibble back to a data frame, use the function as.data.frame(my_data).
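In recent tibble releases the conversion function is named as_tibble() (as_data_frame() is deprecated there). A minimal round-trip sketch, assuming such a release is installed; the object name back is ours:

```r
library("tibble")

# Data frame -> tibble
my_data <- as_tibble(iris)
class(my_data)

# Tibble -> plain data frame
back <- as.data.frame(my_data)
class(back)
```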

Advantages of tibbles compared to data frames

Tibbles have a nice printing method that shows only the first 10 rows and all the columns that fit on the screen. This is useful when you work with large data sets.
When printed, the data type of each column is specified (see below):
- <dbl>: for double
- <fctr>: for factor (<fct> in recent tibble versions)
- <chr>: for character
- <lgl>: for logical

my_data

Source: local data frame [150 x 5]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
..          ...         ...          ...         ...     ...

It’s possible to change the default printing appearance as follows:

Change the maximum and the minimum rows to print: options(tibble.print_max = 20, tibble.print_min = 6)
Always show all rows: options(tibble.print_max = Inf)
Always show all columns: options(tibble.width = Inf)
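The options above change the behavior for the whole session. For a one-off display you can instead pass n and width to print(), as in this sketch. It assumes a tibble release that provides as_tibble().

```r
library("tibble")

my_data <- as_tibble(iris)

# Show the first 15 rows for this call only
print(my_data, n = 15)

# Show 5 rows and every column, regardless of the console width
print(my_data, n = 5, width = Inf)
```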

Subsetting a tibble will always return a tibble. You don’t need to use drop = FALSE compared to traditional data.frames.
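A short sketch of this difference, assuming a tibble release that provides as_tibble() and is_tibble(); the object names are ours:

```r
library("tibble")

tb <- as_tibble(iris)

# A data frame drops to a plain vector when a single column is selected
col_df <- iris[, "Sepal.Length"]
is.data.frame(col_df)

# The same subset on a tibble is still a (one-column) tibble
col_tb <- tb[, "Sepal.Length"]
is_tibble(col_tb)

# Use [[ ]] (or $) when you really want the vector
vec <- tb[["Sepal.Length"]]
```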

Summary

Create a tibble: data_frame()
Convert your data to a tibble: as_data_frame()
Change default printing appearance of a tibble: options(tibble.print_max = 20, tibble.print_min = 6)

Infos

This analysis has been performed using R (ver. 3.2.3).
