Computing and Adding New Variables to a Data Frame in R

Previously, we described the essentials of R programming and provided quick start guides for importing data into R as well as converting your data into the tibble data format, which is the modern, conventional way to work with your data. We also described crucial steps to reshape your data with R for easier analyses.


Here, you’ll learn how to compute and add new variables to a data frame in R. This can be done easily using the functions mutate() and transmute() in the dplyr R package.


  • mutate(): Computes and adds new variable(s). Preserves existing variables. It’s similar to the R base function transform().
  • transmute(): Computes new variable(s). Drops existing variables.

[Figure adapted from the RStudio data wrangling cheatsheet]

Preliminary tasks

  1. Launch RStudio as described here: Running RStudio and setting up your working directory

  2. Prepare your data as described here: Best practices for preparing your data, and save it in an external tab-delimited .txt file or a .csv file

  3. Import your data into R as described here: Fast reading of data from txt|csv files into R: readr package.

Here, we’ll use the R built-in iris data set, which we start by converting to a tibble data frame (tbl_df). A tibble is a modern rethinking of the data frame that provides a nicer printing method, which is useful when working with large data sets.

# Create my_data
my_data <- iris[, -5]
# Convert to a tibble
library("tibble")
my_data <- as_tibble(my_data)
# Print
my_data
Source: local data frame [150 x 4]
   Sepal.Length Sepal.Width Petal.Length Petal.Width
          (dbl)       (dbl)        (dbl)       (dbl)
1           5.1         3.5          1.4         0.2
2           4.9         3.0          1.4         0.2
3           4.7         3.2          1.3         0.2
4           4.6         3.1          1.5         0.2
5           5.0         3.6          1.4         0.2
6           5.4         3.9          1.7         0.4
7           4.6         3.4          1.4         0.3
8           5.0         3.4          1.5         0.2
9           4.4         2.9          1.4         0.2
10          4.9         3.1          1.5         0.1
..          ...         ...          ...         ...

Install and load the dplyr package

  • Install dplyr
install.packages("dplyr")
  • Load dplyr:
library("dplyr")
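In a script that may run on several machines, the two steps above are often combined so that the package is installed only when it is missing. A minimal sketch, assuming a CRAN mirror is already configured for the session:

```r
# Install dplyr only if it is not already available
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package
library("dplyr")

# Report which version was loaded
packageVersion("dplyr")
```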

dplyr::mutate(): Add new variables by preserving existing ones

  • Add a new column (sepal_by_petal_l) while preserving existing ones:
mutate(my_data,
       sepal_by_petal_l = Sepal.Length/Petal.Length
       )
Source: local data frame [150 x 5]
   Sepal.Length Sepal.Width Petal.Length Petal.Width sepal_by_petal_l
          (dbl)       (dbl)        (dbl)       (dbl)            (dbl)
1           5.1         3.5          1.4         0.2         3.642857
2           4.9         3.0          1.4         0.2         3.500000
3           4.7         3.2          1.3         0.2         3.615385
4           4.6         3.1          1.5         0.2         3.066667
5           5.0         3.6          1.4         0.2         3.571429
6           5.4         3.9          1.7         0.4         3.176471
7           4.6         3.4          1.4         0.3         3.285714
8           5.0         3.4          1.5         0.2         3.333333
9           4.4         2.9          1.4         0.2         3.142857
10          4.9         3.1          1.5         0.1         3.266667
..          ...         ...          ...         ...              ...
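mutate() can also compute several variables in one call. The sketch below assumes dplyr is installed and rebuilds my_data from iris so that it runs on its own; it adds both ratios at once. The object name res is only illustrative.

```r
library(dplyr)

# Rebuild the example data so this block runs on its own
my_data <- tibble::as_tibble(iris[, -5])

# Several new columns can be computed in one call, separated by commas
res <- mutate(my_data,
              sepal_by_petal_l = Sepal.Length / Petal.Length,
              sepal_by_petal_w = Sepal.Width / Petal.Width)

# The four original columns are kept and the two new ones are appended
dim(res)
names(res)
```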

dplyr::transmute(): Make new variables by dropping existing ones

  • Compute new columns (sepal_by_petal_*) and drop the existing ones:
transmute(my_data, 
            sepal_by_petal_l = Sepal.Length/Petal.Length,
            sepal_by_petal_w = Sepal.Width/Petal.Width
            )
Source: local data frame [150 x 2]
   sepal_by_petal_l sepal_by_petal_w
              (dbl)            (dbl)
1          3.642857         17.50000
2          3.500000         15.00000
3          3.615385         16.00000
4          3.066667         15.50000
5          3.571429         18.00000
6          3.176471          9.75000
7          3.285714         11.33333
8          3.333333         17.00000
9          3.142857         14.50000
10         3.266667         31.00000
..              ...              ...
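One way to think about transmute() is as mutate() followed by select() on the newly created columns. The sketch below, which assumes dplyr is installed, builds the result both ways so they can be compared; the names via_transmute and via_select are only illustrative.

```r
library(dplyr)

# Rebuild the example data so this block runs on its own
my_data <- tibble::as_tibble(iris[, -5])

# Route 1: transmute() keeps only the new columns
via_transmute <- transmute(my_data,
                           sepal_by_petal_l = Sepal.Length / Petal.Length,
                           sepal_by_petal_w = Sepal.Width / Petal.Width)

# Route 2: mutate() appends the columns, then select() keeps only those
via_select <- my_data %>%
  mutate(sepal_by_petal_l = Sepal.Length / Petal.Length,
         sepal_by_petal_w = Sepal.Width / Petal.Width) %>%
  select(sepal_by_petal_l, sepal_by_petal_w)

# Compare the two results column by column
all.equal(as.data.frame(via_transmute), as.data.frame(via_select))
```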

Use mutate() and transmute() programmatically inside a function:


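mutate() and transmute() are best suited for interactive use. The functions mutate_() and transmute_() should be used when calling from a function. In this case, the input must be "quoted". Note that these underscore variants have been deprecated since dplyr 0.7.0 in favor of tidy evaluation.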


There are three ways to quote inputs that dplyr understands:

  • With a formula, ~Sepal.Length.
  • With quote(), quote(Sepal.Length).
  • As a string: "Sepal.Length".
# Use formula
mutate_(my_data, 
            sepal_by_petal_l = ~Sepal.Length/Petal.Length,
            sepal_by_petal_w = ~Sepal.Width/Petal.Width
            )
# Or use quote
transmute_(my_data, 
            sepal_by_petal_l = quote(Sepal.Length/Petal.Length),
            sepal_by_petal_w = quote(Sepal.Width/Petal.Width)
            )
# or, this
transmute_(my_data, 
            sepal_by_petal_l = "Sepal.Length/Petal.Length",
            sepal_by_petal_w = "Sepal.Width/Petal.Width"
            )
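In current dplyr releases, the usual way to wrap mutate() in a function is tidy evaluation rather than the underscore variants. A minimal sketch, assuming a dplyr version that supports the {{ }} operator (rlang 0.4.0 or later); add_ratio() and add_ratio_str() are hypothetical helper names introduced only for illustration.

```r
library(dplyr)

# Hypothetical helper: column names are passed unquoted and forwarded with {{ }}
add_ratio <- function(data, numerator, denominator) {
  mutate(data, ratio = {{ numerator }} / {{ denominator }})
}

# Hypothetical helper: column names are passed as strings and looked up via .data
add_ratio_str <- function(data, numerator, denominator) {
  mutate(data, ratio = .data[[numerator]] / .data[[denominator]])
}

res1 <- add_ratio(iris, Sepal.Length, Petal.Length)
res2 <- add_ratio_str(iris, "Sepal.Length", "Petal.Length")

# Both helpers append the same ratio column
head(res1$ratio, 3)
identical(res1$ratio, res2$ratio)
```

The unquoted form reads like an ordinary dplyr call, while the string form is convenient when column names come from user input or a configuration file.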

transform(): R base function to compute and add new variables

dplyr::mutate() works similarly to the R base function transform(), except that in mutate() you can refer to variables you’ve just created. This is not possible in transform().

my_data2 <- transform(my_data, neg_sepal_length = -Sepal.Length)
head(my_data2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width neg_sepal_length
1          5.1         3.5          1.4         0.2             -5.1
2          4.9         3.0          1.4         0.2             -4.9
3          4.7         3.2          1.3         0.2             -4.7
4          4.6         3.1          1.5         0.2             -4.6
5          5.0         3.6          1.4         0.2             -5.0
6          5.4         3.9          1.7         0.4             -5.4
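The difference between the two functions can be checked directly. The sketch below uses a small made-up data frame (df, with hypothetical column names) and assumes that no object called doubled exists in the calling environment.

```r
library(dplyr)

# Small illustrative data frame (made-up values)
df <- data.frame(len = c(1, 2, 3))

# mutate() evaluates its arguments in order, so `doubled` is available
# when `doubled_plus_one` is computed
res <- mutate(df, doubled = len * 2, doubled_plus_one = doubled + 1)

# transform() evaluates every expression against the original columns,
# so `doubled` is not found and the call fails
base_attempt <- tryCatch(
  transform(df, doubled = len * 2, doubled_plus_one = doubled + 1),
  error = function(e) e
)
inherits(base_attempt, "error")

# In base R, chaining two transform() calls gives the same result as mutate()
res_base <- transform(transform(df, doubled = len * 2),
                      doubled_plus_one = doubled + 1)
```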

Summary


  • dplyr::mutate(iris, sepal = 2*Sepal.Length): Computes and appends new variable(s).
  • dplyr::transmute(iris, sepal = 2*Sepal.Length): Makes new variable(s) and drops existing ones.
  • transform(iris, sepal = 2*Sepal.Length): R base function similar to mutate().


Info

This analysis has been performed using R (ver. 3.2.4).

