Select Data Frame Columns in R

40 mins

In this tutorial, you will learn how to select or subset data frame columns by names and position using the R function select() and pull() [in dplyr package]. We’ll also show how to remove columns from a data frame.

You will learn how to use the following functions:

pull(): Extract column values as a vector. The column of interest can be specified either by name or by index.
select(): Extract one or multiple columns as a data table. It can be also used to remove columns from the data frame.
select_if(): Select columns based on a particular condition. One can use this function to, for example, select columns if they are numeric.
Helper functions - starts_with(), ends_with(), contains(), matches(), one_of(): Select columns/variables based on their names

Contents:

Required packages
Demo dataset
Extract column values as a vector
Extract columns as a data table
- Select column by position
- Select columns by names
Select column based on a condtion
Remove columns
Summary

Required packages

Load the tidyverse packages, which include dplyr:

library(tidyverse)

Demo dataset

We’ll use the R built-in iris data set, which we start by converting into a tibble data frame (tbl_df) for easier data analysis.

my_data <- as_tibble(iris)
my_data

## # A tibble: 150 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa 
## # ... with 144 more rows

Extract column values as a vector

my_data %>% pull(Species)

##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

Extract columns as a data table

Select column by position

Select columns 1 to 3:

my_data %>% select(1:3)

Select column 1 and 3 but not 2:

my_data %>% select(1, 3)

Select columns by names

Select columns by names: Sepal.Length and Petal.Length

my_data %>% select(Sepal.Length, Petal.Length)

## # A tibble: 150 x 2
##   Sepal.Length Petal.Length
##          <dbl>        <dbl>
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3
## 4          4.6          1.5
## 5          5            1.4
## 6          5.4          1.7
## # ... with 144 more rows

Select all columns from Sepal.Length to Petal.Length

my_data %>% select(Sepal.Length:Petal.Length)

## # A tibble: 150 x 3
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
## 2          4.9         3            1.4
## 3          4.7         3.2          1.3
## 4          4.6         3.1          1.5
## 5          5           3.6          1.4
## 6          5.4         3.9          1.7
## # ... with 144 more rows

There are several special functions that can be used inside select(): starts_with(), ends_with(), contains(), matches(), one_of(), etc.

# Select column whose name starts with "Petal"
my_data %>% select(starts_with("Petal"))

# Select column whose name ends with "Width"
my_data %>% select(ends_with("Width"))

# Select columns whose names contains "etal"
my_data %>% select(contains("etal"))
  
# Select columns whose name maches a regular expression
my_data %>% select(matches(".t."))

# selects variables provided in a character vector.
my_data %>% select(one_of(c("Sepal.Length", "Petal.Length")))

Select column based on a condtion

It’s possible to apply a function to the columns. The columns for which the function returns TRUE are selected.

Select only numeric columns:

my_data %>% select_if(is.numeric)

## # A tibble: 150 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3            1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5           3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
## # ... with 144 more rows

Remove columns

Note that, to remove a column from a data frame, prepend its name by minus -.

Removing Sepal.Length and Petal.Length columns:

my_data %>% select(-Sepal.Length, -Petal.Length)

Removing all columns from Sepal.Length to Petal.Length:

my_data %>% select(-(Sepal.Length:Petal.Length))

## # A tibble: 150 x 2
##   Petal.Width Species
##         <dbl> <fct>  
## 1         0.2 setosa 
## 2         0.2 setosa 
## 3         0.2 setosa 
## 4         0.2 setosa 
## 5         0.2 setosa 
## 6         0.4 setosa 
## # ... with 144 more rows

Removing all columns whose name starts with “Petal”:

my_data %>% select(-starts_with("Petal"))

## # A tibble: 150 x 3
##   Sepal.Length Sepal.Width Species
##          <dbl>       <dbl> <fct>  
## 1          5.1         3.5 setosa 
## 2          4.9         3   setosa 
## 3          4.7         3.2 setosa 
## 4          4.6         3.1 setosa 
## 5          5           3.6 setosa 
## 6          5.4         3.9 setosa 
## # ... with 144 more rows

Note that, if you want to drop columns by position, the syntax is as follow.

# Drop column 1
my_data %>% select(-1)

# Drop columns 1 to 3
my_data %>% select(-(1:3))

# Drop columns 1 and 3 but not 2
my_data %>% select(-1, -3)

Summary

In this tutorial, we describe how to select columns by positions and by names. Additionally, we present how to remove columns from a data frame.

Recommended for you

This section contains best data science and self-development resources to help you on your path.

Coursera - Online Courses and Specialization

Amazon FBA

Amazing Selling Machine

Free Training - How to Build a 7-Figure Amazon FBA Business You Can Run 100% From Home and Build Your Dream Life! by ASM

Books - Data Science

Our Books

Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)
Practical Guide To Principal Component Methods in R by A. Kassambara (Datanovia)
Machine Learning Essentials: Practical Guide in R by A. Kassambara (Datanovia)
R Graphics Essentials for Great Data Visualization by A. Kassambara (Datanovia)
GGPlot2 Essentials for Great Data Visualization in R by A. Kassambara (Datanovia)
Network Analysis and Visualization in R by A. Kassambara (Datanovia)
Practical Statistics in R for Comparing Groups: Numerical Variables by A. Kassambara (Datanovia)
Inter-Rater Reliability Essentials: Practical Guide in R by A. Kassambara (Datanovia)

Others

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham & Garrett Grolemund
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Géron
Practical Statistics for Data Scientists: 50 Essential Concepts by Peter Bruce & Andrew Bruce
Hands-On Programming with R: Write Your Own Functions And Simulations by Garrett Grolemund & Hadley Wickham
An Introduction to Statistical Learning: with Applications in R by Gareth James et al.
Deep Learning with R by François Chollet & J.J. Allaire
Deep Learning with Python by François Chollet

Back to Data Manipulation in R

Comments ( 9 )

Pearson

30 Oct 2018

Thank you for sharing!

Reply
- Kassambara
  
  30 Oct 2018
  
  Your feedback is appreciated!
  
  Reply
Barry

31 Jan 2019

Thanks for the example!

One of the things I am always trying and failing to do is to work with a set of variables from a data frame – not pulling them out, but applying a function to them.

If I use the ‘select’ command, then I’ve created a copy of the data frame, and have to match it back to the original.

Is there any way to use pipes and dplyr to apply a command to a subset of columns?

Reply
- Kassambara
  
  01 Feb 2019
  
  Thank you for your comments.
  
  What do you think about the mutant() family of functions described at: https://www.datanovia.com/en/lessons/compute-and-add-new-variables-to-a-data-frame-in-r/ ?
  
  Reply
Monhieq

21 Feb 2019

thank you for your sharing

i am beginner in R. i have some questions.
how combine multi table ( i used .csv) to be one, because i saw that from those tables (8 tables) the field name is all same, but the file name is different.
can you help me how to combine those tables and add 1 field from each table (field name is Value) and added field name change to the table name?

e.g : number of table : 8 (T1 – T8)
number of field in every table = 10 (F1 – F10). F1 – F9 in T1 – T10 fill with the same data value
Data value of F10 is the only field that filled different value in every tables
How to combine (T9 ) with F1-F9, and add F10-F17 named as the tables name and will fills with the value of F10 in every tables?

I hope you can help and give your response to me. Thank you so much

Reply
MadDeveloper

17 Apr 2020

very helpful. Thanks a lot!!!

Reply
Fiona

11 May 2020

Hello, I tried the pull method on a column in my dataframe but received the following error message:

Error in UseMethod(“pull”) :
no applicable method for ‘pull’ applied to an object of class “factor”

Is there anything I can do to fix this?

Reply
- Kassambara
  
  11 May 2020
  
  Did this work?
  
  iris %>% dplyr::pull(Species)
  
  Reply
Sathira

23 May 2021

Thanks !!!

Reply