**R** comes with several **built-in data sets**, which are generally used as demo data for playing with R functions.

In this article, we’ll first describe how to load and use R built-in data sets. Next, we’ll describe some of the most used R demo data sets: **mtcars**, **iris**, **ToothGrowth**, **PlantGrowth** and **USArrests**.

**Launch RStudio** as described here: Running RStudio and setting up your working directory

To see the list of pre-loaded data, type the function **data**():

`data()`

The output lists the available data sets, each with a short description.
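The same list can also be captured as an R object instead of only being displayed. The sketch below assumes the standard **datasets** package that ships with R; the variable name `ds` is only illustrative.

```r
# data() returns an object whose 'results' element is a character matrix
# with one row per data set (its columns include "Item" and "Title")
ds <- data(package = "datasets")$results
# Show the name and title of the first few data sets
head(ds[, c("Item", "Title")])
# Check that a given data set is available
"mtcars" %in% ds[, "Item"]
```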

Load and print the mtcars data as follows:

```
# Loading
data(mtcars)
# Print the first 6 rows
head(mtcars, 6)
```

```
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
```

If you want to learn more about the mtcars data set, type this:

`?mtcars`

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

- View the content of the *mtcars* data set:

```
# 1. Loading
data("mtcars")
# 2. Print
head(mtcars)
```

- It contains 32 observations and 11 variables:

```
# Number of rows (observations)
nrow(mtcars)
```

`[1] 32`

```
# Number of columns (variables)
ncol(mtcars)
```

`[1] 11`

- Description of variables:

- mpg: Miles/(US) gallon
- cyl: Number of cylinders
- disp: Displacement (cu.in.)
- hp: Gross horsepower
- drat: Rear axle ratio
- wt: Weight (1000 lbs)
- qsec: 1/4 mile time
- vs: V/S
- am: Transmission (0 = automatic, 1 = manual)
- gear: Number of forward gears
- carb: Number of carburetors
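As a small illustration of how these variables can be used, the sketch below counts cars by transmission type and computes the average fuel consumption for each number of cylinders. It relies only on base R; the labels passed to `factor()` are added here for readability and are not part of the data set.

```r
data("mtcars")
# Number of cars by transmission type (am: 0 = automatic, 1 = manual)
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))
table(transmission)
# Average miles per gallon for each number of cylinders
tapply(mtcars$mpg, mtcars$cyl, mean)
```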

If you want to learn more about *mtcars*, type this:

`?mtcars`

The **iris** data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

```
data("iris")
head(iris)
```

```
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
```

The **ToothGrowth** data set contains the results from an experiment studying the effect of vitamin C on tooth growth in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC).

```
data("ToothGrowth")
head(ToothGrowth)
```

```
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
```

- len: Tooth length
- supp: Supplement type (VC or OJ).
- dose: Dose in milligrams/day (numeric)
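To see how the experiment is laid out, the following sketch cross-tabulates the supplement type and the dose, then compares the mean tooth length between the two delivery methods. Only base R functions are used.

```r
data("ToothGrowth")
# Number of animals for each supplement type and dose
table(ToothGrowth$supp, ToothGrowth$dose)
# Mean tooth length for each supplement type
tapply(ToothGrowth$len, ToothGrowth$supp, mean)
```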

The **PlantGrowth** data set contains results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.

```
data("PlantGrowth")
head(PlantGrowth)
```

```
weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
5 4.50 ctrl
6 4.61 ctrl
```

The **USArrests** data set contains statistics about violent crime rates by US state.

```
data("USArrests")
head(USArrests)
```

```
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
```

- Murder: Murder arrests (per 100,000)
- Assault: Assault arrests (per 100,000)
- UrbanPop: Percent urban population
- Rape: Rape arrests (per 100,000)

- Load a built-in R data set: **data**("dataset_name")
- Inspect the data set: **head**(dataset_name)

This analysis has been performed using R (ver. 3.2.3).

**R** is a free and powerful statistical software for **analyzing** and **visualizing** data. In this chapter, we provide a quick and easy introduction to **R programming**.

Read more: What is R and why learn R?

- Install R and RStudio on windows
- Install R and RStudio for MAC OSX
- Install R and RStudio on Linux

Read more: Installing R and RStudio

- **Use R outside RStudio**
- **Use R inside RStudio**
    - Launch RStudio under Windows, MAC OSX and Linux
    - Set up your working directory
    - Change your working directory
    - Set up a default working directory
- **Close your R/RStudio session**
- Functions: **setwd**(), **getwd**()
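As a brief sketch of the two functions listed above, the following code reads the current working directory, switches to another one and then switches back. Using `tempdir()` as the target is an assumption made here so that the example runs anywhere; in practice you would pass the path to your own project folder.

```r
# Remember the current working directory
old_dir <- getwd()
# Change the working directory (tempdir() is only a stand-in path)
setwd(tempdir())
getwd()
# Go back to the original working directory
setwd(old_dir)
```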

Read more: Running RStudio and setting up your working directory

**Basic arithmetic operations**: + (addition), - (subtraction), * (multiplication), / (division), ^ (exponentiation)

```
7 + 4 # => 11
7 - 4 # => 3
7 / 2 # => 3.5
7 * 2 # => 14
```

**Basic arithmetic functions**:

- Logarithms and exponentials: **log2**(x), **log10**(x), **exp**(x)
- Trigonometric functions: **cos**(x), **sin**(x), **tan**(x), **acos**(x), **asin**(x), **atan**(x)
- Other mathematical functions: **abs**(x): absolute value; **sqrt**(x): square root.

```
log2(4) # => 2
abs(-4) # => 4
sqrt(4) # => 2
```

**Assigning values to variables**:

`lemon_price <- 2`

**Basic data types**: **numeric**, **character** and **logical**

```
my_age <- 28 # Numeric variable
my_name <- "Nicolas" # Character variable
# Are you a data scientist?: (yes/no) <=> (TRUE/FALSE)
is_datascientist <- TRUE # logical variable
```

**Vectors**: a combination of multiple values (numeric, character or logical)

- Create a vector: **c**() for concatenate
- Case of missing values: **NA** (not available) and **NaN** (not a number)
- Get a subset of a vector: my_vector[i] to get the ith element
- Calculations with vectors: **max**(x), **min**(x), **range**(x), **length**(x), **sum**(x), **mean**(x), **prod**(x): product of the elements in x, **sd**(x): standard deviation, **var**(x): variance, **sort**(x)

```
# Create a numeric vector
friend_ages <- c(27, 25, 29, 26)
mean(friend_ages) # => 26.75
max(friend_ages) # => 29
```

**Matrices**: like an Excel sheet containing multiple rows and columns. A matrix combines multiple vectors of the same type (numeric, character or logical).

- Create and name a matrix: **matrix**(), **cbind**(), **rbind**(), **rownames**(), **colnames**()
- Check and convert: **is.matrix**(), **as.matrix**()
- Transpose a matrix: **t**()
- Dimensions of a matrix: **ncol**(), **nrow**(), **dim**()
- Get a subset of a matrix: my_data[row, col]
- Calculations with numeric matrices: **rowSums**(), **colSums**(), **rowMeans**(), **colMeans**(), **apply**()

```
col1 col2 col3
row1 5 2 7
row2 6 4 3
row3 7 5 4
row4 8 9 8
row5 9 8 7
```

**Factors**: grouping variables in your data

- Create a factor: **factor**(), **levels**()
- Check and convert: **is.factor**(x), **as.factor**(x)
- Calculations with factors:
    - Number of elements in each category: **summary**(), **table**()
    - Compute some statistics by groups (for example, mean by groups): **tapply**()

```
# Create a factor
friend_groups <- factor(c("grp1", "grp2", "grp1", "grp2"))
levels(friend_groups) # => "grp1", "grp2"
```

`[1] "grp1" "grp2"`

```
# Compute the mean age by groups
friend_ages <- c(27, 25, 29, 26)
tapply(friend_ages, friend_groups, mean)
```

```
grp1 grp2
28.0 25.5
```

**Data frames**: like a matrix but can have columns with different types

- Create a data frame: **data.frame**()
- Check and convert: **is.data.frame**(), **as.data.frame**()
- Transpose a data frame: **t**()
- Subset a data frame: my_data[row, col], **subset**(), **attach**() and **detach**()
- Extend a data frame: **$**, **cbind**(), **rbind**()
- Calculations with numeric data frames: **rowSums**(), **colSums**(), **rowMeans**(), **colMeans**(), **apply**()

```
name age height married
1 Nicolas 27 180 TRUE
2 Thierry 25 170 FALSE
3 Bernard 29 185 TRUE
4 Jerome 26 169 TRUE
```
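A data frame like the one printed above could be built as follows. This is a minimal sketch: the object name `friends_data` and the added column name are chosen here for illustration, and the values are simply read off the printed output.

```r
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE)
)
friends_data
# Columns can have different types
sapply(friends_data, class)
# Extend the data frame with a new column using $
friends_data$age_in_10_years <- friends_data$age + 10
```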

**Lists**: collection of objects, which can be vectors, matrices, data frames, etc.

- Create a list: **list**()
- Subset a list
- Extend a list

```
my_family <- list(
mother = "Veronique",
father = "Michel",
sisters = c("Alicia", "Monica"),
sister_age = c(12, 22)
)
# Print
my_family
```

```
$mother
[1] "Veronique"
$father
[1] "Michel"
$sisters
[1] "Alicia" "Monica"
$sister_age
[1] 12 22
```
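Subsetting and extending a list, mentioned in the bullet points above, can be sketched as follows. The list is recreated so that the block runs on its own; the new element name `n_sisters` is purely illustrative.

```r
my_family <- list(
  mother = "Veronique",
  father = "Michel",
  sisters = c("Alicia", "Monica"),
  sister_age = c(12, 22)
)
# Subset by name: returns the element itself
my_family$father
my_family[["sisters"]]
# Subset by position with single brackets: returns a (smaller) list
my_family[1]
# Extend the list with a new element
my_family$n_sisters <- length(my_family$sisters)
names(my_family)
```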

Read more: R programming basics

- Getting help on a specific function: **help**(mean), **example**(mean)
- General help about R: **help.start**()
- Other functions: **apropos**() and **help.search**()

Read more: Getting help with functions in R programming

- **What are R packages?**
- **Installing** R packages
    - Install a package from CRAN: **install.packages**()
    - Install a package from Bioconductor: **biocLite**()
    - Install a package from GitHub: **devtools::install_github**()
    - View the list of installed packages: **installed.packages**()
    - Folder containing installed packages: **.libPaths**()
- **Load** and use an R package: **library**()
- **View** loaded R packages: **search**()
- **Unload** an R package: **detach**("package:pkg_name", unload = TRUE)
- **Remove** installed packages: **remove.packages**()
- **Update** installed packages: **update.packages**()

Read more: Installing and using R packages

- List of pre-loaded data
- Loading a built-in R data
- Most used R built-in data sets
- mtcars: Motor Trend Car Road Tests
- iris
- ToothGrowth
- PlantGrowth
- USArrests

Read more: R Built-in data sets

In our previous articles, we published i) guides for installing and launching R/RStudio, ii) the basics of R programming, and iii) guides for finding help in R.

Here, we’ll describe:

- what is an **R package**
- and how to **install** and use **R packages**

An R package is an **extension of R** containing data sets and specific functions to solve specific questions.

R comes with standard (or base) packages, which contain the basic functions and data sets as well as standard statistical and graphical functions that allow R to work.

There are also thousands of other R packages available for download and installation from CRAN, Bioconductor and GitHub repositories.

After installation, you must load a package before you can use its functions.

Packages can be installed either from **CRAN** (for general packages), from **Bioconductor** (for biology-related packages) or from **GitHub** (development versions of packages).

The function **install.packages**() is used to install a package from CRAN. The syntax is as follows:

`install.packages("package_name")`

For example, to install the package named **readr**, type this:

`install.packages("readr")`

Note that, every time you install an R package, R may ask you to specify a CRAN mirror (or server). Choose one that’s close to your location, and R will connect to that server to download and install the package files.

It’s also possible to install multiple packages at the same time, as follows:

`install.packages(c("readr", "ggplot2"))`

Bioconductor contains packages for analyzing biology-related data. In the following R code, we want to install the R/Bioconductor package **limma**, which is dedicated to analyzing genomic data.

To install a package from **Bioconductor**, use this (note that in R 3.5 and later, `biocLite()` has been superseded by `BiocManager::install()`):

```
source("https://bioconductor.org/biocLite.R")
biocLite("limma")
```

GitHub is a repository useful for all software development and data analysis, including R packages. It makes sharing your package easy. You can read more about GitHub here: Git and GitHub, by Hadley Wickham.

To install a package from GitHub, the R package **devtools** (by Hadley Wickham) can be used. You should first install **devtools** if you don’t have it installed on your computer.

For example, the following R code installs the latest version of **survminer** R package developed by A. Kassambara (https://github.com/kassambara/survminer).

```
install.packages("devtools")
devtools::install_github("kassambara/survminer")
```

To view the list of the already **installed packages** on your computer, type:

`installed.packages()`

Note that, in RStudio, the list of installed packages is available in the lower right window under the Packages tab.
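Because `installed.packages()` returns a matrix with one row per package, it can also be queried from code. The sketch below checks whether a package is installed; **stats** is used because it ships with R, and the variable names are illustrative.

```r
# Row names of the matrix are the package names
installed <- rownames(installed.packages())
# Is the 'stats' package installed? (it is part of a standard R installation)
is_installed <- "stats" %in% installed
is_installed
# Installed version of a package
as.character(packageVersion("stats"))
```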

R packages are installed in a directory called **library**. The R function **.libPaths**() can be used to get the path to the **library**.

`.libPaths()`

`[1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library"`

To use a specific function available in an R package, you have to load the R package using the function **library**().

In the following R code, we want to import a file (“http://www.sthda.com/upload/decathlon.txt”) into R using the R package **readr**, which has been installed in the previous section.

The function **read_tsv**() [in **readr**] can be used to import a tab separated .txt file:

```
# Import my data
library("readr")
my_data <- read_tsv("http://www.sthda.com/upload/decathlon.txt")
# View the first 6 rows and the first 6 columns
# syntax: my_data[row, column]
my_data[1:6, 1:6]
```

```
name 100m Long.jump Shot.put High.jump 400m
1 SEBRLE 11.04 7.58 14.83 2.07 49.81
2 CLAY 10.76 7.40 14.26 1.86 49.37
3 KARPOV 11.02 7.30 14.77 2.04 48.37
4 BERNARD 11.02 7.23 14.25 1.92 48.93
5 YURKOV 11.34 7.09 15.19 2.10 50.42
6 WARNERS 11.11 7.60 14.31 1.98 48.68
```

To view the list of loaded (or attached) packages during an R session, use the function **search**():

`search()`

```
[1] ".GlobalEnv" "package:readr" "package:stats" "package:graphics"
[5] "package:grDevices" "package:utils" "package:datasets" "package:methods"
[9] "Autoloads" "package:base"
```

If you’re done with the package **readr** and you want to unload it, use the function **detach**():

`detach("package:readr", unload = TRUE)`

To remove an installed R package, use the function **remove.packages**() as follows:

`remove.packages("package_name")`

If you want to update all installed R packages, type this:

`update.packages()`

To update specific installed packages, say **readr** and **ggplot2**, use this:

`update.packages(oldPkgs = c("readr", "ggplot2"))`

- **install.packages**("package_name"): Install a package
- **library**("package_name"): Load and use a package
- **detach**("package:package_name", unload = TRUE): Unload a package
- **remove.packages**("package_name"): Remove an installed package from your computer
- **update.packages**(oldPkgs = "package_name"): Update a package


In our previous articles, we described how to install and start using R/RStudio. We also provided the essentials of R programming.

Here, we’ll describe how to get **help** about a specific function in **R**.

To read more about a given function, for example **mean**, the R function **help**() can be used as follows:

`help(mean)`

Or use this:

`?mean`

Both commands open the help page for **mean**.

If you want to see some examples of how to use the function, type this: **example**(function_name).

`example(mean)`

Note that, typical R help files contain the following sections:

- **Title**
- **Description**: a short description of what the function does.
- **Usage**: the syntax of the function.
- **Arguments**: the description of the arguments taken by the function.
- **Value**: the value returned by the function.
- **Examples**: examples of how to use the function.

If you want to read the general documentation about R, use the function **help.start**():

`help.start()`

The output is a web page, on most R installations, which can be browsed by clicking the hyperlinks.

**apropos**(): returns a list of objects whose names contain the pattern you searched for, by partial matching. This is useful when you don’t remember the exact name of a function:

```
# Returns the list of objects containing "med"
apropos("med")
```

```
[1] ".__C__namedList" "elNamed" "elNamed<-" "median" "median.default"
[6] "medpolish" "runmed"
```

**help.search**() (alternatively **??**): searches the documentation for a given character string in different ways. It returns a list of functions matching your search term, with a short description of each function.

```
help.search("mean")
# Or use this
??mean
```


Previously, we described how to install R/RStudio as well as how to launch R/RStudio and set up your working directory.

Here, we describe the basics you should know about **R programming**, including:

- Performing basic arithmetic operations and using basic arithmetic functions
- Creating and subsetting basic data types in R

R can be used as a calculator.

The basic arithmetic operators are:

- **+** (addition)
- **-** (subtraction)
- **\*** (multiplication)
- **/** (division)
- and **^** (exponentiation).

Type the commands below directly in the console:

```
# Addition
3 + 7
```

`[1] 10`

```
# Subtraction
7 - 3
```

`[1] 4`

```
# Multiplication
3 * 7
```

`[1] 21`

```
# Division
7/3
```

`[1] 2.333333`

```
# Exponentiation
2^3
```

`[1] 8`

```
# Modulo: returns the remainder of the division of 8/3
8 %% 3
```

`[1] 2`

Note that, in R, ‘#’ is used for adding comments to explain what the R code is about.

**Logarithms and Exponentials**:

```
log2(x) # logarithm base 2 of x
log10(x) # logarithm base 10 of x
exp(x) # exponential of x
```

**Trigonometric functions**:

```
cos(x) # cosine of x
sin(x) # sine of x
tan(x) # tangent of x
acos(x) # arc-cosine of x
asin(x) # arc-sine of x
atan(x) # arc-tangent of x
```

**Other mathematical functions**

```
abs(x) # absolute value of x
sqrt(x) # square root of x
```
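In the listings above, `x` is a placeholder. A few concrete calls, with values chosen here only for illustration, show what these functions return:

```r
log2(8)     # 3, because 2^3 = 8
log10(1000) # 3, because 10^3 = 1000
exp(0)      # 1
cos(0)      # 1
sin(0)      # 0
sqrt(16)    # 4
abs(-3)     # 3
# Trigonometric functions expect angles in radians
sin(pi / 2) # 1
```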

A variable can be used to store a value.

For example, the R code below will store the price of a lemon in a variable, say “lemon_price”:

```
# Price of a lemon = 2 euros
lemon_price <- 2
# or use this
lemon_price = 2
```

Note that, it’s possible to use **<-** or **=** for variable assignments.

Note that, R is case-sensitive. This means that *lemon_price* is different from *Lemon_Price*.

To print the value of the created object, just type its name:

`lemon_price`

`[1] 2`

or use the function **print()**:

`print(lemon_price)`

`[1] 2`

R saves the object *lemon_price* (also known as a variable) in memory. It’s possible to make some operations with it.

```
# Multiply lemon price by 5
5 * lemon_price
```

`[1] 10`

You can change the value of the object:

```
# Change the value
lemon_price <- 5
# Print again
lemon_price
```

`[1] 5`

The following R code creates two variables holding the width and the height of a rectangle. These two variables will be used to compute the area of the rectangle.

```
# Rectangle height
height <- 10
# rectangle width
width <- 5
# compute rectangle area
area <- height*width
print(area)
```

`[1] 50`

The function **ls()** can be used to see the list of objects we have created:

`ls()`

```
[1] "area" "height" "info" "lemon_price" "PACKAGES" "R_VERSION"
[7] "width"
```

The collection of objects currently stored is called the **workspace**.

Note that each variable takes up some space in the computer memory. If you work on a big project, it’s good to clean up your workspace.

To remove a variable, use the function **rm**():

```
# Remove height and width variable
rm(height, width)
# Display the remaining variables
ls()
```

`[1] "area" "info" "lemon_price" "PACKAGES" "R_VERSION" `

Basic data types are **numeric**, **character** and **logical**.

```
# Numeric object: How old are you?
my_age <- 28
# Character object: What's your name?
my_name <- "Nicolas"
# logical object: Are you a data scientist?
# (yes/no) <=> (TRUE/FALSE)
is_datascientist <- TRUE
```

Note that a character vector can be created using double (") or single (') quotes. If your text contains quotes of the same kind, you should escape them using a backslash (`\`) as follows.

`'My friend\'s name is "Jerome"'`

`[1] "My friend's name is \"Jerome\""`

```
# or use this
"My friend's name is \"Jerome\""
```

`[1] "My friend's name is \"Jerome\""`

It’s possible to use the function **class**() to see what type a variable is:

`class(my_age)`

`[1] "numeric"`

`class(my_name)`

`[1] "character"`

You can also use the functions **is.numeric**(), **is.character**(), **is.logical**() to check whether a variable is numeric, character or logical, respectively. For instance:

`is.numeric(my_age)`

`[1] TRUE`

`is.numeric(my_name)`

`[1] FALSE`

If you want to change the type of a variable to another one, use the **as.*** functions, including: **as.numeric**(), **as.character**(), **as.logical**(), etc.

`my_age`

`[1] 28`

```
# Convert my_age to a character variable
as.character(my_age)
```

`[1] "28"`

Note that converting a character value that does not represent a number (for example "Nicolas") to numeric returns NA (for not available), with a warning. R doesn’t know how to convert such a character value to a numeric value.
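To make the conversion rules concrete, here is a short sketch using **as.numeric**(). The variable name `converted` is illustrative, and `suppressWarnings()` is used only to keep the output free of the coercion warning.

```r
# A character string that represents a number converts cleanly
as.numeric("28")
# A character string that does not represent a number gives NA (with a warning)
converted <- suppressWarnings(as.numeric("Nicolas"))
converted
is.na(converted)
# Logical values convert to 1 (TRUE) and 0 (FALSE)
as.numeric(TRUE)
```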

A vector is a combination of multiple values (numeric, character or logical) in the same object. In this case, you can have **numeric vectors**, **character vectors** or **logical vectors**.

A vector is created using the function **c()** (for *concatenate*), as follows:

```
# Store your friends' ages in a numeric vector
friend_ages <- c(27, 25, 29, 26) # Create
friend_ages # Print
```

`[1] 27 25 29 26`

```
# Store your friend names in a character vector
my_friends <- c("Nicolas", "Thierry", "Bernard", "Jerome")
my_friends
```

`[1] "Nicolas" "Thierry" "Bernard" "Jerome" `

```
# Store your friends marital status in a logical vector
# Are they married? (yes/no <=> TRUE/FALSE)
are_married <- c(TRUE, FALSE, TRUE, TRUE)
are_married
```

`[1] TRUE FALSE TRUE TRUE`

It’s possible to give a name to the elements of a vector using the function **names()**.

```
# Vector without element names
friend_ages
```

`[1] 27 25 29 26`

```
# Vector with element names
names(friend_ages) <- c("Nicolas", "Thierry", "Bernard", "Jerome")
friend_ages
```

```
Nicolas Thierry Bernard Jerome
27 25 29 26
```

```
# You can also create a named vector as follows
friend_ages <- c(Nicolas = 27, Thierry = 25,
Bernard = 29, Jerome = 26)
friend_ages
```

```
Nicolas Thierry Bernard Jerome
27 25 29 26
```

Note that a vector can only hold elements of the same type. For example, you cannot have a vector that contains both characters and numeric values.
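If values of different types are passed to `c()`, R does not raise an error; it silently converts them to a common type. A short sketch (the vector name `mixed` is illustrative):

```r
# Numeric, character and logical values are all converted to character
mixed <- c(27, "Nicolas", TRUE)
mixed
class(mixed) # "character"
# Mixing numeric and logical values gives a numeric vector
class(c(27, TRUE)) # "numeric"
```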

**Find the length of a vector** (i.e., the number of elements in a vector)

```
# Number of friends
length(my_friends)
```

`[1] 4`

I know that some of my friends (Nicolas and Thierry) have 2 children. But this information is not available (NA) for the remaining friends (Bernard and Jerome).

In R **missing values** (or missing information) are represented by NA:

```
have_child <- c(Nicolas = "yes", Thierry = "yes",
Bernard = NA, Jerome = NA)
have_child
```

```
Nicolas Thierry Bernard Jerome
"yes" "yes" NA NA
```

It’s possible to use the function **is.na**() to check whether data contains missing values. The result of **is.na**() is a logical vector in which the value TRUE indicates that the corresponding element in x is NA.

```
# Check if have_child contains missing values
is.na(have_child)
```

```
Nicolas Thierry Bernard Jerome
FALSE FALSE TRUE TRUE
```

Note that there is a second type of **missing value** named **NaN** (“Not a Number”). It is produced when a mathematical operation has no defined result, for example 0/0 = NaN.

Note also that, the function **is.na**() is TRUE for both NA and NaN values. To differentiate these, the function **is.nan**() is only TRUE for NaNs.
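The difference between the two checks can be seen on a small vector that contains both kinds of missing values (the vector `x` is made up for this sketch):

```r
x <- c(1, NA, 0/0)
is.na(x)   # FALSE TRUE TRUE
is.nan(x)  # FALSE FALSE TRUE
# Summary functions are affected by missing values
# unless na.rm = TRUE is used to drop them first
mean(x)
mean(x, na.rm = TRUE)
```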

Subsetting a vector consists of selecting a part of your vector.

**Selection by positive indexing**: select an element of a vector by its position (index) in square brackets

```
# Select my friend number 2
my_friends[2]
```

`[1] "Thierry"`

```
# Select my friends number 2 and 4
my_friends[c(2, 4)]
```

`[1] "Thierry" "Jerome" `

```
# Select my friends number 1 to 3
my_friends[1:3]
```

`[1] "Nicolas" "Thierry" "Bernard"`

Note that **R indexes from 1**, NOT 0. So the first element of a vector is at [1] and not [0].

If you have a named vector, it’s also possible to use the name for selecting an element:

`friend_ages["Bernard"]`

```
Bernard
29
```

**Selection by negative indexing**: Exclude an element

```
# Exclude my friend number 2
my_friends[-2]
```

`[1] "Nicolas" "Bernard" "Jerome" `

```
# Exclude my friends number 2 and 4
my_friends[-c(2, 4)]
```

`[1] "Nicolas" "Bernard"`

```
# Exclude my friends number 1 to 3
my_friends[-(1:3)]
```

`[1] "Jerome"`

**Selection by logical vector**: only the elements for which the corresponding value in the selecting vector is TRUE are kept in the subset.

```
# Select only married friends
my_friends[are_married == TRUE]
```

`[1] "Nicolas" "Bernard" "Jerome" `

```
# Friends with age >=27
my_friends[friend_ages >= 27]
```

`[1] "Nicolas" "Bernard"`

```
# Friends with age different from 27
my_friends[friend_ages != 27]
```

`[1] "Thierry" "Bernard" "Jerome" `

If you want to remove missing data, use this:

```
# Data with missing values
have_child
```

```
Nicolas Thierry Bernard Jerome
"yes" "yes" NA NA
```

```
# Keep only values different from NA (!is.na())
have_child[!is.na(have_child)]
```

```
Nicolas Thierry
"yes" "yes"
```

```
# Or, replace NA values by "NO" and then print
have_child[is.na(have_child)] <- "NO"
have_child
```

```
Nicolas Thierry Bernard Jerome
"yes" "yes" "NO" "NO"
```

Note that, the “logical” comparison operators available in R are:

- **<**: for less than
- **>**: for greater than
- **<=**: for less than or equal to
- **>=**: for greater than or equal to
- **==**: for equal to each other
- **!=**: not equal to each other

Note that, all the basic arithmetic operators (+, -, *, / and ^ ) as well as the common arithmetic functions (log, exp, sin, cos, tan, sqrt, abs, …), described in the previous sections, can be applied on a numeric vector.

If you perform an operation with vectors, the operation will be applied to each element of the vector. An example is provided below:

```
# My friends' salary in dollars
salaries <- c(2000, 1800, 2500, 3000)
names(salaries) <- c("Nicolas", "Thierry", "Bernard", "Jerome")
salaries
```

```
Nicolas Thierry Bernard Jerome
2000 1800 2500 3000
```

```
# Multiply salaries by 2
salaries*2
```

```
Nicolas Thierry Bernard Jerome
4000 3600 5000 6000
```

As you can see, R multiplies each element in the salaries vector by 2.

Now, suppose that you want to multiply the salaries by different coefficients. The following R code can be used:

```
# create coefs vector with the same length as salaries
coefs <- c(2, 1.5, 1, 3)
# Multiply salaries by coefs
salaries*coefs
```

```
Nicolas Thierry Bernard Jerome
4000 2700 2500 9000
```

Note that the calculation is done element-wise. The first element of salaries vector is multiplied by the first element of coefs vector, and so on.

Compute the square root of a numeric vector:

```
my_vector <- c(4, 16, 9)
sqrt(my_vector)
```

`[1] 2 4 3`

Other useful functions are:

```
max(x) # Get the maximum value of x
min(x) # Get the minimum value of x
# Get the range of x. Returns a vector containing
# the minimum and the maximum of x
range(x)
length(x) # Get the number of elements in x
sum(x) # Get the total of the elements in x
prod(x) # Get the product of the elements in x
# The mean value of the elements in x
# sum(x)/length(x)
mean(x)
sd(x) # Standard deviation of x
var(x) # Variance of x
# Sort the elements of x in ascending order
sort(x)
```

For example, if you want to compute the total **sum** of salaries, type this:

`sum(salaries)`

`[1] 9300`

Compute the **mean** of salaries:

`mean(salaries)`

`[1] 2325`

The range (minimum, maximum) of salaries is:

`range(salaries)`

`[1] 1800 3000`
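The remaining functions from the list above can be tried in the same way. The sketch below uses a small made-up vector `x` so that the results are easy to check by hand:

```r
x <- c(4, 16, 9)
max(x)    # 16
min(x)    # 4
length(x) # 3
prod(x)   # 576
sort(x)   # 4 9 16
var(x)
sd(x)
# The standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))
```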

A **matrix** is like an Excel sheet containing multiple rows and columns. It’s used to combine vectors with the same type, which can be either numeric, character or logical. Matrices are used to store a data table in R. The rows of a matrix are generally individuals/observations and the columns are variables.

To easily create a matrix, use the function **cbind**() or **rbind**() as follows:

```
# Numeric vectors
col1 <- c(5, 6, 7, 8, 9)
col2 <- c(2, 4, 5, 9, 8)
col3 <- c(7, 3, 4, 8, 7)
# Combine the vectors by column
my_data <- cbind(col1, col2, col3)
my_data
```

```
col1 col2 col3
[1,] 5 2 7
[2,] 6 4 3
[3,] 7 5 4
[4,] 8 9 8
[5,] 9 8 7
```

```
# Change rownames
rownames(my_data) <- c("row1", "row2", "row3", "row4", "row5")
my_data
```

```
col1 col2 col3
row1 5 2 7
row2 6 4 3
row3 7 5 4
row4 8 9 8
row5 9 8 7
```

- **cbind()**: combine R objects by columns
- **rbind()**: combine R objects by rows
- **rownames()**: retrieve or set row names of a matrix-like object
- **colnames()**: retrieve or set column names of a matrix-like object

If you want to transpose your data, use the function **t**():

`t(my_data)`

```
row1 row2 row3 row4 row5
col1 5 6 7 8 9
col2 2 4 5 9 8
col3 7 3 4 8 7
```

Note that, it’s also possible to construct a matrix using the function **matrix()**.

The simplified format of **matrix()** is as follows:

```
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
dimnames = NULL)
```

- **data**: an optional data vector
- **nrow**, **ncol**: the desired number of rows and columns, respectively.
- **byrow**: logical value. If FALSE (the default) the matrix is filled by columns, otherwise the matrix is filled by rows.
- **dimnames**: A list of two vectors giving the row and column names respectively.

In the R code below, the input data has length 6. We want to create a matrix with two rows. You don’t need to specify the number of columns (here ncol = 3); R will infer this automatically. The matrix is filled column by column when the argument **byrow = FALSE** (the default). If you want to fill the matrix by rows, as done here, use **byrow = TRUE**.

```
mdat <- matrix(
data = c(1,2,3, 11,12,13),
nrow = 2, byrow = TRUE,
dimnames = list(c("row1", "row2"), c("C.1", "C.2", "C.3"))
)
mdat
```

```
C.1 C.2 C.3
row1 1 2 3
row2 11 12 13
```
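The example above fills the matrix by rows. To see the effect of the **byrow** argument, the following sketch builds the same data both ways; the object names `values`, `by_row` and `by_col` are illustrative.

```r
values <- c(1, 2, 3, 11, 12, 13)
# Filled row by row (as in the example above)
by_row <- matrix(values, nrow = 2, byrow = TRUE)
# Filled column by column (the default, byrow = FALSE)
by_col <- matrix(values, nrow = 2)
by_row
by_col
# In both cases R infers the number of columns from the length of the data
dim(by_col)
```

With `byrow = TRUE` the first row holds 1, 2, 3; with the default it holds 1, 3, 12, because the values are laid down one column at a time.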

The R functions **nrow**() and **ncol**() return the number of rows and columns present in the data, respectively.

`ncol(my_data) # Number of columns`

`[1] 3`

`nrow(my_data) # Number of rows`

`[1] 5`

`dim(my_data) # Number of rows and columns`

`[1] 5 3`

**Select rows/columns** by positive indexing

Rows and/or columns can be selected as follows: my_data[row, col]

```
# Select row number 2
my_data[2, ]
```

```
col1 col2 col3
6 4 3
```

```
# Select row number 2 to 4
my_data[2:4, ]
```

```
col1 col2 col3
row2 6 4 3
row3 7 5 4
row4 8 9 8
```

```
# Select multiple rows that aren't contiguous
# e.g.: rows 2 and 4 but not 3
my_data[c(2,4), ]
```

```
col1 col2 col3
row2 6 4 3
row4 8 9 8
```

```
# Select column number 3
my_data[, 3]
```

```
row1 row2 row3 row4 row5
7 3 4 8 7
```

```
# Select the value at row 2 and column 3
my_data[2, 3]
```

`[1] 3`

**Select by row/column names**

```
# Select column 2
my_data[, "col2"]
```

```
row1 row2 row3 row4 row5
2 4 5 9 8
```

```
# Select by index and names: row 3 and column 2
my_data[3, "col2"]
```

`[1] 5`

**Exclude rows/columns** by negative indexing

```
# Exclude column 1
my_data[, -1]
```

```
col2 col3
row1 2 7
row2 4 3
row3 5 4
row4 9 8
row5 8 7
```

**Selection by logical**: In the R code below, we want to keep only rows where col3 >=4:

```
col3 <- my_data[, "col3"]
my_data[col3 >= 4, ]
```

```
col1 col2 col3
row1 5 2 7
row3 7 5 4
row4 8 9 8
row5 9 8 7
```

- It’s also possible to perform **simple operations on matrices**. For example, the following R code multiplies each element of the matrix by 2:

`my_data*2`

```
col1 col2 col3
row1 10 4 14
row2 12 8 6
row3 14 10 8
row4 16 18 16
row5 18 16 14
```

Or, compute the log2 values:

`log2(my_data)`

```
col1 col2 col3
row1 2.321928 1.000000 2.807355
row2 2.584963 2.000000 1.584963
row3 2.807355 2.321928 2.000000
row4 3.000000 3.169925 3.000000
row5 3.169925 3.000000 2.807355
```

**rowSums()** and **colSums()** functions: compute the total of each row and the total of each column, respectively.

```
# Total of each row
rowSums(my_data)
```

```
row1 row2 row3 row4 row5
14 13 16 25 24
```

```
# Total of each column
colSums(my_data)
```

```
col1 col2 col3
35 28 29
```

If you are interested in row/column means, you can use the functions **rowMeans**() and **colMeans**() to compute row and column means, respectively.
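A short sketch of these two functions, assuming `my_data` holds the values printed in this section (rebuilt here so the code runs on its own):

```r
# Rebuild my_data from the values printed in the outputs of this section
my_data <- cbind(
  col1 = c(5, 6, 7, 8, 9),
  col2 = c(2, 4, 5, 9, 8),
  col3 = c(7, 3, 4, 8, 7)
)
rownames(my_data) <- paste0("row", 1:5)

# Mean of each row
row_means <- rowMeans(my_data)
row_means

# Mean of each column
col_means <- colMeans(my_data)
col_means
```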

Note that it’s also possible to use the function **apply**() to apply any statistical function to the rows/columns of a matrix.

The simplified format of **apply**() is as follows:

`apply(X, MARGIN, FUN)`

- X: your data matrix
- MARGIN: possible values are 1 (for rows) and 2 (for columns)
- FUN: the function to apply on rows/columns

Use **apply**() as follows:

```
# Compute row means
apply(my_data, 1, mean)
```

```
row1 row2 row3 row4 row5
4.666667 4.333333 5.333333 8.333333 8.000000
```

```
# Compute row medians
apply(my_data, 1, median)
```

```
row1 row2 row3 row4 row5
5 4 5 8 8
```

```
# Compute column means
apply(my_data, 2, mean)
```

```
col1 col2 col3
7.0 5.6 5.8
```
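The FUN argument is not limited to built-in functions. As an illustrative sketch, assuming `my_data` holds the values printed in this section, the code below passes a small anonymous function to **apply**() to compute the spread (maximum minus minimum) of each column:

```r
# Rebuild my_data from the values printed in the outputs of this section
my_data <- cbind(
  col1 = c(5, 6, 7, 8, 9),
  col2 = c(2, 4, 5, 9, 8),
  col3 = c(7, 3, 4, 8, 7)
)
rownames(my_data) <- paste0("row", 1:5)

# MARGIN = 2: the function receives one column at a time as the vector x
col_spread <- apply(my_data, 2, function(x) max(x) - min(x))
col_spread
```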

Factor variables represent categories or groups in your data. The function **factor**() can be used to create a factor variable.

```
# Create a factor variable
friend_groups <- factor(c(1, 2, 1, 2))
friend_groups
```

```
[1] 1 2 1 2
Levels: 1 2
```

The variable *friend_groups* contains two categories of friends: 1 and 2. In R terminology, categories are called **factor levels**.

It’s possible to access the factor levels using the function **levels()**:

```
# Get group names (or levels)
levels(friend_groups)
```

`[1] "1" "2"`

```
# Change levels
levels(friend_groups) <- c("best_friend", "not_best_friend")
friend_groups
```

```
[1] best_friend not_best_friend best_friend not_best_friend
Levels: best_friend not_best_friend
```

Note that R orders factor levels alphabetically by default. If you want a different order, you can specify the levels argument in the factor function as follows:

```
# Change the order of levels
friend_groups <- factor(friend_groups,
                        levels = c("not_best_friend", "best_friend"))
# Print
friend_groups
```

```
[1] best_friend not_best_friend best_friend not_best_friend
Levels: not_best_friend best_friend
```
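The level order and the level names can also be set in one step when the factor is created. The following is a sketch using the `levels` and `labels` arguments of **factor**(); it is one possible shortcut and should give the same factor as the two steps shown above:

```r
# Create the factor, fixing the level order (2 first, then 1)
# and giving each level a readable label at the same time
friend_groups <- factor(
  c(1, 2, 1, 2),
  levels = c(2, 1),
  labels = c("not_best_friend", "best_friend")
)
friend_groups

# The levels appear in the order given by the levels argument
levels(friend_groups)
```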

Note that:

- The function **is.factor**() can be used to check whether a variable is a factor. Results are TRUE (if factor) or FALSE (if not factor).
- The function **as.factor**() can be used to convert a variable to a factor.

```
# Check if friend_groups is a factor
is.factor(friend_groups)
```

`[1] TRUE`

```
# Check if "are_married" is a factor
is.factor(are_married)
```

`[1] FALSE`

```
# Convert "are_married" as a factor
as.factor(are_married)
```

```
[1] TRUE FALSE TRUE TRUE
Levels: FALSE TRUE
```

- If you want to know the number of individuals in each level, use the function **summary()**:

`summary(friend_groups)`

```
not_best_friend best_friend
2 2
```

- In the following example, I want to compute the mean salary of my friends by group. The function **tapply**() can be used to apply a function, here **mean**(), to each group.

```
# Salaries of my friends
salaries
```

```
Nicolas Thierry Bernard Jerome
2000 1800 2500 3000
```

```
# Friend groups
friend_groups
```

```
[1] best_friend not_best_friend best_friend not_best_friend
Levels: not_best_friend best_friend
```

```
# Compute the mean salaries by groups
mean_salaries <- tapply(salaries, friend_groups, mean)
mean_salaries
```

```
not_best_friend best_friend
2400 2250
```

```
# Compute the size/length of each group
tapply(salaries, friend_groups, length)
```

```
not_best_friend best_friend
2 2
```

- It’s also possible to use the function **table**() to create a frequency table, also known as a contingency table, of the counts at each combination of factor levels.

`table(friend_groups)`

```
friend_groups
not_best_friend best_friend
2 2
```

```
# Cross-tabulation between
# friend_groups and are_married variables
table(friend_groups, are_married)
```

```
are_married
friend_groups FALSE TRUE
not_best_friend 1 1
best_friend 0 2
```
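Counts can be turned into proportions with the base R function **prop.table**(). This is a small sketch, assuming `friend_groups` and `are_married` hold the values printed above (both are rebuilt here so the block is self-contained):

```r
# Rebuild the two variables from the values printed in this section
friend_groups <- factor(
  c("best_friend", "not_best_friend", "best_friend", "not_best_friend"),
  levels = c("not_best_friend", "best_friend")
)
are_married <- c(TRUE, FALSE, TRUE, TRUE)

# Contingency table of counts
counts <- table(friend_groups, are_married)

# Proportions within each row (margin = 1 refers to rows)
row_props <- prop.table(counts, margin = 1)
row_props
```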

A data frame is like a matrix but can have columns with different types (numeric, character, logical). Rows are observations (individuals) and columns are variables.

A data frame can be created using the function **data.frame()**, as follows:

```
# Create a data frame
friends_data <- data.frame(
  name = my_friends,
  age = friend_ages,
  height = c(180, 170, 185, 169),
  married = are_married
)
# Print
friends_data
```

```
name age height married
Nicolas Nicolas 27 180 TRUE
Thierry Thierry 25 170 FALSE
Bernard Bernard 29 185 TRUE
Jerome Jerome 26 169 TRUE
```
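The outputs in this article were produced with R 3.2.3, where **data.frame**() converts character columns to factors by default. Since R 4.0.0 the default is `stringsAsFactors = FALSE`, so character columns stay as character vectors. The sketch below, which rebuilds the input vectors from the values printed above, sets the argument explicitly so that the result does not depend on the R version:

```r
# Input vectors rebuilt from the values printed in this section
my_friends <- c("Nicolas", "Thierry", "Bernard", "Jerome")
friend_ages <- c(27, 25, 29, 26)
are_married <- c(TRUE, FALSE, TRUE, TRUE)

# Keep the names as a character column
friends_chr <- data.frame(
  name = my_friends, age = friend_ages, married = are_married,
  stringsAsFactors = FALSE
)
class(friends_chr$name)

# Convert the names to a factor column (the behaviour shown in this article)
friends_fct <- data.frame(
  name = my_friends, age = friend_ages, married = are_married,
  stringsAsFactors = TRUE
)
class(friends_fct$name)
```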

To check whether an object is a data frame, use the **is.data.frame**() function. It returns TRUE if the object is a data frame:

`is.data.frame(friends_data)`

`[1] TRUE`

`is.data.frame(my_data)`

`[1] FALSE`

The object “friends_data” is a data frame, but the object “my_data” is not. We can convert it to a data frame using the **as.data.frame**() function:

```
# What is the class of my_data? --> matrix
class(my_data)
```

`[1] "matrix"`

```
# Convert it as a data frame
my_data2 <- as.data.frame(my_data)
# Now, the class is data.frame
class(my_data2)
```

`[1] "data.frame"`

As described in the **matrix** section, you can use the function **t**() to transpose a data frame:

`t(friends_data)`

To select just certain columns from a data frame, you can either refer to the columns by name or by their location (i.e., column 1, 2, 3, etc.).

**Positive indexing** by name and by location

```
# Access the data in 'name' column
# dollar sign is used
friends_data$name
```

```
[1] Nicolas Thierry Bernard Jerome
Levels: Bernard Jerome Nicolas Thierry
```

```
# or use this
friends_data[, 'name']
```

```
[1] Nicolas Thierry Bernard Jerome
Levels: Bernard Jerome Nicolas Thierry
```

```
# Subset columns 1 and 3
friends_data[ , c(1, 3)]
```

```
name height
Nicolas Nicolas 180
Thierry Thierry 170
Bernard Bernard 185
Jerome Jerome 169
```

**Negative indexing**

```
# Exclude column 1
friends_data[, -1]
```

```
age height married
Nicolas 27 180 TRUE
Thierry 25 170 FALSE
Bernard 29 185 TRUE
Jerome 26 169 TRUE
```

**Index by characteristics**

We want to select all friends with age >= 27.

```
# Identify rows that meet the condition
friends_data$age >= 27
```

`[1] TRUE FALSE TRUE FALSE`

TRUE specifies that the row contains a value of age >= 27.

```
# Select the rows that meet the condition
friends_data[friends_data$age >= 27, ]
```

```
name age height married
Nicolas Nicolas 27 180 TRUE
Bernard Bernard 29 185 TRUE
```

The R code above tells R to get all rows from friends_data where age >= 27, and then to return all the columns.

If you don’t want to see all the column data for the selected rows but are just interested in displaying, for example, friend names and age for friends with age >= 27, you could use the following R code:

```
# Use column locations
friends_data[friends_data$age >= 27, c(1, 2)]
```

```
name age
Nicolas Nicolas 27
Bernard Bernard 29
```

```
# Or use column names
friends_data[friends_data$age >= 27, c("name", "age")]
```

```
name age
Nicolas Nicolas 27
Bernard Bernard 29
```

If you’re finding that your selection statement is starting to be inconvenient, you can put your row and column selections into variables first, such as:

```
age27 <- friends_data$age >= 27
cols <- c("name", "age")
```

Then you can select the rows and columns with those variables:

`friends_data[age27, cols]`

```
name age
Nicolas Nicolas 27
Bernard Bernard 29
```

It’s also possible to use the function **subset**() as follows:

```
# Select friends data with age >= 27
subset(friends_data, age >= 27)
```

```
name age height married
Nicolas Nicolas 27 180 TRUE
Bernard Bernard 29 185 TRUE
```
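The function **subset**() also accepts a `select` argument to keep only some columns. Here is a minimal sketch, assuming a data frame with the same values as `friends_data` (rebuilt here so the block runs on its own):

```r
# Rebuild friends_data from the values printed in this section
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Keep rows with age >= 27 and only the columns name and age
older_friends <- subset(friends_data, age >= 27, select = c(name, age))
older_friends
```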

Another option is to use the functions **attach**() and **detach**(). The function **attach**() takes a data frame and makes its columns accessible by simply giving their names.

The functions **attach**() and **detach**() can be used as follows:

```
# Attach a data frame
attach(friends_data)
# === Data manipulation ====
friends_data[age>=27, ]
# === End of data manipulation ====
# Detach the data frame
detach(friends_data)
```
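If you forget to call **detach**(), the column names stay on the search path and can mask other objects. The base R function **with**() is a possible alternative: it evaluates a single expression with the columns made available by name, and nothing has to be detached afterwards. A sketch, assuming a data frame with the same values as `friends_data`:

```r
# Rebuild friends_data from the values printed in this section
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Columns are accessible by name only inside the with() call
selected <- with(friends_data, friends_data[age >= 27, ])
selected

# Any expression can be evaluated this way, e.g. the mean age
mean_age <- with(friends_data, mean(age))
mean_age
```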

**Add a new column to a data frame**

```
# Add group column to friends_data
friends_data$group <- friend_groups
friends_data
```

```
name age height married group
Nicolas Nicolas 27 180 TRUE best_friend
Thierry Thierry 25 170 FALSE not_best_friend
Bernard Bernard 29 185 TRUE best_friend
Jerome Jerome 26 169 TRUE not_best_friend
```

It’s also possible to use the functions **cbind**() and **rbind**() to extend a data frame.

`cbind(friends_data, group = friend_groups)`
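The function **rbind**() works in the same way for rows. In the sketch below, the new friend "Paul" and his values are hypothetical and only serve as an illustration; the new row must have the same column names as the data frame it is added to:

```r
# A reduced version of friends_data, rebuilt from the values in this section
friends_data <- data.frame(
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  stringsAsFactors = FALSE
)

# Hypothetical new observation with the same columns
new_friend <- data.frame(name = "Paul", age = 30, stringsAsFactors = FALSE)

# Append the new row at the bottom of the data frame
friends_extended <- rbind(friends_data, new_friend)
friends_extended
```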

With a numeric data frame, you can use the functions **rowSums**(), **colSums**(), **colMeans**(), **rowMeans**() and **apply**() as described in the **matrix** section.

A list is an ordered collection of objects, which can be vectors, matrices, data frames, etc. In other words, a list can contain any kind of R object.

```
# Create a list
my_family <- list(
mother = "Veronique",
father = "Michel",
sisters = c("Alicia", "Monica"),
sister_age = c(12, 22)
)
# Print
my_family
```

```
$mother
[1] "Veronique"
$father
[1] "Michel"
$sisters
[1] "Alicia" "Monica"
$sister_age
[1] 12 22
```

```
# Names of elements in the list
names(my_family)
```

`[1] "mother" "father" "sisters" "sister_age"`

```
# Number of elements in the list
length(my_family)
```

`[1] 4`

The list object “my_family” contains four components, which may be individually referred to as my_family[[1]], my_family[[2]] and so on.

It’s possible to select an element, from a list, by its name or its index:

- my_family$mother is the same as my_family[[1]]
- my_family$father is the same as my_family[[2]]

```
# Select by name (1/2)
my_family$father
```

`[1] "Michel"`

```
# Select by name (2/2)
my_family[["father"]]
```

`[1] "Michel"`

```
# Select by index
my_family[[1]]
```

`[1] "Veronique"`

`my_family[[3]]`

`[1] "Alicia" "Monica"`

```
# Select a specific element of a component
# select the first ([1]) element of my_family[[3]]
my_family[[3]][1]
```

`[1] "Alicia"`
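Single brackets behave differently from double brackets: `[ ]` returns a list containing the selected components, whereas `[[ ]]` returns the component itself. A short sketch using the same `my_family` list (rebuilt here so the block is self-contained):

```r
# Same list as above
my_family <- list(
  mother = "Veronique",
  father = "Michel",
  sisters = c("Alicia", "Monica"),
  sister_age = c(12, 22)
)

# Single brackets: the result is still a list (of length 1)
one_bracket <- my_family["sisters"]
class(one_bracket)

# Double brackets: the result is the character vector itself
two_brackets <- my_family[["sisters"]]
class(two_brackets)
```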

Note that it’s possible to extend an existing list.

In the R code below, we want to add the components “grand_father” and “grand_mother” to the *my_family* list object:

```
# Extend the list
my_family$grand_father <- "John"
my_family$grand_mother <- "Mary"
# Print
my_family
```

```
$mother
[1] "Veronique"
$father
[1] "Michel"
$sisters
[1] "Alicia" "Monica"
$sister_age
[1] 12 22
$grand_father
[1] "John"
$grand_mother
[1] "Mary"
```

You can also concatenate two or more lists as follows:

`list_abc <- c(list_a, list_b, list_c)`

The result is also a list, whose components are those of the argument lists joined together in sequence.
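A runnable sketch of this concatenation is given below. The contents of `list_a`, `list_b` and `list_c` are not defined in this article, so the values used here are hypothetical:

```r
# Hypothetical example lists
list_a <- list(a = 1)
list_b <- list(b = "two", c = c(3, 4))
list_c <- list(d = TRUE)

# Join the three lists into a single list
list_abc <- c(list_a, list_b, list_c)

# Components keep their names and their order
names(list_abc)
length(list_abc)
```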

This analysis has been performed using **R software** (ver. 3.2.3).

After installing R and RStudio, the question is now how to start using **R/RStudio**. In this article, we’ll describe how to run **RStudio** and to set up your **working directory**.

Note that it’s possible to use R outside or inside RStudio. However, we highly recommend **using R inside RStudio**. RStudio allows users to run R in a more user-friendly environment.

The first time you use R, the suggested procedure, under Windows and MAC OSX, is as follows:

- Create a sub-directory, say **R**, in your “Documents” folder. This sub-folder, also known as the **working directory**, will be used by R to read and save files.
- Launch R by double-clicking on the icon.
- Specify your working directory to R:
  - On Windows: File –> Change directory
  - On MAC OSX: Tools –> Change the working directory

- Open the shell prompt

- Create a working directory, named “R”, using the “mkdir” command:

$ mkdir R
$ cd R

- Start the R program with the command “R”:

$ R

- To quit the R program, type this at the R prompt (not at the shell prompt):

`q()`

Using R inside RStudio is the recommended choice.

After installing R and RStudio, launch RStudio from your computer “application folders”.

**RStudio screen**

RStudio is a four-pane workspace for 1) creating files containing R scripts, 2) typing R commands, 3) viewing command histories, and 4) viewing plots and more.

- Top-left panel: Code editor allowing you to create and open a file containing R script. The R script is where you keep a record of your work. An R script can be created as follows: **File –> New –> R Script**.

- Bottom-left panel: R console for typing R commands

- Top-right panel:
- Workspace tab: shows the list of R objects you created during your R session
- History tab: shows the history of all previous commands

- Bottom-right panel:
- Files tab: shows the files in your working directory
- Plots tab: shows the history of plots you created. From this tab, you can export a plot to a PDF or an image file
- Packages tab: shows the external R packages available on your system. If checked, the package is loaded in R.

For more about RStudio read the online RStudio documentation.

Recall that the working directory is a folder where R reads and saves files.

You can change your working directory as follows:

- Create a sub-directory named “R” in your “Documents” folder
- From RStudio, use the menu to change your working directory under **Session > Set Working Directory > Choose Directory**.
- Choose the directory you’ve just created in the first step

It’s also possible to use the R function **setwd()**, which stands for “set working directory”.

`setwd("/path/to/my/directory")`

For Windows, the command might look like:

`setwd("c:/Documents/my/working/directory")`

Note that, if you want to know your current (or default) R working directory, type the command **getwd()**, which stands for “get working directory”.
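The two functions can be combined to switch to another folder and come back. This is a sketch only: the folder name "R_demo" is hypothetical and is created inside R's temporary directory so that the example does not touch your own files:

```r
# Remember the current working directory
old_dir <- getwd()

# Hypothetical folder, created inside the session's temporary directory
new_dir <- file.path(tempdir(), "R_demo")
dir.create(new_dir, showWarnings = FALSE)

# Change the working directory and check the result
setwd(new_dir)
current_dir <- getwd()
current_dir

# Go back to the original working directory
setwd(old_dir)
```

Storing the original directory first makes it easy to restore it at the end of a script.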

A default working directory is a folder where RStudio goes, every time you open it. You can change the default working directory from RStudio menu under: **Tools –> Global options –> click on “Browse” to select the default working directory you want.**

Each time you close R/RStudio, you will be asked whether you want to save the data from your R session. If you decide to save, the data will be available in future R sessions.


In our previous article, we described what R is and why you should learn it. In this article, we’ll briefly describe how to **install R** and **RStudio** on Windows, MAC OSX and Linux platforms. **RStudio** is an integrated development environment for R that makes using R easier. It includes a console, code editor and tools for plotting.

To make things simple, we recommend installing R first and then RStudio.

R can be downloaded and installed on Windows, MAC OSX and Linux platforms from the Comprehensive R Archive Network (CRAN) webpage (http://cran.r-project.org/).

- After installing R, also install RStudio, available at: http://www.rstudio.com/products/RStudio/.

- Download the latest version of R, for Windows, from CRAN at: https://cran.r-project.org/bin/windows/base/

- Double-click on the file you just downloaded to install R

- Click OK –> Next –> Next –> Next … (no need to change default installation parameters)

Rtools contains tools to build your own packages on Windows, or to build R itself.

- Download Rtools version corresponding to your R version at: https://cran.r-project.org/bin/windows/Rtools/. Use the latest release of Rtools with the latest release of R.

- Double-click on the file you just downloaded to install Rtools (no need to change default installation parameters)

- Download RStudio at: https://www.rstudio.com/products/rstudio/download/

- Download the latest version of R, for MAC OSX, from CRAN at: https://cran.r-project.org/bin/macosx/

- Double-click on the file you just downloaded to install R

- Click OK –> Next –> Next –> Next … (no need to change default installation parameters)

- Download and install the latest version of RStudio for MAC at: https://www.rstudio.com/products/rstudio/download/

- R can be installed on Ubuntu using the following shell command:

sudo apt-get install r-base

- RStudio for Linux is available at https://www.rstudio.com/products/rstudio/download/

To install the latest version of R for Linux, read this: Installing R on Ubuntu
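Once R is installed, one simple way to check that the installation works is to start R and print its version. The sketch below uses only base R; the comparison with version 3.2.3 is just an example threshold, taken from the version used in this article:

```r
# Full version string of the running R installation
R.version.string

# Version as an object that can be compared with a version number
installed_version <- getRversion()
installed_version

# TRUE if the installed R is at least the version used in this article
version_ok <- installed_version >= "3.2.3"
version_ok
```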

It is relatively simple to install R, but if you need further help you can try the following resources:


**R** can be used to compute a large variety of classical statistical tests, including:

- **Student’s t-test** comparing the means of two groups of samples
- **Wilcoxon test**, a non-parametric alternative to the **t-test**
- **Analysis of variance** (ANOVA) comparing the means of more than two groups
- **Chi-square test** comparing proportions/distributions
- **Correlation analysis** for evaluating the relationship between two or more variables

It’s also possible to use R for performing **classification analysis** such as:

- **Principal component analysis**
- **clustering**

**Many types of graphs** can be drawn using R, including: box plot, histogram, density curve, scatter plot, line plot, bar plot, …

- **R** is **open source**, so it’s free.
- **R** is **cross-platform** compatible, so it can be installed on Windows, MAC OSX and Linux.
- **R** provides a wide variety of **statistical techniques** and **graphical capabilities**.
- **R** makes **reproducible research** possible by embedding script and results in a single file.
- **R** has a **vast community** both in academia and in business.
- **R** is **highly extensible** and it has thousands of well-documented extensions (named R packages) for a very broad range of applications in the financial sector, health care, …
- It’s **easy to create R packages** for solving particular problems.
