Easy Guides

Text mining and word cloud fundamentals in R : 5 simple steps you should know

Sun, 12 Feb 2017 05:23:30 +0100

Text mining methods allow us to highlight the most frequently used keywords in a paragraph of texts. One can create a word cloud, also referred as text cloud or tag cloud, which is a visual representation of text data.

The procedure of creating word clouds is very simple in R if you know the different steps to execute. The text mining package (tm) and the word cloud generator package (wordcloud) are available in R for helping us to analyze texts and to quickly visualize the keywords as a word cloud.

In this article, we’ll describe, step by step, how to generate word clouds using the R software.

word cloud and text mining, I have a dream speech from Martin luther king

Contents

3 reasons you should use word clouds to present your text data
Who is using word clouds ?
The 5 main steps to create word clouds in R
Go further
Infos

3 reasons you should use word clouds to present your text data

Word clouds add simplicity and clarity. The most used keywords stand out better in a word cloud
Word clouds are a potent communication tool. They are easy to understand, to be shared and are impactful
Word clouds are visually engaging than a table data

Who is using word clouds ?

Researchers : for reporting qualitative data
Marketers : for highlighting the needs and pain points of customers
Educators : to support essential issues
Politicians and journalists
social media sites : to collect, analyze and share user sentiments

The 5 main steps to create word clouds in R

Step 1: Create a text file

In the following examples, I’ll process the “I have a dream speech” from “Martin Luther King” but you can use any text you want :

Copy and paste the text in a plain text file (e.g : ml.txt)
Save the file

Note that, the text should be saved in a plain text (.txt) file format using your favorite text editor.

Step 2 : Install and load the required packages

Type the R code below, to install and load the required packages:

# Install
install.packages("tm")  # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator 
install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

Step 3 : Text mining

load the text

The text is loaded using Corpus() function from text mining (tm) package. Corpus is a list of a document (in our case, we only have one document).

We start by importing the text file created in Step 1

To import the file saved locally in your computer, type the following R code. You will be asked to choose the text file interactively.

text <- readLines(file.choose())

In the example below, I’ll load a .txt file hosted on STHDA website:

# Read the text file from internet
filePath <- "https://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)

Load the data as a corpus

# Load the data as a corpus
docs <- Corpus(VectorSource(text))

VectorSource() function creates a corpus of character vectors

Inspect the content of the document

inspect(docs)

Text transformation

Transformation is performed using tm_map() function to replace, for example, special characters from the text.

Replacing “/”, “@” and “|” with space:

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

Cleaning the text

the tm_map() function is used to remove unnecessary white space, to convert the text to lower case, to remove common stopwords like ‘the’, “we”.

The information value of ‘stopwords’ is near zero due to the fact that they are so common in a language. Removing this kind of words is useful before further analyses. For ‘stopwords’, supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish. Language names are case sensitive.

I’ll also show you how to make your own list of stopwords to remove from the text.

You could also remove numbers and punctuation with removeNumbers and removePunctuation arguments.

Another important preprocessing step is to make a text stemming which reduces words to their root form. In other words, this process removes suffixes from words to make it simple and to get the common origin. For example, a stemming process reduces the words “moving”, “moved” and “movement” to the root word, “move”.

Note that, text stemming require the package ‘SnowballC’.

The R code below can be used to clean your text :

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)

Step 4 : Build a term-document matrix

Document matrix is a table containing the frequency of the words. Column names are words and row names are documents. The function TermDocumentMatrix() from text mining package can be used as follow :

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7

Step 5 : Generate the Word cloud

The importance of words can be illustrated as a word cloud as follow :

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

word cloud and text mining, I have a dream speech from Martin Luther King

The above word cloud clearly shows that “Will”, “freedom”, “dream”, “day” and “together” are the five most important words in the “I have a dream speech” from Martin Luther King.

Arguments of the word cloud generator function :

words : the words to be plotted
freq : their frequencies
min.freq : words with frequency below min.freq will not be plotted
max.words : maximum number of words to be plotted
random.order : plot words in random order. If false, they will be plotted in decreasing frequency
rot.per : proportion words with 90 degree rotation (vertical text)
colors : color words from least to most frequent. Use, for example, colors =“black” for single color.

Go further

Explore frequent terms and their associations

You can have a look at the frequent terms in the term-document matrix as follow. In the example below we want to find words that occur at least four times :

findFreqTerms(dtm, lowfreq = 4)

 [1] "able"     "day"      "dream"    "every"    "faith"    "free"     "freedom"  "let"      "mountain" "nation"  
[11] "one"      "ring"     "shall"    "together" "will"

You can analyze the association between frequent terms (i.e., terms which correlate) using findAssocs() function. The R code below identifies which words are associated with “freedom” in I have a dream speech :

findAssocs(dtm, terms = "freedom", corlimit = 0.3)

$freedom
         let         ring  mississippi mountainside        stone        every     mountain        state 
        0.89         0.86         0.34         0.34         0.34         0.32         0.32         0.32

The frequency table of words

head(d, 10)

             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7

Plot word frequencies

The frequency of the first 10 frequent words are plotted :

barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
        col ="lightblue", main ="Most frequent words",
        ylab = "Word frequencies")

word cloud and text mining

Infos

This analysis has been performed using R (ver. 3.3.2).

Word cloud generator in R : One killer function to do everything you need

Wed, 14 Jan 2015 20:35:02 +0100

R tag cloud generator function : rquery.wordcloud
Infos

As you may know, a word cloud (or tag cloud) is a text mining method to find the most frequently used words in a text. The procedure to generate a word cloud using R software has been described in my previous post available here : Text mining and word cloud fundamentals in R : 5 simple steps you should know.

The goal of this tutorial is to provide a simple word cloud generator function in R programming language. This function can be used to create a word cloud from different sources including :

an R object containing plain text
a txt file containing plain text. It works with local and online hosted txt files
A URL of a web page

Creating word clouds requires at least five main text-mining steps (described in my previous post). All theses steps can be performed with one line R code using rquery.wordcloud() function described in the next section.

R tag cloud generator function : rquery.wordcloud

The source code of the function is provided at the end of this page.

Usage

The format of rquery.wordcloud() function is shown below :

rquery.wordcloud(x, type=c("text", "url", "file"), 
        lang="english", excludeWords = NULL, 
        textStemming = FALSE,  colorPalette="Dark2",
        max.words=200)

x : character string (plain text, web URL, txt file path)
type : specify whether x is a plain text, a web page URL or a .txt file path
lang : the language of the text. This is important to be specified in order to remove the common stopwords (like ‘the’, ‘we’, ‘is’, ‘are’) from the text before further analysis. Supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish.
excludeWords : a vector containing your own stopwords to be eliminated from the text. e.g : c(“word1”, “word2”)
textStemming : reduces words to their root form. Default value is FALSE. A stemming process reduces the words “moving” and “movement” to the root word, “move”.
colorPalette : Possible values are :
- a name of color palette taken from RColorBrewer package (e.g.: colorPalette = “Dark2”)
- color name (e.g. : colorPalette = “red”)
- a color code (e.g. : colorPalette = “#FF1245”)
min.freq : words with frequency below min.freq will not be plotted
max.words : maximum number of words to be plotted. least frequent terms dropped

Note that, rquery.wordcloud() function returns a list, containing two objects : - tdm : term-document matrix which can be explored as illustrated in the next sections. - freqTable : Frequency table of words

Required R packages

The following packages are required for the rquery.wordcloud() function :

tm for text mining
SnowballC for text stemming
wordcloud for generating word cloud images
RCurl and XML packages to download and parse web pages
RColorBrewer for color palettes

Install these packages, before using the function rquery.wordcloud, as follow :

install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer", "RCurl", "XML")

Create a word cloud from a plain text file

Plain text file can be easily created using your favorite text editor (e.g : Word). “I have a dream speech” (from Martin Luther King) is processed in the following example but you can use any text you want :

Copy and paste your text in a plain text file
Save the file (e.g : ml.txt)

Generate the word cloud using the R code below :

source('https://www.sthda.com/upload/rquery_wordcloud.r')
filePath <- "https://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
res<-rquery.wordcloud(filePath, type ="file", lang = "english")

Change the arguments max.words and min.freq to plot more words :

max.words : maximum number of words to be plotted.
min.freq : words with frequency below min.freq will not be plotted

res<-rquery.wordcloud(filePath, type ="file", lang = "english",
                 min.freq = 1,  max.words = 200)

The above image clearly shows that “Will”, “freedom”, “dream”, “day” and “together” are the five most frequent words in Martin Luther King “I have a dream speech”.

Change the color of the word cloud

The color of the word cloud can be changed using the argument colorPalette.

Allowed values for colorPalete :

a color name (e.g.: colorPalette = “blue”)
a color code (e.g.: colorPalette = “#FF1425”)
a name of a color palette taken from RColorBrewer package (e.g.: colorPalette = “Dark2”)

The color palettes associated to RColorBrewer package are shown below :

Color palette can be changed as follow :

# Reds color palette
res<-rquery.wordcloud(filePath, type ="file", lang = "english",
                      colorPalette = "Reds")

# RdBu color palette
res<-rquery.wordcloud(filePath, type ="file", lang = "english",
                      colorPalette = "RdBu")

# use unique color
res<-rquery.wordcloud(filePath, type ="file", lang = "english",
                      colorPalette = "black")

Operations on the result of rquery.wordcloud() function

As mentioned above, the result of rquery.wordcloud() is a list containing two objects :

tdm : term-document matrix
freqTable : frequency table

tdm <- res$tdm
freqTable <- res$freqTable

Frequency table of words

The frequency of the first top10 words can be displayed and plotted as follow :

# Show the top10 words and their frequency
head(freqTable, 10)

             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7

# Bar plot of the frequency for the top10
barplot(freqTable[1:10,]$freq, las = 2, 
        names.arg = freqTable[1:10,]$word,
        col ="lightblue", main ="Most frequent words",
        ylab = "Word frequencies")

Operations on term-document matrix

You can explore the frequent terms and their associations. In the following example, we want to identify words that occur at least four times :

findFreqTerms(tdm, lowfreq = 4)

 [1] "able"     "day"      "dream"    "every"    "faith"    "free"     "freedom"  "let"      "mountain" "nation"  
[11] "one"      "ring"     "shall"    "together" "will"

You could also analyze the correlation (or association) between frequent terms. The R code below identifies which words are associated with “freedom” in I have a dream speech :

findAssocs(tdm, terms = "freedom", corlimit = 0.3)

             freedom
let             0.89
ring            0.86
mississippi     0.34
mountainside    0.34
stone           0.34
every           0.32
mountain        0.32
state           0.32

Create a word cloud of a web page

In this section we’ll make a tag cloud of the following web page :

https://www.sthda.com/english/wiki/create-and-format-powerpoint-documents-from-r-software

url = "https://www.sthda.com/english/wiki/create-and-format-powerpoint-documents-from-r-software"
rquery.wordcloud(x=url, type="url")

The above word cloud shows that “powerpoint”, “doc”, “slide”, “reporters” are among the most important words on the analyzed web page. This confirms the fact that the article is about creating PowerPoint document using ReporteRs package in R

R code of rquery.wordcloud function

#++++++++++++++++++++++++++++++++++
# rquery.wordcloud() : Word cloud generator
# - https://www.sthda.com
#+++++++++++++++++++++++++++++++++++
# x : character string (plain text, web url, txt file path)
# type : specify whether x is a plain text, a web page url or a file path
# lang : the language of the text
# excludeWords : a vector of words to exclude from the text
# textStemming : reduces words to their root form
# colorPalette : the name of color palette taken from RColorBrewer package, 
  # or a color name, or a color code
# min.freq : words with frequency below min.freq will not be plotted
# max.words : Maximum number of words to be plotted. least frequent terms dropped

# value returned by the function : a list(tdm, freqTable)
rquery.wordcloud <- function(x, type=c("text", "url", "file"), 
                          lang="english", excludeWords=NULL, 
                          textStemming=FALSE,  colorPalette="Dark2",
                          min.freq=3, max.words=200)
{ 
  library("tm")
  library("SnowballC")
  library("wordcloud")
  library("RColorBrewer") 
  
  if(type[1]=="file") text <- readLines(x)
  else if(type[1]=="url") text <- html_to_text(x)
  else if(type[1]=="text") text <- x
  
  # Load the text as a corpus
  docs <- Corpus(VectorSource(text))
  # Convert the text to lower case
  docs <- tm_map(docs, content_transformer(tolower))
  # Remove numbers
  docs <- tm_map(docs, removeNumbers)
  # Remove stopwords for the language 
  docs <- tm_map(docs, removeWords, stopwords(lang))
  # Remove punctuations
  docs <- tm_map(docs, removePunctuation)
  # Eliminate extra white spaces
  docs <- tm_map(docs, stripWhitespace)
  # Remove your own stopwords
  if(!is.null(excludeWords)) 
    docs <- tm_map(docs, removeWords, excludeWords) 
  # Text stemming
  if(textStemming) docs <- tm_map(docs, stemDocument)
  # Create term-document matrix
  tdm <- TermDocumentMatrix(docs)
  m <- as.matrix(tdm)
  v <- sort(rowSums(m),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)
  # check the color palette name 
  if(!colorPalette %in% rownames(brewer.pal.info)) colors = colorPalette
  else colors = brewer.pal(8, colorPalette) 
  # Plot the word cloud
  set.seed(1234)
  wordcloud(d$word,d$freq, min.freq=min.freq, max.words=max.words,
            random.order=FALSE, rot.per=0.35, 
            use.r.layout=FALSE, colors=colors)
  
  invisible(list(tdm=tdm, freqTable = d))
}

#++++++++++++++++++++++
# Helper function
#++++++++++++++++++++++
# Download and parse webpage
html_to_text<-function(url){
  library(RCurl)
  library(XML)
  # download html
  html.doc <- getURL(url)  
  #convert to plain text
  doc = htmlParse(html.doc, asText=TRUE)
 # "//text()" returns all text outside of HTML tags.
 # We also don’t want text such as style and script codes
  text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
  # Format text vector into one character string
  return(paste(text, collapse = " "))
}

Infos

This analysis has been performed using R (ver. 3.1.0).

Text mining

Sun, 11 Jan 2015 12:39:42 +0100

This category contains articles about text mining and word cloud. Scroll down to the bottom of the page. You will find many tutorial about how to generate word cloud using R software