<?xml version="1.0" encoding="UTF-8" ?>
<!-- RSS generated by PHPBoost on Tue, 26 May 2026 15:02:36 +0200 -->

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title><![CDATA[Easy Guides]]></title>
		<atom:link href="https://www.sthda.com/english/syndication/rss/wiki/30" rel="self" type="application/rss+xml"/>
		<link>https://www.sthda.com</link>
		<description><![CDATA[Last articles of the category: Text mining]]></description>
		<copyright>(C) 2005-2026 PHPBoost</copyright>
		<language>en</language>
		<generator>PHPBoost</generator>
		
		
		<item>
			<title><![CDATA[Text mining and word cloud fundamentals in R : 5 simple steps you should know]]></title>
			<link>https://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know</link>
			<guid>https://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know</guid>
			<description><![CDATA[<!-- START HTML -->

  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">
<p><strong>Text mining</strong> methods allow us to highlight the most frequently used keywords in a text. One can create a <strong>word cloud</strong>, also referred to as a <em>text cloud</em> or <em>tag cloud</em>, which is a visual representation of text data.</p>
<p>The procedure of creating word clouds is very simple in R if you know the different steps to execute. The text mining package (<em>tm</em>) and the word cloud generator package (<em>wordcloud</em>) are available in R for helping us to analyze texts and to quickly visualize the keywords as a word cloud.</p>
<p><span class="success">In this article, we’ll describe, step by step, how to generate <strong>word clouds</strong> using the R software.</span></p>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/images/word-cloud.png" alt="word cloud and text mining, I have a dream speech from Martin Luther King" />
<p class="caption">word cloud and text mining, I have a dream speech from Martin Luther King</p>
</div>
<br/>
<div id="TOC">
  <strong>Contents</strong>
<ul>
<li><a href="#reasons-you-should-use-word-clouds-to-present-your-text-data">3 reasons you should use word clouds to present your text data</a></li>
<li><a href="#who-is-using-word-clouds">Who is using word clouds ?</a></li>
<li><a href="#the-5-main-steps-to-create-word-clouds-in-r">The 5 main steps to create word clouds in R</a><ul>
<li><a href="#step-1-create-a-text-file">Step 1: Create a text file</a></li>
<li><a href="#step-2-install-and-load-the-required-packages">Step 2 : Install and load the required packages</a></li>
<li><a href="#step-3-text-mining">Step 3 : Text mining</a></li>
<li><a href="#step-4-build-a-term-document-matrix">Step 4 : Build a term-document matrix</a></li>
<li><a href="#step-5-generate-the-word-cloud">Step 5 : Generate the Word cloud</a></li>
</ul></li>
<li><a href="#go-further">Go further</a><ul>
<li><a href="#explore-frequent-terms-and-their-associations">Explore frequent terms and their associations</a></li>
<li><a href="#the-frequency-table-of-words">The frequency table of words</a></li>
<li><a href="#plot-word-frequencies">Plot word frequencies</a></li>
</ul></li>
<li><a href="#infos">Infos</a></li>
</ul>
</div>
<p><br/></p>
<div id="reasons-you-should-use-word-clouds-to-present-your-text-data" class="section level2">
<h2>3 reasons you should use word clouds to present your text data</h2>
<ol style="list-style-type: decimal">
<li><strong>Word clouds</strong> add simplicity and clarity. The most used keywords stand out better in a word cloud
</li>
<li><strong>Word clouds</strong> are a potent communication tool. They are easy to understand, easy to share and impactful</li>
<li><strong>Word clouds</strong> are more visually engaging than a data table</li>
</ol>
</div>
<div id="who-is-using-word-clouds" class="section level2">
<h2>Who is using word clouds ?</h2>
<ul>
<li>Researchers : for reporting qualitative data</li>
<li>Marketers : for highlighting the needs and pain points of customers</li>
<li>Educators : to support essential issues</li>
<li>Politicians and journalists</li>
<li>Social media sites : to collect, analyze and share user sentiments</li>
</ul>
</div>
<div id="the-5-main-steps-to-create-word-clouds-in-r" class="section level2">
<h2>The 5 main steps to create word clouds in R</h2>
<div id="step-1-create-a-text-file" class="section level3">
<h3>Step 1: Create a text file</h3>
<p>In the following examples, I’ll process the “<strong>I have a dream speech</strong>” from “<strong>Martin Luther King</strong>” but you can use any text you want :</p>
<ul>
<li>Copy and paste the text in a plain text file (e.g : ml.txt)</li>
<li>Save the file</li>
</ul>
<p><span class="warning">Note that, the text should be saved in a plain text (.txt) file format using your favorite text editor.</span></p>
</div>
<div id="step-2-install-and-load-the-required-packages" class="section level3">
<h3>Step 2 : Install and load the required packages</h3>
<p>Type the R code below, to install and load the required packages:</p>
<pre class="r"><code># Install
install.packages("tm")  # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator 
install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")</code></pre>
</div>
<div id="step-3-text-mining" class="section level3">
<h3>Step 3 : Text mining</h3>
<div id="load-the-text" class="section level4">
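<h4>Load the text</h4>
<p>The text is loaded using the <strong>Corpus()</strong> function from the <strong>text mining</strong> (tm) package. A corpus is a collection of documents. Here the corpus is built from the character vector returned by readLines(), so each line of the text file is treated as one document.</p>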
<ol style="list-style-type: decimal">
<li><strong>We start by importing the text file created in Step 1</strong></li>
</ol>
<p>To import a file saved locally on your computer, type the following R code. You will be asked to choose the text file interactively.</p>
<pre class="r"><code>text <- readLines(file.choose())</code></pre>
<p>In the example below, I’ll load a .txt file hosted on the STHDA website:</p>
<pre class="r"><code># Read the text file from internet
filePath <- "https://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)</code></pre>
<ol start="2" style="list-style-type: decimal">
<li><strong>Load the data as a corpus</strong></li>
</ol>
<pre class="r"><code># Load the data as a corpus
docs <- Corpus(VectorSource(text))</code></pre>
<p><span class="warning">The VectorSource() function creates a source from a character vector, treating each element of the vector as a document</span></p>
<ol start="3" style="list-style-type: decimal">
<li><strong>Inspect the content of the document</strong></li>
</ol>
<pre class="r"><code>inspect(docs)</code></pre>
</div>
<div id="text-transformation" class="section level4">
<h4>Text transformation</h4>
<p>Transformation is performed using the <strong>tm_map()</strong> function to replace, for example, special characters in the text.</p>
<p>Replacing “/”, “@” and “|” with space:</p>
<pre class="r"><code>toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")</code></pre>
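<p>As a small illustration of why the last pattern is written as "\\|", the sketch below applies gsub() directly to a made-up sample string (the string and the variable names are assumptions used only for illustration). In a regular expression, | means “or”, so it has to be escaped, or matched literally with fixed = TRUE:</p>
<pre class="r"><code># Hypothetical sample string, not taken from the speech
x <- "read/write|user@example.com"
# Escape the pipe so that it is matched as a literal character
escaped <- gsub("\\|", " ", x)
# Alternative: switch off regular expressions with fixed = TRUE
literal <- gsub("|", " ", x, fixed = TRUE)
# Both approaches replace the pipe with a space
identical(escaped, literal)</code></pre>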
</div>
<div id="cleaning-the-text" class="section level4">
<h4>Cleaning the text</h4>
<p>The <strong>tm_map()</strong> function is used to remove unnecessary white space, to convert the text to lower case, and to remove common stopwords like “the” and “we”.</p>
<p>The information value of stopwords is near zero because they are so common in a language. Removing these words is useful before further analyses. For stopwords, supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish. Language names are case sensitive.</p>
<p><span class="success">I’ll also show you how to make your own list of stopwords to remove from the text.</span></p>
<p>You can also remove numbers and punctuation by passing the <strong>removeNumbers</strong> and <strong>removePunctuation</strong> transformations to tm_map().</p>
<p>Another important preprocessing step is <strong>text stemming</strong>, which reduces words to their root form. In other words, this process removes suffixes from words to simplify them and recover their common origin. For example, a stemming process reduces the words “moving”, “moved” and “movement” to the root word, “move”.</p>
<p><span class="warning">Note that text stemming requires the package ‘SnowballC’. </span></p>
<p>The R code below can be used to clean your text :</p>
<pre class="r"><code># Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)</code></pre>
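<p>To see what each cleaning step does, it can help to apply the same tm transformations to a single character string, since they also accept plain character vectors. This is only a sketch: the sample sentence and the custom stopword “ring” are assumptions chosen for illustration.</p>
<pre class="r"><code>library("tm")
# Hypothetical sample sentence used only for illustration
x <- "We will let Freedom ring in 1963, and we shall be free!"
x <- tolower(x)            # convert to lower case
x <- removeNumbers(x)      # drop digits
# Remove English stopwords plus one custom word
x <- removeWords(x, c(stopwords("english"), "ring"))
x <- removePunctuation(x)  # drop punctuation marks
x <- stripWhitespace(x)    # collapse repeated spaces
x</code></pre>
<p>Printing x after each line shows the effect of that step alone, which is handy when deciding in which order to apply the transformations.</p>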
</div>
</div>
<div id="step-4-build-a-term-document-matrix" class="section level3">
<h3>Step 4 : Build a term-document matrix</h3>
<p>A term-document matrix is a table containing the frequency of the words. Row names are words (terms) and column names are documents, which is why the row sums below give the total frequency of each word. The function <em>TermDocumentMatrix()</em> from the <strong>text mining</strong> package can be used as follows :</p>
<pre class="r"><code>dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)</code></pre>
<pre><code>             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7</code></pre>
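<p>The layout of the matrix is easier to see on a tiny example. The sketch below builds a term-document matrix from two made-up one-line documents (the sentences and the object names toy and toy_m are assumptions for illustration):</p>
<pre class="r"><code>library("tm")
# Two hypothetical one-line documents
toy <- Corpus(VectorSource(c("let freedom ring", "freedom and justice")))
toy_m <- as.matrix(TermDocumentMatrix(toy))
# One row per term, one column per document
toy_m
# Total frequency of each term across the documents
rowSums(toy_m)</code></pre>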
</div>
<div id="step-5-generate-the-word-cloud" class="section level3">
<h3>Step 5 : Generate the Word cloud</h3>
<p>The importance of words can be illustrated as a <strong>word cloud</strong> as follows :</p>
<pre class="r"><code>set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/text-mining/word-cloud-martin-luther-king-i-have-a-dream-speech.png" alt="word cloud and text mining, I have a dream speech from Martin Luther King" width="480" style="margin-bottom:10px;" />
<p class="caption">
word cloud and text mining, I have a dream speech from Martin Luther King
</p>
</div>
<p>The above <strong>word cloud</strong> shows that “will”, “freedom”, “ring”, “day” and “dream” are among the most frequent words in the “<strong>I have a dream speech</strong>” from <strong>Martin Luther King</strong>, in line with the frequency table from Step 4.</p>
<p>Arguments of the <strong>word cloud generator</strong> function :</p>
<br/>
<div class="block">
<ul>
<li>words : the words to be plotted</li>
<li>freq : their frequencies</li>
<li>min.freq : words with frequency below min.freq will not be plotted</li>
<li>max.words : maximum number of words to be plotted</li>
<li>random.order : plot words in random order. If FALSE, they will be plotted in decreasing frequency</li>
<li>rot.per : proportion of words with 90 degree rotation (vertical text)</li>
<li>colors : color words from least to most frequent. Use, for example, colors = “black” for a single color.</li>
</ul>
</div>
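<p>As a sketch of how these arguments change the result, the code below draws a smaller cloud in a single color with no rotated words. To keep it self-contained it rebuilds a small data frame, named d_top here as an assumption, from the ten frequencies listed in the table of Step 4:</p>
<pre class="r"><code>library("wordcloud")
# Frequencies copied from the table shown in Step 4
d_top <- data.frame(
  word = c("will", "freedom", "ring", "day", "dream",
           "let", "every", "able", "one", "together"),
  freq = c(17, 13, 12, 11, 11, 11, 9, 8, 8, 7))
set.seed(1234)
# Keep words seen at least 8 times, no rotation, single color
wordcloud(words = d_top$word, freq = d_top$freq, min.freq = 8,
          random.order = FALSE, rot.per = 0, colors = "black")</code></pre>
<p>With min.freq = 8, only “together” (7 occurrences) is left out of this small example.</p>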
<p><br/></p>
</div>
</div>
<div id="go-further" class="section level2">
<h2>Go further</h2>
<div id="explore-frequent-terms-and-their-associations" class="section level3">
<h3>Explore frequent terms and their associations</h3>
<p>You can have a look at the frequent terms in the term-document matrix as follows. In the example below we want to find words that occur at least four times :</p>
<pre class="r"><code>findFreqTerms(dtm, lowfreq = 4)</code></pre>
<pre><code> [1] "able"     "day"      "dream"    "every"    "faith"    "free"     "freedom"  "let"      "mountain" "nation"  
[11] "one"      "ring"     "shall"    "together" "will"    </code></pre>
<p>You can analyze the association between frequent terms (i.e., terms which correlate) using findAssocs() function. The R code below identifies which words are associated with “freedom” in <strong>I have a dream speech</strong> :</p>
<pre class="r"><code>findAssocs(dtm, terms = "freedom", corlimit = 0.3)</code></pre>
<pre><code>$freedom
         let         ring  mississippi mountainside        stone        every     mountain        state 
        0.89         0.86         0.34         0.34         0.34         0.32         0.32         0.32 </code></pre>
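<p>The scores above can be read as correlations between the count vectors of two terms across documents. The sketch below illustrates that idea with base R’s cor() on a made-up matrix of counts. The numbers and the object name toy_counts are assumptions, and the code mirrors the idea rather than the internals of findAssocs():</p>
<pre class="r"><code># Hypothetical counts: rows are terms, columns are four documents
toy_counts <- rbind(
  freedom = c(1, 0, 2, 1),
  ring    = c(1, 0, 1, 1),
  dream   = c(0, 2, 0, 0))
# Terms that tend to appear in the same documents correlate positively
cor(toy_counts["freedom", ], toy_counts["ring", ])
# Terms that appear in different documents correlate negatively
cor(toy_counts["freedom", ], toy_counts["dream", ])</code></pre>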
</div>
<div id="the-frequency-table-of-words" class="section level3">
<h3>The frequency table of words</h3>
<pre class="r"><code>head(d, 10)</code></pre>
<pre><code>             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7</code></pre>
</div>
<div id="plot-word-frequencies" class="section level3">
<h3>Plot word frequencies</h3>
<p>The frequencies of the 10 most frequent words are plotted :</p>
<pre class="r"><code>barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
        col ="lightblue", main ="Most frequent words",
        ylab = "Word frequencies")</code></pre>
<div class="figure">
<img src="https://www.sthda.com/english/sthda/RDoc/figure/text-mining/word-cloud-frequency-plot.png" alt="word cloud and text mining" width="384" style="margin-bottom:10px;" />
<p class="caption">
word cloud and text mining
</p>
</div>
</div>
</div>
<div id="infos" class="section level1">
<h1>Infos</h1>
<p><span class="warning"> This analysis has been performed using R (ver. 3.3.2). </span></p>
</div>
<script>jQuery(document).ready(function () {
    jQuery('#rdoc h1').addClass('wiki_paragraph1');
    jQuery('#rdoc h2').addClass('wiki_paragraph2');
    jQuery('#rdoc h3').addClass('wiki_paragraph3');
    jQuery('#rdoc h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->

<!-- END HTML -->]]></description>
			<pubDate>Sun, 12 Feb 2017 05:23:30 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Word cloud generator in R : One killer function to do everything you need]]></title>
			<link>https://www.sthda.com/english/wiki/word-cloud-generator-in-r-one-killer-function-to-do-everything-you-need</link>
			<guid>https://www.sthda.com/english/wiki/word-cloud-generator-in-r-one-killer-function-to-do-everything-you-need</guid>
			<description><![CDATA[<!-- START HTML -->

            
  <!--====================== start from here when you copy to sthda================-->  
  <div id="rdoc">

<div id="TOC">
<ul>
<li><a href="#r-tag-cloud-generator-function-rquery.wordcloud">R tag cloud generator function : rquery.wordcloud</a><ul>
<li><a href="#usage">Usage</a></li>
<li><a href="#required-r-packages">Required R packages</a></li>
<li><a href="#create-a-word-cloud-from-a-plain-text-file">Create a word cloud from a plain text file</a></li>
<li><a href="#change-the-color-of-the-word-cloud">Change the color of the word cloud</a></li>
<li><a href="#operations-on-the-result-of-rquery.wordcloud-function">Operations on the result of rquery.wordcloud() function</a><ul>
<li><a href="#frequency-table-of-words">Frequency table of words</a></li>
<li><a href="#operations-on-term-document-matrix">Operations on term-document matrix</a></li>
</ul></li>
<li><a href="#create-a-word-cloud-of-a-web-page">Create a word cloud of a web page</a></li>
<li><a href="#r-code-of-rquery.wordcloud-function">R code of rquery.wordcloud function</a></li>
</ul></li>
<li><a href="#infos">Infos</a></li>
</ul>
</div>

<p><br/></p>
<p>As you may know, a <strong>word cloud</strong> (or <strong>tag cloud</strong>) is a <strong>text mining</strong> method to find the most frequently used words in a text. The procedure to generate a <strong>word cloud</strong> using <strong>R software</strong> is described in my previous post : <a href="https://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know">Text mining and word cloud fundamentals in R : 5 simple steps you should know</a>.</p>
<p>The goal of this tutorial is to provide a simple <strong>word cloud generator</strong> function in <strong>R programming</strong> language. This function can be used to create a word cloud from different sources including :</p>
<ul>
<li>an R object containing plain text</li>
<li>a txt file containing plain text. It works with local and online hosted txt files</li>
<li>a URL of a web page</li>
</ul>
<p><img src="https://www.sthda.com/english/sthda/RDoc/images/word-cloud.png" alt="tag cloud generator, word cloud and text mining, I have a dream speech from Martin Luther King" /></p>
<p>Creating word clouds requires at least <a href="https://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know">five main text-mining steps</a> (described in my previous post). All these steps can be performed with one line of R code using the <strong>rquery.wordcloud()</strong> function described in the next section.</p>
<div id="r-tag-cloud-generator-function-rquery.wordcloud" class="section level1">
<h1>R tag cloud generator function : rquery.wordcloud</h1>
<p>The source code of the function is provided at the end of this page.</p>
<div id="usage" class="section level2">
<h2>Usage</h2>
<p>The format of <em>rquery.wordcloud()</em> function is shown below :</p>
<pre class="r"><code>rquery.wordcloud(x, type=c("text", "url", "file"), 
        lang="english", excludeWords = NULL, 
        textStemming = FALSE,  colorPalette="Dark2",
        min.freq=3, max.words=200)</code></pre>
<br/>
<div class="block">
<ul>
<li><strong>x</strong> : character string (plain text, web URL, txt file path)</li>
<li><strong>type</strong> : specify whether x is a plain text, a web page URL or a .txt file path</li>
<li><strong>lang</strong> : the language of the text. This is important to be specified in order to remove the common stopwords (like ‘the’, ‘we’, ‘is’, ‘are’) from the text before further analysis. Supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish.</li>
<li><strong>excludeWords</strong> : a vector containing your own stopwords to be eliminated from the text. e.g : c(“word1”, “word2”)</li>
<li><strong>textStemming</strong> : reduces words to their root form. Default value is FALSE. A stemming process reduces the words “moving” and “movement” to the root word, “move”.</li>
<li><strong>colorPalette</strong> : Possible values are :
<ul>
<li>a name of color palette taken from RColorBrewer package (e.g.: colorPalette = “Dark2”)</li>
<li>a color name (e.g. : colorPalette = “red”)</li>
<li>a color code (e.g. : colorPalette = “#FF1245”)</li>
</ul></li>
<li><strong>min.freq</strong> : words with frequency below min.freq will not be plotted</li>
<li><strong>max.words</strong> : maximum number of words to be plotted. The least frequent terms are dropped</li>
</ul>
</div>
<p><br/></p>
<br/>
<div class="warning">
Note that the rquery.wordcloud() function returns a list containing two objects :
<ul>
<li>tdm : <strong>term-document matrix</strong>, which can be explored as illustrated in the next sections</li>
<li>freqTable : <strong>frequency table of words</strong></li>
</ul>
</div>
<p><br/></p>
</div>
<div id="required-r-packages" class="section level2">
<h2>Required R packages</h2>
<p>The following packages are required for the <strong>rquery.wordcloud()</strong> function :</p>
<ul>
<li><strong>tm</strong> for <strong>text mining</strong></li>
<li><strong>SnowballC</strong> for <strong>text stemming</strong></li>
<li><strong>wordcloud</strong> for generating word cloud images</li>
<li><strong>RCurl</strong> and <strong>XML</strong> packages to download and parse web pages</li>
<li><strong>RColorBrewer</strong> for color palettes</li>
</ul>
<p>Install these packages before using the function <strong>rquery.wordcloud</strong>, as follows :</p>
<pre class="r"><code>install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer", "RCurl", "XML"))</code></pre>
</div>
<div id="create-a-word-cloud-from-a-plain-text-file" class="section level2">
<h2>Create a word cloud from a plain text file</h2>
<p>A plain text file can easily be created using your favorite text editor (if you use a word processor such as Word, save the document as plain text, .txt). “<strong>I have a dream speech</strong>” (from <strong>Martin Luther King</strong>) is processed in the following example but you can use any text you want :</p>
<ul>
<li>Copy and paste your text in a plain text file</li>
<li>Save the file (e.g : ml.txt)</li>
</ul>
<p>Generate the word cloud using the R code below :</p>
<pre class="r"><code>source(&amp;#39;https://www.sthda.com/upload/rquery_wordcloud.r&amp;#39;)
filePath <- "https://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
res<-rquery.wordcloud(filePath, type ="file", lang = "english")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/text-mining/word-cloud-generator-martin-luther-king-i-have-a-dream-file.png" title="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" alt="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" width="288" style="margin-bottom:10px;" /></p>
<p>Change the arguments <em>max.words</em> and <em>min.freq</em> to plot more words :</p>
<ul>
<li>max.words : maximum number of words to be plotted.</li>
<li>min.freq : words with frequency below min.freq will not be plotted</li>
</ul>
<pre class="r"><code>res<-rquery.wordcloud(filePath, type ="file", lang = "english",
                 min.freq = 1,  max.words = 200)</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/text-mining/word-cloud-generator-martin-luther-king-i-have-a-dream-file-2.png" title="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" alt="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" width="432" style="margin-bottom:10px;" /></p>
<p><span class="success">The above image shows that “will”, “freedom”, “ring”, “day” and “dream” are among the most frequent words in <strong>Martin Luther King</strong>’s “<strong>I have a dream speech</strong>”.</span></p>
</div>
<div id="change-the-color-of-the-word-cloud" class="section level2">
<h2>Change the color of the word cloud</h2>
<p>The color of the word cloud can be changed using the argument <em>colorPalette</em>.</p>
<p>Allowed values for <em>colorPalette</em> :</p>
<ul>
<li>a color name (e.g.: colorPalette = “blue”)</li>
<li>a color code (e.g.: colorPalette = “#FF1425”)</li>
<li>a name of a color palette taken from <strong>RColorBrewer</strong> package (e.g.: colorPalette = “Dark2”)</li>
</ul>
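<p>The function source at the end of this page tells these cases apart by checking whether the value is a palette name listed in brewer.pal.info. The sketch below isolates that check in a small helper. The name resolve_colors is hypothetical and is not part of rquery.wordcloud():</p>
<pre class="r"><code>library("RColorBrewer")
# Hypothetical helper: turn a colorPalette value into a vector of colors
resolve_colors <- function(colorPalette) {
  if (colorPalette %in% rownames(brewer.pal.info)) {
    # A known RColorBrewer palette name: expand it to 8 colors
    brewer.pal(8, colorPalette)
  } else {
    # Otherwise treat the value as a color name or color code
    colorPalette
  }
}
resolve_colors("Dark2")    # vector of 8 color codes
resolve_colors("blue")     # returned unchanged
resolve_colors("#FF1425")  # returned unchanged</code></pre>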
<p>The color palettes available in the <strong>RColorBrewer</strong> package are shown below :</p>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/text-mining/word-cloud-generator-rcolorbrewer-palettes.png" title="Rcolorbrewer palettes" alt="Rcolorbrewer palettes" width="336" style="margin-bottom:10px;" /></p>
<p>The color palette can be changed as follows :</p>
<pre class="r"><code># Reds color palette
res<-rquery.wordcloud(filePath, type ="file", lang = "english",
                      colorPalette = "Reds")

# RdBu color palette
res<-rquery.wordcloud(filePath, type ="file", lang = "english",
                      colorPalette = "RdBu")

# Use a single color
res<-rquery.wordcloud(filePath, type ="file", lang = "english",
                      colorPalette = "black")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/text-mining/word-cloud-generator-martin-luther-king-i-have-a-dream-file-change-color1.png" title="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" alt="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" width="240" style="margin-bottom:10px;" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/text-mining/word-cloud-generator-martin-luther-king-i-have-a-dream-file-change-color2.png" title="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" alt="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" width="240" style="margin-bottom:10px;" /><img src="https://www.sthda.com/english/sthda/RDoc/figure/text-mining/word-cloud-generator-martin-luther-king-i-have-a-dream-file-change-color3.png" title="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" alt="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" width="240" style="margin-bottom:10px;" /></p>
</div>
<div id="operations-on-the-result-of-rquery.wordcloud-function" class="section level2">
<h2>Operations on the result of rquery.wordcloud() function</h2>
<p>As mentioned above, the result of <em>rquery.wordcloud()</em> is a list containing two objects :</p>
<ul>
<li>tdm : term-document matrix</li>
<li>freqTable : frequency table</li>
</ul>
<pre class="r"><code>tdm <- res$tdm
freqTable <- res$freqTable</code></pre>
<div id="frequency-table-of-words" class="section level3">
<h3>Frequency table of words</h3>
<p>The frequencies of the top 10 words can be displayed and plotted as follows :</p>
<pre class="r"><code># Show the top10 words and their frequency
head(freqTable, 10)</code></pre>
<pre><code>             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7</code></pre>
<pre class="r"><code># Bar plot of the frequency for the top10
barplot(freqTable[1:10,]$freq, las = 2, 
        names.arg = freqTable[1:10,]$word,
        col ="lightblue", main ="Most frequent words",
        ylab = "Word frequencies")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/text-mining/word-cloud-generator-word-cloud-frequency.png" title="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" alt="text mining, word cloud, tag cloud generator, martin luther king, i have a dream speech" width="336" style="margin-bottom:10px;" /></p>
</div>
<div id="operations-on-term-document-matrix" class="section level3">
<h3>Operations on term-document matrix</h3>
<p>You can explore the frequent terms and their associations. In the following example, we want to identify words that occur at least four times :</p>
<pre class="r"><code>findFreqTerms(tdm, lowfreq = 4)</code></pre>
<pre><code> [1] "able"     "day"      "dream"    "every"    "faith"    "free"     "freedom"  "let"      "mountain" "nation"  
[11] "one"      "ring"     "shall"    "together" "will"    </code></pre>
<p>You could also analyze the correlation (or association) between frequent terms. The R code below identifies which words are associated with “freedom” in <strong>I have a dream speech</strong> :</p>
<pre class="r"><code>findAssocs(tdm, terms = "freedom", corlimit = 0.3)</code></pre>
<pre><code>             freedom
let             0.89
ring            0.86
mississippi     0.34
mountainside    0.34
stone           0.34
every           0.32
mountain        0.32
state           0.32</code></pre>
</div>
</div>
<div id="create-a-word-cloud-of-a-web-page" class="section level2">
<h2>Create a word cloud of a web page</h2>
<p>In this section we’ll make a <strong>tag cloud</strong> of the following web page :</p>
<p><a href="https://www.sthda.com/english/wiki/create-and-format-powerpoint-documents-from-r-software">https://www.sthda.com/english/wiki/create-and-format-powerpoint-documents-from-r-software</a></p>
<pre class="r"><code>url = "https://www.sthda.com/english/wiki/create-and-format-powerpoint-documents-from-r-software"
rquery.wordcloud(x=url, type="url")</code></pre>
<p><img src="https://www.sthda.com/english/sthda/RDoc/figure/text-mining/word-cloud-generator-web-page.png" title="text mining, word cloud, tag cloud generator" alt="text mining, word cloud, tag cloud generator" width="432" style="margin-bottom:10px;" /></p>
<p><span class="success">The above word cloud shows that “powerpoint”, “doc”, “slide” and “reporters” are among the most important words on the analyzed web page. This is consistent with the fact that the article is about creating PowerPoint documents using the ReporteRs package in R.</span></p>
</div>
<div id="r-code-of-rquery.wordcloud-function" class="section level2">
<h2>R code of rquery.wordcloud function</h2>
<pre class="r"><code>#++++++++++++++++++++++++++++++++++
# rquery.wordcloud() : Word cloud generator
# - https://www.sthda.com
#+++++++++++++++++++++++++++++++++++
# x : character string (plain text, web url, txt file path)
# type : specify whether x is a plain text, a web page url or a file path
# lang : the language of the text
# excludeWords : a vector of words to exclude from the text
# textStemming : reduces words to their root form
# colorPalette : the name of color palette taken from RColorBrewer package, 
  # or a color name, or a color code
# min.freq : words with frequency below min.freq will not be plotted
# max.words : Maximum number of words to be plotted. least frequent terms dropped

# value returned by the function : a list(tdm, freqTable)
rquery.wordcloud <- function(x, type=c("text", "url", "file"), 
                          lang="english", excludeWords=NULL, 
                          textStemming=FALSE,  colorPalette="Dark2",
                          min.freq=3, max.words=200)
{ 
  library("tm")
  library("SnowballC")
  library("wordcloud")
  library("RColorBrewer") 
  
  if(type[1]=="file") text <- readLines(x)
  else if(type[1]=="url") text <- html_to_text(x)
  else if(type[1]=="text") text <- x
  
  # Load the text as a corpus
  docs <- Corpus(VectorSource(text))
  # Convert the text to lower case
  docs <- tm_map(docs, content_transformer(tolower))
  # Remove numbers
  docs <- tm_map(docs, removeNumbers)
  # Remove stopwords for the language 
  docs <- tm_map(docs, removeWords, stopwords(lang))
  # Remove punctuations
  docs <- tm_map(docs, removePunctuation)
  # Eliminate extra white spaces
  docs <- tm_map(docs, stripWhitespace)
  # Remove your own stopwords
  if(!is.null(excludeWords)) 
    docs <- tm_map(docs, removeWords, excludeWords) 
  # Text stemming
  if(textStemming) docs <- tm_map(docs, stemDocument)
  # Create term-document matrix
  tdm <- TermDocumentMatrix(docs)
  m <- as.matrix(tdm)
  v <- sort(rowSums(m),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)
  # check the color palette name 
  if(!colorPalette %in% rownames(brewer.pal.info)) colors = colorPalette
  else colors = brewer.pal(8, colorPalette) 
  # Plot the word cloud
  set.seed(1234)
  wordcloud(d$word,d$freq, min.freq=min.freq, max.words=max.words,
            random.order=FALSE, rot.per=0.35, 
            use.r.layout=FALSE, colors=colors)
  
  invisible(list(tdm=tdm, freqTable = d))
}

#++++++++++++++++++++++
# Helper function
#++++++++++++++++++++++
# Download and parse webpage
html_to_text<-function(url){
  library(RCurl)
  library(XML)
  # download html
  html.doc <- getURL(url)  
  #convert to plain text
  doc = htmlParse(html.doc, asText=TRUE)
 # "//text()" returns all text outside of HTML tags.
 # We also don’t want text such as style and script codes
  text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
  # Format text vector into one character string
  return(paste(text, collapse = " "))
}</code></pre>
</div>
</div>
<div id="infos" class="section level1">
<h1>Infos</h1>
<p><span class="warning"> This analysis has been performed using R (ver. 3.1.0). </span></p>
</div>

<script>jQuery(document).ready(function () {
    jQuery('h1').addClass('wiki_paragraph1');
    jQuery('h2').addClass('wiki_paragraph2');
    jQuery('h3').addClass('wiki_paragraph3');
    jQuery('h4').addClass('wiki_paragraph4');
    });//add phpboost class to header</script>
<style>.content{padding:0px;}</style>
</div><!--end rdoc-->
<!--====================== stop here when you copy to sthda================-->


<!-- END HTML -->]]></description>
			<pubDate>Wed, 14 Jan 2015 20:35:02 +0100</pubDate>
			
		</item>
		
		<item>
			<title><![CDATA[Text mining]]></title>
			<link>https://www.sthda.com/english/wiki/text-mining</link>
			<guid>https://www.sthda.com/english/wiki/text-mining</guid>
			<description><![CDATA[This category contains articles about <strong>text mining</strong> and <strong>word clouds</strong>. Scroll down to the bottom of the page. You will find many tutorials on how to generate word clouds using <strong>R software</strong>.]]></description>
			<pubDate>Sun, 11 Jan 2015 12:39:42 +0100</pubDate>
			
		</item>
		
	</channel>
</rss>
