Comments

Comment

Visitor
Very Good Explanation!

Thank you!

Comment

Visitor
Hi Alboukadel,

I am using text mining to get a glimpse of topics and to "check theory" (literature claims) about international meetings over a period of time. I am analyzing specific commissions at the WTO, so an overview of what is going on in these meetings is of great use. I am using a timespan of 10 years, which is a lot to get done by hand, but not for your method.

It would be nice, though, to be able to get some words right, like "market access", "developing countries", "saudi arabia", "carbon capture"... they appear as separate words in the word cloud and frequency tables. I tried to google a fix, but without success.

I found other approaches in books, but they were hard to apply. Yours did the job for me; thanks for posting.
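
For anyone else with the same question about multi-word phrases: one possible workaround (a rough sketch, untested; it assumes the corpus object docs from the article, and the phrase list is only illustrative) is to join known multi-word phrases into single tokens before building the term-document matrix:

Code R :

library(tm)
# join known multi-word phrases into single underscore-joined tokens
# so they survive tokenization as one "word"
join_phrases <- content_transformer(function(x) {
  x <- gsub("market access", "market_access", x, ignore.case = TRUE)
  x <- gsub("developing countries", "developing_countries", x, ignore.case = TRUE)
  x <- gsub("saudi arabia", "saudi_arabia", x, ignore.case = TRUE)
  x <- gsub("carbon capture", "carbon_capture", x, ignore.case = TRUE)
  x
})
docs <- tm_map(docs, join_phrases)
# the joined tokens then show up as single entries in the word cloud and
# frequency tables; gsub("_", " ", d$word) restores the spaces for display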

Comment

Administrator
I would like to thank all of you for your comments. I'll take your requests into account when improving the article.

Comment

Administrator
I'm the original author of this article, which has been republished by Rohan Chikorde on LinkedIn without citing the source. That is plagiarism!

Comment

Visitor
Great article, Kassambara.

I have one question I can't figure out: how can you turn the associated terms into a word cloud as well? That is, create a word cloud based on all the text, find the terms associated with the main word in that cloud, and finally build a word cloud from the main word and its associates.

I know they would all appear the same size, because each occurs only once, but I think it would be really cool if you could create a word cloud based on the associates.
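
For what it's worth, here is a rough sketch of that idea (untested; it assumes the term-document matrix dtm from the article and uses "data" as an example main word; the correlations returned by findAssocs() serve as the word weights, so the associates are not all the same size):

Code R :

library(tm)
library(wordcloud)
# find terms correlated with the chosen main word
assocs <- findAssocs(dtm, terms = "data", corlimit = 0.25)[["data"]]
# plot the main word together with its associates, weighting each
# associate by its correlation with the main word
words   <- c("data", names(assocs))
weights <- c(1, assocs)  # the main word gets the maximum weight
wordcloud(words = words, freq = weights, min.freq = 0, scale = c(3, 0.5))

Note min.freq = 0: the default of 3 would drop all the correlation weights, which are at most 1.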

Comment

Visitor
Thanks for this very informative article! Is it possible to use a similar process to identify top phrases, or to modify a word in the data frame into a phrase? For example, I would want to get the frequency of every instance of "thanks" and "thank you" added together (which the text stemming will accomplish), but in the word cloud display I would like to see "thank you", not just "thank".
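
One simple possibility (a sketch, assuming the frequency data frame d built in Step 4 of the article, where stemming has already merged the variants under the stem "thank") is to relabel the stemmed term just before plotting:

Code R :

library(wordcloud)
# replace the stem with the phrase you want displayed;
# the combined frequency is kept, only the label changes
d$word <- as.character(d$word)
d$word[d$word == "thank"] <- "thank you"
wordcloud(words = d$word, freq = d$freq, min.freq = 1)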

Comment

Visitor
Thanks for this nice introduction.

I agree with Denis that the code to get the word frequency is far from ideal.
I just crashed my computer when applying it to my case of over 70 million characters, despite having 64GB of RAM.

I obtained the expected word frequency table almost instantaneously, and without using much RAM at all, by using data.table:

Code R :

library(data.table)
library(tm)

dtm <- TermDocumentMatrix(docs)

# a TermDocumentMatrix is stored as a sparse triplet matrix
# (i = term index, j = document index, v = count), so we can work
# on the triplets directly instead of densifying the whole matrix
dtml <- dtm
class(dtml) <- 'list'
d <- as.data.table(dtml[c('i', 'j', 'v')])
rownms <- dtml$dimnames$Terms
colnms <- dtml$dimnames$Docs
d[, terms := rownms[i]]
d[, docs  := colnms[j]]

# sum the counts per term and sort in decreasing order of frequency
wfreq <- d[, .(freq = sum(v)), terms][order(-freq)]


Comment

Administrator
Thank you, Denis, for your input. I'll update the article.

Comment

Visitor
Not sure why the code in my previous comment was misformatted; here is the correct version in plain text:

library(slam)   # provides row_sums() for sparse term-document matrices

# function for sorting words in decreasing order of frequency
sort_freq <- function(x){
  srt <- sort(row_sums(x, na.rm = TRUE), decreasing = TRUE)
  frf <- data.frame(word = names(srt), freq = srt, row.names = NULL,
                    check.rows = TRUE, stringsAsFactors = FALSE)
  return(frf)
}

Comment

Visitor
Nice article, thanks for sharing! I'd like to propose an improvement. In Step 4 you use this code:
Code R :
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)

But it runs very slowly, and the final data frame takes a lot of memory. In my projects I use the following:
Code R :
library(tm)         # Framework for text mining applications within R
library(slam)       # Data structures and algorithms for sparse arrays and matrices
# function for sorting words in decreasing order
sort_freq <- function(x){
     srt <- sort(row_sums(x, na.rm = T), decreasing = TRUE)
     frf <- data.frame(word = names(srt), freq = srt, row.names = NULL,
                       check.rows = TRUE,  stringsAsFactors = FALSE)
     return(frf)
}
# create term-document matrix
tdm <- TermDocumentMatrix(df_corpus)
# create data frame with words sorted by frequency
d <- sort_freq(tdm)

This code works waaaay faster, and the resulting data frame takes half as much memory.
Hope this is useful.
Regards,
Denis

Comment

Visitor
It was awesome... I searched a lot on the net, and this is the only one that worked perfectly... keep it up!

Comment

Administrator
Thank you very much! I really appreciate your comments; they motivate me to continue writing!!