Comments

Comment

Visitor
Very informative and well written article. Thanks.
content_transformer is available in tm version 0.6. I had some difficulty figuring that out because my RStudio was installing tm version 0.5.x.
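
For anyone running into the same issue, a quick version check along these lines may help. This is only a sketch: it assumes tm is already installed, and the variable names are illustrative.

```r
# Check which version of tm is installed (assumes tm is already installed)
tm_version <- packageVersion("tm")
print(tm_version)

# The comment above reports content_transformer() as available in tm 0.6
has_content_transformer <- tm_version >= "0.6"

if (!has_content_transformer) {
  message("tm ", as.character(tm_version),
          " is older than 0.6; consider install.packages(\"tm\") to update")
}
```

If the check reports an older version, reinstalling tm from CRAN in a fresh R session is usually the simplest thing to try.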

Comment

Visitor
Why is it that I get "everi" instead of "everything" or "abl" instead of "able"?

Comment

Administrator
If you want to avoid this, skip the text stemming step described in the article:

Quotation :
Another important preprocessing step is text stemming, which reduces words to their root form. In other words, this process removes suffixes from words to simplify them and recover their common origin. For example, a stemming process reduces the words “moving”, “moved” and “movement” to the root word, “move”.


Let me know if it works for you.

Good luck!
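
To see what the stemmer does to individual words, and what the pipeline looks like with stemming left out, something like the following may be useful. It is a minimal sketch, assuming the tm and SnowballC packages are installed; the sample words and sentence are made up for illustration, and the exact stems depend on the stemmer version.

```r
library(tm)         # corpus handling
library(SnowballC)  # provides wordStem(), the stemmer behind stemDocument()

# Illustrative words only; stems are not guaranteed to be dictionary words
words <- c("everything", "every", "able", "moving", "moved", "movement")
stems <- wordStem(words, language = "english")
print(data.frame(word = words, stem = stems))

# Same kind of cleaning as in the article, but without the stemming step
docs <- VCorpus(VectorSource("Everything is moving and we are able to move"))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
# docs <- tm_map(docs, stemDocument)  # deliberately skipped

no_stem_text <- content(docs[[1]])
print(no_stem_text)
```

With the stemDocument line commented out, the words reach the term-document matrix in their full form.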

Comment

Visitor
Hi,

There is some mojibake at line 16 in the text file. Could you fix it?

Best, Darli

Comment

Administrator
@Darli: Thank you very much!! I fixed it and the mojibake is now removed.
AK

Comment

Visitor
Hi,
I want to manually assign the number of occurrences of words. I have a text file with 200 distinct words (one word per line) and another file with 200 numbers. How can I assign these numbers as the frequencies of the words? The words come from different European languages, and some of them get broken up when displayed in the word cloud. What should I do to display the complete words?

Thanks
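
One possible approach to the question above, sketched under several assumptions: both files have one entry per line in the same order, the word file is UTF-8 encoded (a wrong encoding is a common cause of broken words, but this is a guess), and the wordcloud package is installed. The temporary files and sample words are hypothetical stand-ins for the real 200-line files.

```r
# Hypothetical stand-ins for the two files described above
words_file <- tempfile(fileext = ".txt")
freqs_file <- tempfile(fileext = ".txt")
sample_words <- enc2utf8(c("se\u00f1or", "\u00fcber", "gar\u00e7on"))
writeLines(sample_words, words_file, useBytes = TRUE)
writeLines(c("12", "7", "3"), freqs_file)

# Read the words as UTF-8 so accented characters are kept intact
words <- readLines(words_file, encoding = "UTF-8")
freq  <- as.numeric(readLines(freqs_file))
stopifnot(length(words) == length(freq))

# Pair each word with its manually assigned frequency
d <- data.frame(word = words, freq = freq, stringsAsFactors = FALSE)
print(d)

# Draw the cloud directly from the data frame (interactive sessions only)
if (interactive() && requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = d$word, freq = d$freq,
                       min.freq = 1, random.order = FALSE)
}
```

Because wordcloud() accepts words and frequencies directly, no corpus or term-document matrix is needed in this case.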

Comment

Visitor
You are awesome, man. Things are good here. Great job!!

Comment

Administrator
Thank you very much! I really appreciate your comments; they motivate me to continue writing!!

Comment

Visitor
It was awesome... I searched a lot on the net, and this is the only one that worked perfectly... keep it up!

Comment

Visitor
Nice article, thanks for sharing! I’d like to propose an improvement. In Step 4 you use this code:
Code R :
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

But it runs very slowly, and the final data frame takes a lot of memory. In my projects I use the following:
Code R :
library(tm)         # Framework for text mining applications within R
library(slam)       # Data structures and algorithms for sparse arrays and matrices
# function for sorting words in decreasing order
sort_freq <- function(x) {
  # row_sums() from slam sums the rows of the sparse matrix directly
  srt <- sort(row_sums(x, na.rm = TRUE), decreasing = TRUE)
  frf <- data.frame(word = names(srt), freq = srt, row.names = NULL,
                    check.rows = TRUE, stringsAsFactors = FALSE)
  return(frf)
}
# create term-document matrix
tdm <- TermDocumentMatrix(df_corpus)
# create data frame with words sorted by frequency
d <- sort_freq(tdm)

This code runs much faster, and the resulting data frame takes about half as much memory.
Hope this may be useful.
Regards,
Denis

Comment

Visitor
I'm not sure why the code in my previous comment was misformatted; here is the correct version as plain text:

# function for sorting words in decreasing order
sort_freq <- function(x) {
  srt <- sort(row_sums(x, na.rm = TRUE), decreasing = TRUE)
  frf <- data.frame(word = names(srt), freq = srt, row.names = NULL,
                    check.rows = TRUE, stringsAsFactors = FALSE)
  return(frf)
}
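
To check that the two approaches discussed in this thread agree, a small comparison along these lines can be run. It is a sketch on a made-up two-document corpus, assuming tm and slam are installed. The dense route converts the sparse term-document matrix to a full matrix first, which is likely where the extra time and memory go on large corpora.

```r
library(tm)    # TermDocumentMatrix()
library(slam)  # row_sums() for sparse matrices

# Tiny illustrative corpus (not the article's data)
docs <- VCorpus(VectorSource(c("text mining with r",
                               "word cloud with r and text")))
tdm <- TermDocumentMatrix(docs)

# Approach from the article: densify, then sum rows
dense <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Approach from the comment above: sum rows of the sparse matrix
sparse <- sort(row_sums(tdm, na.rm = TRUE), decreasing = TRUE)

# Compare term by term, independent of how ties were ordered
dense_by_name  <- dense[order(names(dense))]
sparse_by_name <- sparse[order(names(sparse))]
same_counts <- isTRUE(all.equal(as.numeric(dense_by_name),
                                as.numeric(sparse_by_name)))
print(same_counts)
```

On a corpus this small there is no meaningful speed difference; the comparison only confirms that both routes produce the same frequencies.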

Comment

Administrator
Thank you, Denis, for your input. I'll update the article.