Create Document Term Matrix with N-Grams in R
I am using the "tm" package to create a DocumentTermMatrix in R. It works well for unigrams, but I am trying to create a DocumentTermMatrix of n-grams (N = 3 for now) using the tm package and the tokenize_ngrams function from the "tokenizers" package.
But I am not able to create it.
I searched for a possible solution but did not find much help.
For privacy reasons I cannot share the data.
Here is what I have tried:
library(tm)
library(tokenizers)
data is a dataframe with around 4.5k rows and 2 columns namely "doc_id" and "text"
data_corpus = Corpus(DataframeSource(data))
custom function for n-gram tokenization :
ngram_tokenizer = function(x) {
  temp = tokenize_ngrams(x, n_min = 1, n = 3, stopwords = FALSE, ngram_delim = "_")
  return(temp)
}
control list for DTM creation :
1-gram
control_list_unigram = list(tokenize = "words",
removePunctuation = FALSE,
removeNumbers = FALSE,
stopwords = stopwords("english"),
tolower = T,
stemming = T,
weighting = function(x)
weightTf(x)
)
for N-gram tokenization
control_list_ngram = list(tokenize = ngram_tokenizer,
removePunctuation = FALSE,
removeNumbers = FALSE,
stopwords = stopwords("english"),
tolower = T,
stemming = T,
weighting = function(x)
weightTf(x)
)
dtm_unigram = DocumentTermMatrix(data_corpus, control_list_unigram)
dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)
dim(dtm_unigram)
dim(dtm_ngram)
The dimensions of both DTMs were the same.
Please correct me!
r nlp tokenize tm n-gram
edited Nov 8 at 13:35
phiver
asked Nov 8 at 13:12
RANJEET SINGH
1 Answer
Unfortunately tm has some quirks that are annoying and not always obvious. First of all, custom tokenizing doesn't seem to work on corpora created with Corpus. You need to use VCorpus for this.
So change the line data_corpus = Corpus(DataframeSource(data)) to data_corpus = VCorpus(DataframeSource(data)).
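To check which kind of corpus you ended up with, a quick sketch like the following can help. The toy data frame is made up for illustration (it only mimics the "doc_id" and "text" columns from the question), and the class names printed for Corpus() may depend on your tm version.

```r
library(tm)

# Hypothetical stand-in for the real data frame (columns doc_id and text)
toy_data <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("the quick brown fox jumps", "the lazy dog sleeps all day"),
  stringsAsFactors = FALSE
)

corpus_default  <- Corpus(DataframeSource(toy_data))
corpus_volatile <- VCorpus(DataframeSource(toy_data))

# Inspect the classes; the VCorpus is the one to pass on when a custom tokenizer is used
class(corpus_default)
class(corpus_volatile)
```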
That is one issue tackled. The corpus will now work for tokenizing, but you will run into an issue with tokenize_ngrams. You will get the following error:
Input must be a character vector of any length or a list of character
vectors, each of which has a length of 1.
when you run this line: dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)
To solve this, and to avoid a dependency on the tokenizers package, you can use the following function to tokenize the data.
NLP_tokenizer <- function(x)
unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
This uses the ngrams function from the NLP package which is loaded when you load the tm package. 1:3 tells it to create ngrams from 1 to 3 words. So your control_list_ngram should look like this:
control_list_ngram = list(tokenize = NLP_tokenizer,
removePunctuation = FALSE,
removeNumbers = FALSE,
stopwords = stopwords("english"),
tolower = T,
stemming = T,
weighting = function(x)
weightTf(x)
)
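Putting the pieces together, here is a minimal end-to-end sketch on made-up data. The toy data frame is an assumption for illustration only, and stemming and stopword removal are left out to keep the example small (stemming would need the SnowballC package installed). If everything works, the n-gram DTM should have more columns than the unigram one.

```r
library(tm)

# Hypothetical toy data standing in for the real data frame
toy_data <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("the quick brown fox jumps", "the lazy dog sleeps all day"),
  stringsAsFactors = FALSE
)

# VCorpus rather than Corpus, so that the custom tokenizer is used
toy_corpus <- VCorpus(DataframeSource(toy_data))

# Build 1- to 3-grams and join the words of each n-gram with "_"
NLP_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
}

dtm_unigram <- DocumentTermMatrix(toy_corpus, control = list(tolower = TRUE))
dtm_ngram   <- DocumentTermMatrix(
  toy_corpus,
  control = list(tokenize = NLP_tokenizer, tolower = TRUE)
)

# Compare the dimensions and look at a few of the n-gram terms
dim(dtm_unigram)
dim(dtm_ngram)
head(Terms(dtm_ngram))
```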
Personally I would use the quanteda package for all of this work. But for now this should help you.
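For comparison, a rough sketch of the same idea in quanteda might look like this. It assumes a reasonably recent quanteda version and again uses a made-up data frame. Treat it as a starting point rather than a drop-in replacement for the tm code above.

```r
library(quanteda)

# Hypothetical toy data standing in for the real data frame
toy_data <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("the quick brown fox jumps", "the lazy dog sleeps all day"),
  stringsAsFactors = FALSE
)

toy_corp <- corpus(toy_data, docid_field = "doc_id", text_field = "text")

# Tokenize, lowercase, then build 1- to 3-grams joined with "_"
toks       <- tokens_tolower(tokens(toy_corp))
toks_ngram <- tokens_ngrams(toks, n = 1:3, concatenator = "_")

# Document-feature matrix (quanteda's counterpart to a DocumentTermMatrix)
dfm_ngram <- dfm(toks_ngram)
dim(dfm_ngram)
```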
I was looking for something similar. I haven't used quanteda so far, but I will definitely explore it further. Thanks for the help!
– RANJEET SINGH
Nov 12 at 6:26
answered Nov 9 at 13:31
phiver