Create Document Term Matrix with N-Grams in R

I am using the "tm" package to create a DocumentTermMatrix in R. It works well for unigrams, but I am trying to create a DocumentTermMatrix of n-grams (n = 3 for now) using the tm package and the tokenize_ngrams function from the "tokenizers" package, and I have not been able to get it to work.

I searched for a possible solution but did not find much help.
For privacy reasons I cannot share the data.
Here is what I have tried:

library(tm) 
library(tokenizers)

data is a data frame with around 4.5k rows and 2 columns, namely "doc_id" and "text":

data_corpus = Corpus(DataframeSource(data))

Custom function for n-gram tokenization:

ngram_tokenizer = function(x) {
 temp = tokenize_ngrams(x, n_min = 1, n = 3, stopwords = FALSE, ngram_delim = "_")
 return(temp)
}

Control lists for DTM creation:

For 1-gram tokenization:

control_list_unigram = list(tokenize = "words",
 removePunctuation = FALSE,
 removeNumbers = FALSE, 
 stopwords = stopwords("english"), 
 tolower = T, 
 stemming = T, 
 weighting = function(x)
 weightTf(x)
)

For n-gram tokenization:

control_list_ngram = list(tokenize = ngram_tokenizer,
 removePunctuation = FALSE,
 removeNumbers = FALSE, 
 stopwords = stopwords("english"), 
 tolower = T, 
 stemming = T, 
 weighting = function(x)
 weightTf(x)
 )


dtm_unigram = DocumentTermMatrix(data_corpus, control_list_unigram)
dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)

dim(dtm_unigram)
dim(dtm_ngram)

The dimensions of both DTMs were the same, so the n-gram tokenizer does not seem to have been applied.

Please correct me!

edited Nov 8 at 13:35

phiver

11.9k92634

asked Nov 8 at 13:12

RANJEET SINGH

r nlp tokenize tm n-gram

1 Answer

Unfortunately, tm has some quirks that are annoying and not always obvious. First of all, a custom tokenizer doesn't seem to be applied to corpora created with Corpus. You need to use VCorpus for this.

So change the line data_corpus = Corpus(DataframeSource(data)) to data_corpus = VCorpus(DataframeSource(data)).
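
A quick way to see which kind of corpus you ended up with is to look at its class. This is only a minimal sketch on a toy data frame (the contents are made up, since the real data can't be shared). My understanding is that in recent tm versions Corpus() on a DataframeSource gives you a SimpleCorpus rather than a VCorpus, but treat that as an assumption and check the output on your own installation:

```r
library(tm)

# Hypothetical stand-in for the real data: two columns, "doc_id" and "text"
data <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("the quick brown fox", "the lazy dog sleeps"),
  stringsAsFactors = FALSE
)

corpus_default  <- Corpus(DataframeSource(data))
corpus_volatile <- VCorpus(DataframeSource(data))

# Compare the classes of the two corpus objects
class(corpus_default)
class(corpus_volatile)
```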

That is one issue tackled. The corpus will now work for tokenizing, but you will run into an issue with tokenize_ngrams. You will get the following error:

Input must be a character vector of any length or a list of character
 vectors, each of which has a length of 1.

when you run this line: dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)

To solve this, and to avoid a dependency on the tokenizers package, you can use the following function to tokenize the data.

NLP_tokenizer <- function(x) 
 unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)

This uses the ngrams function from the NLP package, which is loaded when you load the tm package. 1:3 tells it to create n-grams of 1 to 3 words. So your control_list_ngram should look like this:

control_list_ngram = list(tokenize = NLP_tokenizer,
 removePunctuation = FALSE,
 removeNumbers = FALSE, 
 stopwords = stopwords("english"), 
 tolower = T, 
 stemming = T, 
 weighting = function(x)
 weightTf(x)
 )
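
Putting both changes together, here is a small end-to-end sketch. The data frame is a made-up stand-in for the real data, and the control lists are cut down to just the tokenizer so the effect is easy to see; everything else is the VCorpus change and the NLP_tokenizer function from above:

```r
library(tm)  # loading tm also makes ngrams() and words() from NLP available

# Hypothetical toy data with the same two columns as the real data
data <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("the quick brown fox", "the lazy dog sleeps"),
  stringsAsFactors = FALSE
)

# VCorpus instead of Corpus, so the custom tokenizer is applied
data_corpus <- VCorpus(DataframeSource(data))

# Builds 1- to 3-grams and joins the words of each n-gram with "_"
NLP_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
}

# Default tokenizer (single words) versus the n-gram tokenizer
dtm_unigram <- DocumentTermMatrix(data_corpus)
dtm_ngram   <- DocumentTermMatrix(data_corpus, control = list(tokenize = NLP_tokenizer))

dim(dtm_unigram)
dim(dtm_ngram)
```

If the tokenizer is picked up, dtm_ngram should have more columns than dtm_unigram, and Terms(dtm_ngram) should contain entries such as "the_quick".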

Personally, I would use the quanteda package for all of this work, but for now this should help you.
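
For comparison, a rough sketch of what the same thing could look like in quanteda. This is not a drop-in replacement for the tm code above: the toy data frame is made up, and stopword removal, stemming and weighting are left out. It only shows the n-gram step, using corpus(), tokens(), tokens_ngrams() and dfm():

```r
library(quanteda)

# Hypothetical toy data with the same two columns as the real data
data <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("the quick brown fox", "the lazy dog sleeps"),
  stringsAsFactors = FALSE
)

corp <- corpus(data, docid_field = "doc_id", text_field = "text")
toks <- tokens(corp)

# 1- to 3-grams, words joined with "_"
toks_ngram <- tokens_ngrams(toks, n = 1:3, concatenator = "_")

# Document-feature matrix, quanteda's counterpart of a document-term matrix
dfm_ngram <- dfm(toks_ngram)
dim(dfm_ngram)
```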

answered Nov 9 at 13:31

phiver

11.9k92634

I was looking for something similar. I haven't used quanteda so far, but I will definitely explore it further. Thanks for the help!
– RANJEET SINGH
Nov 12 at 6:26
