Create Document Term Matrix with N-Grams in R









I am using the "tm" package to create a DocumentTermMatrix in R. It works well for unigrams, but I am trying to create a DocumentTermMatrix of n-grams (N = 3 for now) using the tm package and the tokenize_ngrams function from the "tokenizers" package, and I am not able to create it.

I searched for a possible solution but did not find much help.
For privacy reasons I cannot share the data.
Here is what I have tried:



    library(tm)
    library(tokenizers)


data is a data frame with around 4.5k rows and 2 columns, namely "doc_id" and "text".


    data_corpus = Corpus(DataframeSource(data))


Custom function for n-gram tokenization:


    ngram_tokenizer = function(x) {
      temp = tokenize_ngrams(x, n_min = 1, n = 3, stopwords = FALSE, ngram_delim = "_")
      return(temp)
    }



Control lists for DTM creation:

For 1-gram tokenization:


    control_list_unigram = list(tokenize = "words",
                                removePunctuation = FALSE,
                                removeNumbers = FALSE,
                                stopwords = stopwords("english"),
                                tolower = TRUE,
                                stemming = TRUE,
                                weighting = function(x) weightTf(x))


For n-gram tokenization:


    control_list_ngram = list(tokenize = ngram_tokenizer,
                              removePunctuation = FALSE,
                              removeNumbers = FALSE,
                              stopwords = stopwords("english"),
                              tolower = TRUE,
                              stemming = TRUE,
                              weighting = function(x) weightTf(x))


    dtm_unigram = DocumentTermMatrix(data_corpus, control_list_unigram)
    dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)

    dim(dtm_unigram)
    dim(dtm_ngram)


The dimensions of both DTMs were the same.

Please correct me!










      r nlp tokenize tm n-gram






      edited Nov 8 at 13:35 by phiver

      asked Nov 8 at 13:12 by RANJEET SINGH






















          1 Answer













          Unfortunately, tm has some quirks that are annoying and not always clear. First of all, custom tokenizing doesn't seem to work on corpora created with Corpus. You need to use VCorpus for this. This is most likely why both of your matrices had the same dimensions: the custom tokenizer was never applied.

          So change the line data_corpus = Corpus(DataframeSource(data)) to data_corpus = VCorpus(DataframeSource(data)).

          That is one issue tackled. The corpus will now work for tokenizing, but you will run into an issue with tokenize_ngrams. You will get the following error:



          Input must be a character vector of any length or a list of character
          vectors, each of which has a length of 1.


          when you run this line: dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)
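
          The error presumably arises because tm hands the tokenizer a document object rather than a plain character vector. If you would rather keep the tokenizers package, a minimal sketch of an adapted function is shown here. It assumes that as.character() on the tm document returns its text; the one-row data frame docs is made up for illustration.

```r
library(tm)
library(tokenizers)

ngram_tokenizer <- function(x) {
  # as.character() extracts the text from the tm document (assumption);
  # tokenize_ngrams() returns a list, so flatten it to a character vector
  unlist(
    tokenize_ngrams(as.character(x), n = 3, n_min = 1, ngram_delim = "_"),
    use.names = FALSE
  )
}

# Hypothetical one-document example, not the asker's data
docs <- data.frame(doc_id = "1", text = "the quick brown fox",
                   stringsAsFactors = FALSE)
dtm <- DocumentTermMatrix(VCorpus(DataframeSource(docs)),
                          control = list(tokenize = ngram_tokenizer))
Terms(dtm)
```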



          To solve this without depending on the tokenizers package, you can use the following function to tokenize the data.



              NLP_tokenizer <- function(x)
                unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)



          This uses the ngrams function from the NLP package which is loaded when you load the tm package. 1:3 tells it to create ngrams from 1 to 3 words. So your control_list_ngram should look like this:



              control_list_ngram = list(tokenize = NLP_tokenizer,
                                        removePunctuation = FALSE,
                                        removeNumbers = FALSE,
                                        stopwords = stopwords("english"),
                                        tolower = TRUE,
                                        stemming = TRUE,
                                        weighting = function(x) weightTf(x))
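
          Putting the pieces together, a minimal end-to-end sketch could look like the following. The two-document data frame and the stripped-down control lists are assumptions for illustration only, not the asker's data or settings.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Toy stand-in for the real data: DataframeSource expects "doc_id" and "text"
data <- data.frame(
  doc_id = c("1", "2"),
  text   = c("the quick brown fox", "the lazy dog sleeps"),
  stringsAsFactors = FALSE
)

# VCorpus rather than Corpus, so that the custom tokenizer is applied
data_corpus <- VCorpus(DataframeSource(data))

# Build 1- to 3-grams and join the words of each n-gram with "_"
NLP_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
}

dtm_unigram <- DocumentTermMatrix(data_corpus, control = list(tokenize = "words"))
dtm_ngram   <- DocumentTermMatrix(data_corpus, control = list(tokenize = NLP_tokenizer))

dim(dtm_unigram)
dim(dtm_ngram)  # should have more columns than the unigram matrix
```

          If the two dim() results still match, checking class(data_corpus) is a quick way to confirm that a VCorpus is actually in use.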


          Personally I would use the quanteda package for all of this work. But for now this should help you.
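
          For comparison, a rough sketch of the same idea in quanteda, assuming the package is installed. The two example texts are made up, and this is only one way to do it, not a drop-in replacement for the tm code above.

```r
library(quanteda)

# Hypothetical example texts, named so the documents get ids
txt <- c(d1 = "the quick brown fox", d2 = "the lazy dog sleeps")

toks    <- tokens(txt)                                        # unigram tokens
toks_ng <- tokens_ngrams(toks, n = 1:3, concatenator = "_")   # 1- to 3-grams

dfm_uni <- dfm(toks)     # document-feature matrix of unigrams
dfm_ng  <- dfm(toks_ng)  # document-feature matrix of 1- to 3-grams

dim(dfm_uni)
dim(dfm_ng)
```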






          • I was looking for something similar. I haven't used quanteda so far, but I will definitely explore it further. Thanks for the help!
            – RANJEET SINGH
            Nov 12 at 6:26










          answered Nov 9 at 13:31 by phiver










