Writing an AND query to find matching documents within a dataset (Python)










I am trying to write a function called 'and_query' that takes a single string of one or more words as input and returns a list of the documents whose abstracts contain those words.



First, I put all the words in an inverted index. In Abstracts, the id is the id of the document and the abstract is its plain text.



    from collections import defaultdict

    inverted_index = defaultdict(set)

    for (id, abstract) in Abstracts.items():
        for term in preprocess(tokenize(abstract)):
            inverted_index[term].add(id)
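
A minimal, self-contained sketch of this indexing step follows. The sample abstracts are made up, and `tokenize` and `preprocess` are hypothetical stand-ins (whitespace splitting and lowercasing) for the real helpers, which are not shown in the question.

```python
from collections import defaultdict

# Made-up sample data; the real Abstracts mapping is not shown in the question.
abstracts = {
    1: "Vaccine trial in the Netherlands",
    2: "Vaccine safety review",
    3: "Netherlands trial of a new vaccine",
}

def tokenize(text):
    # Hypothetical stand-in: split on whitespace.
    return text.split()

def preprocess(tokens):
    # Hypothetical stand-in: lowercase every token.
    return [t.lower() for t in tokens]

def build_index(docs):
    # Map every term to the set of ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, abstract in docs.items():
        for term in preprocess(tokenize(abstract)):
            index[term].add(doc_id)
    return index

inverted_index = build_index(abstracts)
print(sorted(inverted_index["vaccine"]))  # ids of documents containing "vaccine"
```

Each term ends up mapped to a set of ids, which is what makes set operations a natural fit for the query step.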


Then, I wrote a query function where finals is a list of all the matching documents.



Because it should only return documents that contain every word of the query string, I used the set operation 'intersection'.



    def and_query(tokens):
        documents = set()
        finals = []
        terms = preprocess(tokenize(tokens))

        for term in terms:
            for i in inverted_index[term]:
                documents.add(i)

        for term in terms:
            temporary_set = set()
            for i in inverted_index[term]:
                temporary_set.add(i)
            finals.extend(documents.intersection(temporary_set))
        return finals

    def finals_print(finals):
        for final in finals:
            display_summary(final)

    finals_print(and_query("netherlands vaccine trial"))


However, the function still seems to return documents in which only one of the words appears in the abstract.

Does anyone know what I did wrong with my set operations?

(I think the fault is somewhere in this part of the code):



        for term in terms:
            temporary_set = set()
            for i in inverted_index[term]:
                temporary_set.add(i)
            finals.extend(documents.intersection(temporary_set))
        return finals
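
A small experiment with a made-up index shows what this loop computes. Since `documents` holds the union of every term's id set, intersecting it with a single term's set simply gives that set back, so the list ends up containing every document that matches at least one term. The index contents below are assumptions for illustration only.

```python
# Made-up index for illustration: term -> set of document ids.
inverted_index = {
    "netherlands": {1, 3},
    "vaccine": {1, 2, 3},
    "trial": {1, 3, 4},
}
terms = ["netherlands", "vaccine", "trial"]

# First loop of the function: the union of all id sets.
documents = set()
for term in terms:
    documents |= inverted_index[term]

# Second loop: intersecting the union with one term's set returns that set
# unchanged, so extending a list with each result rebuilds the union
# (with duplicates).
finals = []
for term in terms:
    assert documents & inverted_index[term] == inverted_index[term]
    finals.extend(documents & inverted_index[term])

print(sorted(set(finals)))  # every document that contains at least one term
```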


Thanks in advance



Basically, what I want to do, in short:

    for word in words:
        id_set_for_one_word = set()
        for i in get_id_of_that_word[word]:
            id_set_for_one_word.add(i)
        # pseudo:
        id_set_for_one_word intersection (id_set_of_other_words)

    finals.extend(set of all intersections for all words)

So I need the intersection of the id sets of all of these words, which gives the set of ids that exist for every word in words.
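
The operation described here can be sketched by collecting one id set per word and passing them all to `set.intersection`. This assumes the index maps each term to a set of ids; the sample index and the helper name `and_ids` are made up for illustration.

```python
def and_ids(words, index):
    """Return the ids that appear in the id set of every word."""
    # .get() avoids creating empty entries when the index is a defaultdict.
    id_sets = [index.get(word, set()) for word in words]
    if not id_sets:
        return set()  # no words given, so nothing to match
    return set.intersection(*id_sets)

# Made-up index: term -> set of document ids.
sample_index = {
    "netherlands": {1, 3},
    "vaccine": {1, 2, 3},
    "trial": {1, 3, 4},
}

print(and_ids(["netherlands", "vaccine", "trial"], sample_index))
```

A word that is missing from the index contributes an empty set, so the whole intersection becomes empty, which matches the AND semantics.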










  • Could you provide some input data to be able to test the code?
    – Franco Piccolo, Nov 11 '18 at 15:48

  • Not really, actually. A lot of preprocessing and other operations are performed before the data is actually used for querying, and a lot of modules have to be imported to make it work. It would be a lot of work to provide that here.
    – Jorian Onderwater, Nov 11 '18 at 16:15

  • I updated my question with a sort of pseudocode to make it somewhat clearer what I'm trying to do.
    – Jorian Onderwater, Nov 11 '18 at 16:33

  • TL;DR, but if you want to ‘and’ several criteria so that only matching abstracts are returned, then I would: 1. prep in advance, outside the matchers; 2. call the matchers in sequence, passing in the list of abstracts; 3. prune non-matching abstracts within each simple matcher function. Having ‘extend’ here is a code smell for me.
    – JL Peyret, Nov 11 '18 at 17:34























python set set-intersection






asked Nov 11 '18 at 15:03 by Jorian Onderwater (edited Nov 11 '18 at 16:32)




















3 Answers















Question: "returns a list of matching documents for the words being in the abstracts of the documents"

The term with the fewest documents bounds the result: every match must be among its documents, so start from that set and intersect it with the sets of the other terms.

If a term does not exist in inverted_index, there is no match at all.

For the sake of simplicity, predefined data:

    Abstracts = {1: 'Lorem ipsum dolor sit amet,',
                 2: 'consetetur sadipscing elitr,',
                 3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
                 4: 'sed diam voluptua.',
                 5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
                 6: 'Stet clita kasd gubergren,',
                 7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',
                 }

    inverted_index = {
        'Stet': {6}, 'ipsum': {1, 7}, 'erat,': {3}, 'ut': {3}, 'dolores': {5},
        'gubergren,': {6}, 'kasd': {6}, 'ea': {5}, 'consetetur': {2}, 'sit': {1, 7},
        'nonumy': {3}, 'voluptua.': {4}, 'est': {7}, 'elitr,': {2}, 'At': {5},
        'rebum.': {5}, 'magna': {3}, 'sadipscing': {2}, 'diam': {3, 4}, 'dolore': {3},
        'sanctus': {7}, 'labore': {3}, 'sed': {3, 4}, 'takimata': {7}, 'Lorem': {1, 7},
        'invidunt': {3}, 'aliquyam': {3}, 'accusam': {5}, 'duo': {5}, 'amet.': {7},
        'et': {3, 5}, 'sea': {7}, 'dolor': {1, 7}, 'vero': {5}, 'no': {7},
        'eos': {5}, 'tempor': {3}, 'amet,': {1}, 'clita': {6}, 'justo': {5},
        'eirmod': {3},
    }

    def and_query(tokens):
        print("tokens:{}".format(tokens))
        #terms = preprocess(tokenize(tokens))
        terms = tokens.split()

        term_min = None
        for term in terms:
            if term in inverted_index:
                # Find min
                if not term_min or term_min[0] > len(inverted_index[term]):
                    term_min = (len(inverted_index[term]), term)
            else:
                # Break early, if a term is not in inverted_index
                return set()

        if term_min is None:
            # Empty query: nothing to match
            return set()

        # Start from the smallest set and keep only the ids present for every term
        finals = set(inverted_index[term_min[1]])
        for term in terms:
            finals &= inverted_index[term]
        print("term_min:{} inverted_index:{}".format(term_min, finals))
        return finals


    def finals_print(finals):
        if finals:
            for final in finals:
                print("Document [{}]:{}".format(final, Abstracts[final]))
        else:
            print("No matching Document found")

    if __name__ == "__main__":
        for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
            finals_print(and_query(tokens))
            print()

Output:

    tokens:sed diam voluptua.
    term_min:(1, 'voluptua.') inverted_index:{4}
    Document [4]:sed diam voluptua.

    tokens:Lorem ipsum dolor
    term_min:(2, 'Lorem') inverted_index:{1, 7}
    Document [1]:Lorem ipsum dolor sit amet,
    Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

    tokens:Lorem ipsum dolor test
    No matching Document found

Tested with Python: 3.4.2
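
To illustrate why an intersection is still needed after picking the smallest document set, here is a short sketch that reuses a few entries of the sample index from this answer. The helper name `and_ids_smallest_first` is hypothetical.

```python
# A few entries taken from the sample index in this answer.
inverted_index = {
    "Lorem": {1, 7},
    "ipsum": {1, 7},
    "et": {3, 5},
    "sed": {3, 4},
    "diam": {3, 4},
}

def and_ids_smallest_first(terms, index):
    # A term missing from the index means no document can match.
    if not terms or any(term not in index for term in terms):
        return set()
    # Start from the smallest set and shrink it with every other set.
    ordered = sorted((index[term] for term in terms), key=len)
    result = set(ordered[0])
    for ids in ordered[1:]:
        result &= ids
        if not result:
            break  # nothing left to match
    return result

# "Lorem" and "et" each occur in two documents, but never in the same one.
print(and_ids_smallest_first(["Lorem", "et"], inverted_index))
print(and_ids_smallest_first(["sed", "diam", "et"], inverted_index))
```

Starting with the smallest set keeps the intermediate result as small as possible, but on its own that set can still contain documents that lack the other terms.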




















I eventually found the solution myself. Replacing

            finals.extend(documents.intersection(id_set_for_one_word))
        return finals

with

            documents = documents.intersection(id_set_for_one_word)
        return documents

seems to work here.

Still, thanks for all the effort, y'all.
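
Put together, the corrected function might look like the sketch below. The index contents are made up, and `tokens.lower().split()` stands in for the question's `preprocess(tokenize(...))`, which is not shown.

```python
from collections import defaultdict

# Made-up index; in the question it is built from Abstracts.
inverted_index = defaultdict(set, {
    "netherlands": {1, 3},
    "vaccine": {1, 2, 3},
    "trial": {1, 3, 4},
})

def and_query(tokens):
    # Stand-in for preprocess(tokenize(tokens)) from the question.
    terms = tokens.lower().split()

    # Union of the id sets of all terms, as in the question.
    documents = set()
    for term in terms:
        documents |= inverted_index.get(term, set())

    # Narrow the candidates down one term at a time.
    for term in terms:
        documents = documents.intersection(inverted_index.get(term, set()))
    return documents

print(and_query("netherlands vaccine trial"))
```

Because `documents` is reassigned on every pass, each term can only remove ids, so what remains are the documents that contain all terms. Note that this version returns a set rather than a list.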






To elaborate on my code smell comment, here's a rough draft of what I have done before to solve this kind of problem.

    def tokenize(abstract):
        # return <set of words in abstract>
        set_ = ...  # left out in this draft
        return set_

    candidates = [(id, abstract, tokenize(abstract)) for id, abstract in Abstracts.items()]

    all_criterias = "netherlands vaccine trial".split()

    def searcher(candidates, criteria, match_on_found=True):
        search_results = []
        for cand in candidates:
            # cand[2] has a set of tokens or somesuch... abstract.
            found = criteria in cand[2]
            # match_on_found=False gives an AND NOT, if you wanted that
            if found == match_on_found:
                search_results.append(cand)
        return search_results

    for criteria in all_criterias:
        # pass in the full list every time, but it gets progressively shrunk
        candidates = searcher(candidates, criteria)

    # what's left is what you want
    answer = [(cand[0], cand[1]) for cand in candidates]
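
For reference, a runnable version of this progressive filtering idea could look like the following. The abstracts and the whitespace-based `tokenize` are made-up placeholders, since the draft leaves the tokenizer out.

```python
# Made-up sample data standing in for Abstracts.
abstracts = {
    1: "vaccine trial in the netherlands",
    2: "vaccine safety review",
    3: "netherlands trial of a new vaccine",
    4: "trial of a new drug",
}

def tokenize(abstract):
    # Placeholder tokenizer: lowercase words split on whitespace.
    return set(abstract.lower().split())

def searcher(candidates, criterion, match_on_found=True):
    # Keep candidates whose token set contains the criterion
    # (or lacks it, when match_on_found is False).
    return [c for c in candidates if (criterion in c[2]) == match_on_found]

candidates = [(doc_id, text, tokenize(text)) for doc_id, text in abstracts.items()]
for criterion in "netherlands vaccine trial".split():
    # The candidate list shrinks with every criterion.
    candidates = searcher(candidates, criterion)

answer = [(doc_id, text) for doc_id, text, _ in candidates]
print([doc_id for doc_id, _ in answer])
```

This approach scans the remaining candidates once per criterion and needs no inverted index, which keeps it simple at the cost of more work per query on a large collection.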






answered Nov 11 '18 at 17:57 by JL Peyret (edited Nov 11 '18 at 18:02)























                0















                Question: returns a list of matching documents for the words being in the abstracts of the documents




                The term with the min number of documents, hold always the result.

                If a term does not exists in inverted_index, gives no match at all.



                For the sake of simplicity, predefined data:



                Abstracts = 1: 'Lorem ipsum dolor sit amet,',
                2: 'consetetur sadipscing elitr,',
                3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
                4: 'sed diam voluptua.',
                5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
                6: 'Stet clita kasd gubergren,',
                7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',



                inverted_index = 'Stet': 6, 'ipsum': 1, 7, 'erat,': 3, 'ut': 3, 'dolores': 5, 'gubergren,': 6, 'kasd': 6, 'ea': 5, 'consetetur': 2, 'sit': 1, 7, 'nonumy': 3, 'voluptua.': 4, 'est': 7, 'elitr,': 2, 'At': 5, 'rebum.': 5, 'magna': 3, 'sadipscing': 2, 'diam': 3, 4, 'dolore': 3, 'sanctus': 7, 'labore': 3, 'sed': 3, 4, 'takimata': 7, 'Lorem': 1, 7, 'invidunt': 3, 'aliquyam': 3, 'accusam': 5, 'duo': 5, 'amet.': 7, 'et': 3, 5, 'sea': 7, 'dolor': 1, 7, 'vero': 5, 'no': 7, 'eos': 5, 'tempor': 3, 'amet,': 1, 'clita': 6, 'justo': 5, 'eirmod': 3

                def and_query(tokens):
                print("tokens:".format(tokens))
                #terms = preprocess(tokenize(tokens))
                terms = tokens.split()

                term_min = None
                for term in terms:
                if term in inverted_index:
                # Find min
                if not term_min or term_min[0] > len(inverted_index[term]):
                term_min = (len(inverted_index[term]), term)
                else:
                # Break early, if a term is not in inverted_index
                return set()

                finals = inverted_index[term_min[1]]
                print("term_min: inverted_index:".format(term_min, finals))
                return finals


                def finals_print(finals):
                if finals:
                for final in finals:
                print("Document []:".format(final, Abstracts[final]))
                else:
                print("No matching Document found")

                if __name__ == "__main__":
                for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
                finals_print(and_query(tokens))
                print()



                Output:



                tokens:sed diam voluptua.
                term_min:(1, 'voluptua.') inverted_index:4
                Document [4]:sed diam voluptua.

                tokens:Lorem ipsum dolor
                term_min:(2, 'Lorem') inverted_index:1, 7
                Document [1]:Lorem ipsum dolor sit amet,
                Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

                tokens:Lorem ipsum dolor test
                No matching Document found



                Tested with Python: 3.4.2






                share|improve this answer



























                  0















                  Question: returns a list of matching documents for the words being in the abstracts of the documents




                  The term with the min number of documents, hold always the result.

                  If a term does not exists in inverted_index, gives no match at all.



                    Question: returns a list of matching documents for the words being in the abstracts of the documents




                    The term with the fewest documents bounds the result: every match must be among its documents, so start from that set and intersect it with the sets of the other terms.

                    If a term does not exist in inverted_index, there is no match at all.



                    For the sake of simplicity, predefined data:



                    Abstracts = {1: 'Lorem ipsum dolor sit amet,',
                                 2: 'consetetur sadipscing elitr,',
                                 3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
                                 4: 'sed diam voluptua.',
                                 5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
                                 6: 'Stet clita kasd gubergren,',
                                 7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',
                                 }

                    inverted_index = {'Stet': {6}, 'ipsum': {1, 7}, 'erat,': {3}, 'ut': {3}, 'dolores': {5}, 'gubergren,': {6}, 'kasd': {6}, 'ea': {5}, 'consetetur': {2}, 'sit': {1, 7}, 'nonumy': {3}, 'voluptua.': {4}, 'est': {7}, 'elitr,': {2}, 'At': {5}, 'rebum.': {5}, 'magna': {3}, 'sadipscing': {2}, 'diam': {3, 4}, 'dolore': {3}, 'sanctus': {7}, 'labore': {3}, 'sed': {3, 4}, 'takimata': {7}, 'Lorem': {1, 7}, 'invidunt': {3}, 'aliquyam': {3}, 'accusam': {5}, 'duo': {5}, 'amet.': {7}, 'et': {3, 5}, 'sea': {7}, 'dolor': {1, 7}, 'vero': {5}, 'no': {7}, 'eos': {5}, 'tempor': {3}, 'amet,': {1}, 'clita': {6}, 'justo': {5}, 'eirmod': {3}}

                    def and_query(tokens):
                        print("tokens:{}".format(tokens))
                        #terms = preprocess(tokenize(tokens))
                        terms = tokens.split()

                        term_min = None
                        for term in terms:
                            if term in inverted_index:
                                # Find min
                                if not term_min or term_min[0] > len(inverted_index[term]):
                                    term_min = (len(inverted_index[term]), term)
                            else:
                                # Break early, if a term is not in inverted_index
                                return set()

                        if term_min is None:
                            # Empty query: nothing to match
                            return set()

                        # The smallest set alone is only an upper bound,
                        # so intersect it with the set of every term.
                        finals = set(inverted_index[term_min[1]])
                        for term in terms:
                            finals &= inverted_index[term]
                        print("term_min:{} inverted_index:{}".format(term_min, finals))
                        return finals


                    def finals_print(finals):
                        if finals:
                            for final in finals:
                                print("Document [{}]:{}".format(final, Abstracts[final]))
                        else:
                            print("No matching Document found")

                    if __name__ == "__main__":
                        for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
                            finals_print(and_query(tokens))
                            print()



                    Output:



                    tokens:sed diam voluptua.
                    term_min:(1, 'voluptua.') inverted_index:{4}
                    Document [4]:sed diam voluptua.

                    tokens:Lorem ipsum dolor
                    term_min:(2, 'Lorem') inverted_index:{1, 7}
                    Document [1]:Lorem ipsum dolor sit amet,
                    Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

                    tokens:Lorem ipsum dolor test
                    No matching Document found



                    Tested with Python: 3.4.2
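
The same idea can be written more compactly. The following is only a sketch, not part of the tested answer: `and_query_compact` and `sample_index` are hypothetical names, and the sample index is a small excerpt of the data above. It sorts the sets by size and lets `set.intersection` do the rest.

```python
def and_query_compact(query, index):
    """Return the ids of documents that contain every whitespace-separated term."""
    terms = query.split()
    # An empty query or any unknown term means no document can match.
    if not terms or any(term not in index for term in terms):
        return set()
    # Smallest set first keeps the intermediate results small.
    postings = sorted((index[term] for term in terms), key=len)
    return set.intersection(*postings)


# Hypothetical excerpt of the inverted index, for illustration only.
sample_index = {
    'sed': {3, 4}, 'diam': {3, 4}, 'voluptua.': {4},
    'Lorem': {1, 7}, 'ipsum': {1, 7}, 'et': {3, 5},
}

print(and_query_compact('sed diam voluptua.', sample_index))  # {4}
print(and_query_compact('sed et', sample_index))              # {3}
print(and_query_compact('Lorem test', sample_index))          # set()
```

The 'sed et' query shows why the intersection is needed: both terms have two documents, but only document 3 contains both.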






                    answered Nov 11 '18 at 20:02









                    stovfl

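                        I eventually found the solution myself. Replacing



                            finals.extend(documents.intersection(id_set_for_one_word))
                        return finals


                        with



                            documents = documents.intersection(id_set_for_one_word)
                        return documents


                        seems to work here: extend kept appending every per-term intersection to the list, whereas reassigning documents narrows it down term by term.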



                        Still, thanks for all the effort y'all.
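
For reference, here is a minimal self-contained sketch of the running-intersection fix. The `abstracts` sample data and the `simple_terms` helper are assumptions made for illustration: the question does not show the real dataset or the bodies of `preprocess` and `tokenize`, so a lowercase whitespace split stands in for them.

```python
from collections import defaultdict

# Hypothetical sample data; the real Abstracts come from the asker's dataset.
abstracts = {
    1: 'vaccine trial in the netherlands',
    2: 'vaccine trial in belgium',
    3: 'netherlands tulip export',
}


def simple_terms(text):
    # Stand-in for preprocess(tokenize(text)), which the question does not show.
    return text.lower().split()


inverted_index = defaultdict(set)
for doc_id, abstract in abstracts.items():
    for term in simple_terms(abstract):
        inverted_index[term].add(doc_id)


def and_query(query):
    terms = simple_terms(query)
    if not terms:
        return set()
    # .get avoids adding empty entries to the defaultdict for unknown terms.
    documents = set(inverted_index.get(terms[0], set()))
    for term in terms[1:]:
        # Reassigning narrows the candidates; extend() would only pile them up.
        documents = documents.intersection(inverted_index.get(term, set()))
    return documents


print(and_query('netherlands vaccine trial'))  # {1}
print(and_query('vaccine trial'))              # {1, 2}
print(and_query('netherlands cheese'))         # set()
```

Starting from the first term's set rather than from the union of all terms gives the same result with one pass fewer.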






                            answered Nov 12 '18 at 9:34









                            Jorian Onderwater
