Pandas groupby the same column multiple times based on different column values

























I have a pandas DataFrame that is generated by this snippet:



import pandas as pd

elig = pd.DataFrame({'memberid': [1,1,1,1,1,1,2],
                     'monthid': [201711, 201712, 201801, 201805, 201806, 201807, 201810]})


and I would like to perform a .groupby operation on memberid based on continuous values of monthid, e.g., I would like the (very) end result to be a table looking like this:



memberid | start_month | end_month

1 | 201711 | 201801
1 | 201805 | 201807
2 | 201810 | 201810


I was wondering whether there is an idiomatic Pandas way to do this. So far I have tried a convoluted method, defining a new_elig = defaultdict(list) and then an outside function:



def f(x):
    global new_elig
    new_elig[x.iloc[0]['memberid']].append(x.iloc[0]['monthid'])


and finally



elig.groupby('memberid')[['memberid', 'monthid']].apply(f)


which takes about 5 minutes for ~700k rows in the original DataFrame in order to create new_elig, which then I have to manually inspect for each memberid so as to get the continuous ranges.



Is there a better way? There has to be one :/

















      python pandas pandas-groupby
















      asked Nov 12 '18 at 22:40









      nvergos























          1 Answer














          Here is one method that I hope is fast enough for your needs. It involves some manual arithmetic on years and months. That feels dirty, but I think it is faster than converting the monthid column to a datetime Series with pd.to_datetime(elig['monthid'], format='%Y%m'), etc.
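          A quick illustration of why the year and month get split out, as a sketch on a few of the question's sample values: differencing the raw monthid integers breaks at a year boundary, while a linear month count (year * 12 + month) does not. The names raw_diff and month_index are introduced here for illustration only.

```python
import pandas as pd

monthid = pd.Series([201711, 201712, 201801, 201805])

# Raw differences: the December -> January step shows up as 89, not 1.
raw_diff = monthid.diff()

# Linear month count: year * 12 + month, so consecutive months differ by 1
# even across a year boundary.
month_index = (monthid // 100) * 12 + monthid % 100
index_diff = month_index.diff()

print(raw_diff.tolist())    # [nan, 1.0, 89.0, 4.0]
print(index_diff.tolist())  # [nan, 1.0, 1.0, 4.0]
```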



          # Get the four-digit year with floor division

          elig['year'] = elig['monthid']//100
          elig['month'] = elig['monthid'] - elig['year']*100


          # Boolean mask 1:
          # If current row minus previous row is NOT 1 month, flag the row with True.
          # Boolean mask 2:
          # If months are contiguous (thus slipping past mask 1)
          # but memberid changes, flag the row with True.
          # (This does not occur in your example data.)

          mask1 = (elig['year']*12 + elig['month']).diff() != 1
          mask2 = elig['memberid'] != elig['memberid'].shift()


          # Convert the flag column to integer and take the cumulative sum.
          # This converts the boolean flags into a column that assigns a
          # unique integer to each contiguous run of consecutive months belonging
          # to the same memberid.

          elig['run_id'] = (mask1 | mask2).astype(int).cumsum()

          res = (
              elig.groupby('run_id')
              .agg({'memberid': 'first', 'monthid': ['first', 'last']})
              .reset_index(drop=True)
          )
          res.columns = ['memberid', 'start_month', 'end_month']

          res
             memberid  start_month  end_month
          0         1       201711     201801
          1         1       201805     201807
          2         2       201810     201810
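          The diff- and shift-based masks compare each row with the one directly above it, so they only work when the rows are already ordered by memberid and then monthid. Below is a minimal sketch that wraps the same steps in a function and sorts first. The function name contiguous_ranges is hypothetical, and the named-aggregation syntax assumes pandas 0.25 or later.

```python
import pandas as pd


def contiguous_ranges(df):
    """Collapse (memberid, monthid) rows into runs of consecutive months."""
    # Sort so that diff() and shift() compare neighbouring months of one member.
    df = df.sort_values(['memberid', 'monthid']).reset_index(drop=True)

    # Linear month count: consecutive months differ by exactly 1.
    month_index = (df['monthid'] // 100) * 12 + df['monthid'] % 100

    # A new run starts where the month gap is not 1 or the member changes.
    new_run = (month_index.diff() != 1) | (df['memberid'] != df['memberid'].shift())
    run_id = new_run.cumsum()

    return (
        df.groupby(run_id)
        .agg(
            memberid=('memberid', 'first'),
            start_month=('monthid', 'first'),
            end_month=('monthid', 'last'),
        )
        .reset_index(drop=True)
    )


# Same data as in the question, deliberately shuffled.
elig = pd.DataFrame({
    'memberid': [1, 2, 1, 1, 1, 1, 1],
    'monthid': [201807, 201810, 201711, 201806, 201801, 201712, 201805],
})
res = contiguous_ranges(elig)
print(res)
```

          Note that a repeated month for the same member has a gap of 0 rather than 1, so in this sketch it starts a new run. Dropping duplicate rows beforehand avoids that if it is not what you want.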













                edited Nov 14 '18 at 15:29

























                answered Nov 13 '18 at 0:10









                Peter Leimbigler





























