Pandas groupby.size vs series.value_counts vs collections.Counter with multiple series

There are many questions (1, 2, 3) dealing with counting values in a single series.



However, there are fewer questions looking at the best way to count combinations of two or more series. Solutions are presented (1, 2), but when and why one should use each is not discussed.



Below is some benchmarking for three potential methods. I have two specific questions:

1. Why is grouper so efficient versus count? I expected count to be faster, as it is written in C.
2. Why does value_counter underperform grouper so significantly? Is this due to the cost of constructing a list, or of constructing a series from a list?

I understand the outputs are different, and this should also inform choice. For example, filtering by count is more efficient with contiguous numpy arrays versus a dictionary comprehension:


x, z = grouper(df), count(df)

%timeit x[x.values > 10]                        # 749 µs
%timeit {k: v for k, v in z.items() if v > 10}  # 9.37 ms



However, the focus of my question is on the performance of building comparable results in a series versus a dictionary. My C knowledge is limited, but I would appreciate any answer that can point to the logic underlying these methods.



Benchmarking code


import pandas as pd
import numpy as np
from collections import Counter

np.random.seed(0)

m, n = 1000, 100000

df = pd.DataFrame({'A': np.random.randint(0, m, n),
                   'B': np.random.randint(0, m, n)})

def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()

def value_counter(df):
    return pd.Series(list(zip(df.A, df.B))).value_counts(sort=False)

def count(df):
    return Counter(zip(df.A.values, df.B.values))

x = value_counter(df).to_dict()
y = grouper(df).to_dict()
z = count(df)

assert (x == y) & (y == z), "Dictionary mismatch!"

for m, n in [(100, 10000), (1000, 10000), (100, 100000), (1000, 100000)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)
    %timeit value_counter(df)
    %timeit count(df)



Benchmarking results



Run on Python 3.6.2, pandas 0.20.3, numpy 1.13.1.



Machine specs: Windows 7 64-bit, Dual-Core 2.5 GHz, 4GB RAM.



Key: g = grouper, v = value_counter, c = count. Timings are in milliseconds.

m      n        g        v         c
100    10000    2.91     18.30     8.41
1000   10000    4.10     27.20     6.98 [1]
100    100000   17.90    130.00    84.50
1000   100000   43.90    309.00    93.50



[1] This is not a typo.





A small sidebar: pd.Series(list(zip(df.A, df.B))).value_counts(sort=False) improves things a little, so I assume the sorting contributes overhead in addition to the list casting.
– Vivek Kalyanarangan
May 14 at 10:54






I'm not surprised at all that the function tailor-made for this exact use case performs best. pandas knows far more about the structure of its data than Counter does. In addition, pandas is probably much less memory-intensive, since it knows how to reuse its existing memory.
– BallpointBen
May 17 at 17:16





@BallpointBen, from a philosophical point of view, your comment makes perfect sense. Can you pinpoint the specific underlying reasons (e.g. hashing, cost of iteration, etc.) with reference to the source code?
– jpp
May 17 at 17:17






Also, for an even more performant version of groupby, pass sort=False to groupby.
– BallpointBen
May 17 at 17:24





@Parfait, Updated with (a) np.random.seed(0), (b) later versions of Python / numpy / pandas + included machine specs, (c) sort=False for pandas methods.
– jpp
May 17 at 19:31






1 Answer



There's actually a bit of hidden overhead in zip(df.A.values, df.B.values). The key here comes down to numpy arrays being stored in memory in a fundamentally different way than Python objects.





A numpy array, such as np.arange(10), is essentially stored as a contiguous block of memory, and not as individual Python objects. Conversely, a Python list, such as list(range(10)), is stored in memory as pointers to individual Python objects (i.e. integers 0-9). This difference is the basis for why numpy arrays are smaller in memory than the Python equivalent lists, and why you can perform faster computations on numpy arrays.
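As a rough illustration of that difference (a sketch; exact byte counts vary by platform and Python version, so only the relative comparison matters):

```python
import sys
import numpy as np

a = np.arange(10)          # one contiguous buffer of machine integers
lst = list(range(10))      # an array of pointers to individual int objects

array_bytes = a.nbytes     # payload of the contiguous block
# For the list, count the pointer array plus each boxed int object.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)

print(array_bytes, list_bytes)  # the list costs several times more
```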





So, as Counter is consuming the zip, the associated tuples need to be created as Python objects. This means that Python needs to extract the tuple values from numpy data and create corresponding Python objects in memory. There is noticeable overhead to this, which is why you want to be very careful when combining pure Python functions with numpy data. A basic example of this pitfall that you might commonly see is using the built-in Python sum on a numpy array: sum(np.arange(10**5)) is actually a bit slower than the pure Python sum(range(10**5)), and both of which are of course significantly slower than np.sum(np.arange(10**5)).
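The sum example can be checked directly: all three calls agree on the result and differ only in how much boxing they do along the way (a quick sketch):

```python
import numpy as np

n = 10**5
# sum(np.arange(n)) boxes each element into a Python object as it iterates;
# sum(range(n)) stays in pure Python; np.sum works on the raw buffer in C.
# All three produce the same value, the closed form n*(n-1)/2.
total = n * (n - 1) // 2
assert sum(np.arange(n)) == total
assert sum(range(n)) == total
assert np.sum(np.arange(n)) == total
```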





See this video for a more in-depth discussion of this topic.



As an example specific to this question, observe the following timings comparing the performance of Counter on zipped numpy arrays vs. the corresponding zipped Python lists.




In [2]: a = np.random.randint(10**4, size=10**6)
...: b = np.random.randint(10**4, size=10**6)
...: a_list = a.tolist()
...: b_list = b.tolist()

In [3]: %timeit Counter(zip(a, b))
455 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit Counter(zip(a_list, b_list))
334 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



The difference between these two timings gives you a reasonable estimate of the overhead discussed earlier.
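A practical corollary (an assumption worth benchmarking on your own data, not a guarantee): if you are set on Counter, converting the arrays to plain lists up front with tolist unboxes every element exactly once, and the resulting counts are identical either way:

```python
import numpy as np
from collections import Counter

a = np.random.randint(10, size=1000)
b = np.random.randint(10, size=1000)

# Zip the raw arrays (boxes elements lazily, one tuple at a time) ...
c_arrays = Counter(zip(a, b))
# ... versus converting to plain Python lists first, then zipping.
c_lists = Counter(zip(a.tolist(), b.tolist()))

# np.int64 and int hash and compare equal, so the two Counters agree.
assert c_arrays == c_lists
```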



This isn't quite the end of the story though. Constructing a groupby object in pandas involves some overhead too, at least as it relates to this problem, since there's some groupby metadata that isn't strictly necessary just to get size, whereas Counter does the one thing you care about. Usually this overhead is far less than the overhead associated with Counter, but from some quick experimentation I've found that you can actually get marginally better performance from Counter when the majority of your groups consist of single elements.





Consider the following timings (using @BallpointBen's sort=False suggestion) that span the spectrum from few large groups to many small groups:




def grouper(df):
    return df.groupby(['A', 'B'], sort=False).size()

def count(df):
    return Counter(zip(df.A.values, df.B.values))

for m, n in [(10, 10**6), (10**3, 10**6), (10**7, 10**6)]:

    df = pd.DataFrame({'A': np.random.randint(0, m, n),
                       'B': np.random.randint(0, m, n)})

    print(m, n)

    %timeit grouper(df)
    %timeit count(df)



Which gives me the following table:


m grouper counter
10 62.9 ms 315 ms
10**3 191 ms 535 ms
10**7 514 ms 459 ms



Of course, any gains from Counter would be offset by converting back to a Series, if that's what you want as your final object.
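To make that conversion cost concrete, here is one way (a sketch; the index construction could be done differently) to turn the Counter result back into a series comparable to the groupby.size output:

```python
import numpy as np
import pandas as pd
from collections import Counter

df = pd.DataFrame({'A': np.random.randint(0, 10, 1000),
                   'B': np.random.randint(0, 10, 1000)})

c = Counter(zip(df.A.values, df.B.values))

# Rebuild a series with a MultiIndex matching what groupby.size produces;
# this step eats into any gains Counter made during the counting itself.
s = pd.Series(c)
s.index = pd.MultiIndex.from_tuples(s.index, names=['A', 'B'])

assert s.sort_index().equals(df.groupby(['A', 'B']).size().sort_index())
```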







Excellent answer and supplementary timings, thanks. One question: do you have a reference for the claim that materializing the zip creates tuples of Python objects? I thought tuple objects were only produced when you call list, next, etc., and wasn't aware that tuples are created internally before being consumed by Counter.
– jpp
May 17 at 22:29








Unclear wording on my part: I meant that as Counter consumes the zip, the associated tuples need to be created in memory. So the tuples are being created while being consumed by Counter. Basically, Counter iterates over the zip in a for loop, so during each iteration of the loop the associated tuple from the zip needs to be created. This _count_elements function (or a C equivalent) is essentially how Counter counts things.
– root
May 17 at 22:59
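For reference, the pure-Python fallback behind _count_elements looks essentially like the sketch below (mirroring CPython's Lib/collections/__init__.py; the C helper performs the same loop). Each pass of the loop is the point at which zip materializes a fresh tuple:

```python
def count_elements(mapping, iterable):
    # Pure-Python fallback for collections._count_elements: pull each item
    # from the iterable (here is where zip builds its tuples) and tally it.
    mapping_get = mapping.get
    for elem in iterable:
        mapping[elem] = mapping_get(elem, 0) + 1

counts = {}
count_elements(counts, zip([1, 1, 2], [3, 3, 4]))
# counts == {(1, 3): 2, (2, 4): 1}
```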





