Python Pandas Create New Column with Groupby().Sum()
Trying to create a new column from a groupby calculation. In the code below, I get the correct calculated values for each date (see group below), but when I try to create a new column (df['Data4']) with them I get NaN. I want the new column to hold the sum of 'Data3' across all rows with the same date. For example, 2015-05-08 appears in two rows (total 50 + 5 = 55), and I would like both of those rows to show 55 in the new column.



import pandas as pd

df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'],
                   'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
                   'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
                   'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})

group = df['Data3'].groupby(df['Date']).sum()

df['Data4'] = group  # yields NaN: `group` is indexed by Date, df by 0..7, so nothing aligns
      python pandas

asked May 14 '15 at 18:44 by fe ner

2 Answers
You want to use transform; this will return a Series with its index aligned to the df, so you can then add it as a new column:



In [74]:

df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'],
                   'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'],
                   'Data2': [11, 8, 10, 15, 110, 60, 100, 40],
                   'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})

df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
df
Out[74]:
   Data2  Data3        Date   Sym  Data4
0     11      5  2015-05-08  aapl     55
1      8      8  2015-05-07  aapl    108
2     10      6  2015-05-06  aapl     66
3     15      1  2015-05-05  aapl    121
4    110     50  2015-05-08  aaww     55
5     60    100  2015-05-07  aaww    108
6    100     60  2015-05-06  aaww     66
7     40    120  2015-05-05  aaww    121

answered May 14 '15 at 18:46 by EdChum

          • What happens if we have a second groupby as in here: stackoverflow.com/a/40067099/281545

            – Mr_and_Mrs_D
            May 5 '18 at 20:40











          • @Mr_and_Mrs_D you'd have to reset the index and perform a left merge on the common columns in that case to add the column back

            – EdChum
            May 5 '18 at 20:56
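
The reset-index-plus-merge approach mentioned in that comment can be sketched as follows (a hypothetical two-key variant of the question's data; the column and key names are illustrative, not from the linked answer):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-08', '2015-05-07'],
                   'Sym':  ['aapl', 'aapl', 'aaww'],
                   'Data3': [5, 10, 8]})

# Aggregate over both keys, then left-merge the sums back onto the original rows.
sums = (df.groupby(['Date', 'Sym'], as_index=False)['Data3']
          .sum()
          .rename(columns={'Data3': 'Data4'}))
df = df.merge(sums, on=['Date', 'Sym'], how='left')
# Each row now carries the total for its (Date, Sym) pair.
```

Note that for this particular case, transform with a list of keys (df.groupby(['Date', 'Sym'])['Data3'].transform('sum')) also works; the merge route is the more general fallback.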






• Alternatively, one can use df.groupby('Date')['Data3'].transform('sum') (which I find slightly easier to remember).

            – Cleb
            Aug 24 '18 at 11:32
I stumbled upon an interesting idiosyncrasy in the API. It seems you can consistently shave a few milliseconds off the time taken by transform if you instead use a direct GroupBy function and broadcast the result using map:



df
         Date   Sym  Data2  Data3
0  2015-05-08  aapl     11      5
1  2015-05-07  aapl      8      8
2  2015-05-06  aapl     10      6
3  2015-05-05  aapl     15      1
4  2015-05-08  aaww    110     50
5  2015-05-07  aaww     60    100
6  2015-05-06  aaww    100     60
7  2015-05-05  aaww     40    120

df.Date.map(df.groupby('Date')['Data3'].sum())

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Date, dtype: int64


          Compare with



df.groupby('Date')['Data3'].transform('sum')

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data3, dtype: int64



My tests show that map is a bit faster if you can afford to use a direct GroupBy function (such as mean, min, max, first, etc.). It is faster for most general situations up to around ~200 thousand records; beyond that, the performance really depends on the data.



[Benchmark plot: GroupBy.transform vs. GroupBy.sum + map runtime as N grows]



I would say this is a nice alternative to know, and it does better on smaller frames with smaller numbers of groups, but I would recommend transform as a first choice. Thought this was worth sharing anyway.
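
To actually attach the mapped result as a column, mirroring the question's Data4, the assignment is one line (a small usage sketch; the assignment itself is the only addition to the snippet above):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-08'],
                   'Data3': [5, 8, 50]})

# Map each row's Date to that date's total; alignment is by Date value, not row index.
df['Data4'] = df['Date'].map(df.groupby('Date')['Data3'].sum())
```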



          Benchmarking code, for reference:



import perfplot
import pandas as pd
import numpy as np

perfplot.show(
    setup=lambda n: pd.DataFrame({'A': np.random.choice(n//10, n), 'B': np.ones(n)}),
    kernels=[
        lambda df: df.groupby('A')['B'].transform('sum'),
        lambda df: df.A.map(df.groupby('A')['B'].sum()),
    ],
    labels=['GroupBy.transform', 'GroupBy.sum + map'],
    n_range=[2**k for k in range(5, 20)],
    xlabel='N',
    logy=True,
    logx=True
)

edited Jan 29 at 9:40, answered Jan 29 at 9:09 by coldspeed
            Your Answer






            StackExchange.ifUsing("editor", function ()
            StackExchange.using("externalEditor", function ()
            StackExchange.using("snippets", function ()
            StackExchange.snippets.init();
            );
            );
            , "code-snippets");

            StackExchange.ready(function()
            var channelOptions =
            tags: "".split(" "),
            id: "1"
            ;
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function()
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled)
            StackExchange.using("snippets", function()
            createEditor();
            );

            else
            createEditor();

            );

            function createEditor()
            StackExchange.prepareEditor(
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader:
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            ,
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            );



            );













            draft saved

            draft discarded


















            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f30244952%2fpython-pandas-create-new-column-with-groupby-sum%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            129














            You want to use transform this will return a Series with the index aligned to the df so you can then add it as a new column:



            In [74]:

            df = pd.DataFrame('Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40],'Data3': [5, 8, 6, 1, 50, 100, 60, 120])

            df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
            df
            Out[74]:
            Data2 Data3 Date Sym Data4
            0 11 5 2015-05-08 aapl 55
            1 8 8 2015-05-07 aapl 108
            2 10 6 2015-05-06 aapl 66
            3 15 1 2015-05-05 aapl 121
            4 110 50 2015-05-08 aaww 55
            5 60 100 2015-05-07 aaww 108
            6 100 60 2015-05-06 aaww 66
            7 40 120 2015-05-05 aaww 121





            share|improve this answer























            • What happens if we have a second groupby as in here: stackoverflow.com/a/40067099/281545

              – Mr_and_Mrs_D
              May 5 '18 at 20:40











            • @Mr_and_Mrs_D you'd have to reset the index and perform a left merge on the common columns in that case to add the column back

              – EdChum
              May 5 '18 at 20:56






            • 3





              Alternatively, one can use df.groupby('Date')['Data3'].transform('sum') (which I find slightly easier to remember).

              – Cleb
              Aug 24 '18 at 11:32















            129














            You want to use transform this will return a Series with the index aligned to the df so you can then add it as a new column:



            In [74]:

            df = pd.DataFrame('Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40],'Data3': [5, 8, 6, 1, 50, 100, 60, 120])

            df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
            df
            Out[74]:
            Data2 Data3 Date Sym Data4
            0 11 5 2015-05-08 aapl 55
            1 8 8 2015-05-07 aapl 108
            2 10 6 2015-05-06 aapl 66
            3 15 1 2015-05-05 aapl 121
            4 110 50 2015-05-08 aaww 55
            5 60 100 2015-05-07 aaww 108
            6 100 60 2015-05-06 aaww 66
            7 40 120 2015-05-05 aaww 121





            share|improve this answer























            • What happens if we have a second groupby as in here: stackoverflow.com/a/40067099/281545

              – Mr_and_Mrs_D
              May 5 '18 at 20:40











            • @Mr_and_Mrs_D you'd have to reset the index and perform a left merge on the common columns in that case to add the column back

              – EdChum
              May 5 '18 at 20:56






            • 3





              Alternatively, one can use df.groupby('Date')['Data3'].transform('sum') (which I find slightly easier to remember).

              – Cleb
              Aug 24 '18 at 11:32













            129












            129








            129







            You want to use transform this will return a Series with the index aligned to the df so you can then add it as a new column:



            In [74]:

            df = pd.DataFrame('Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40],'Data3': [5, 8, 6, 1, 50, 100, 60, 120])

            df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
            df
            Out[74]:
            Data2 Data3 Date Sym Data4
            0 11 5 2015-05-08 aapl 55
            1 8 8 2015-05-07 aapl 108
            2 10 6 2015-05-06 aapl 66
            3 15 1 2015-05-05 aapl 121
            4 110 50 2015-05-08 aaww 55
            5 60 100 2015-05-07 aaww 108
            6 100 60 2015-05-06 aaww 66
            7 40 120 2015-05-05 aaww 121





            share|improve this answer













            You want to use transform this will return a Series with the index aligned to the df so you can then add it as a new column:



            In [74]:

            df = pd.DataFrame('Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40],'Data3': [5, 8, 6, 1, 50, 100, 60, 120])

            df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
            df
            Out[74]:
            Data2 Data3 Date Sym Data4
            0 11 5 2015-05-08 aapl 55
            1 8 8 2015-05-07 aapl 108
            2 10 6 2015-05-06 aapl 66
            3 15 1 2015-05-05 aapl 121
            4 110 50 2015-05-08 aaww 55
            5 60 100 2015-05-07 aaww 108
            6 100 60 2015-05-06 aaww 66
            7 40 120 2015-05-05 aaww 121






            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered May 14 '15 at 18:46









            EdChumEdChum

            176k33376325




            176k33376325












            • What happens if we have a second groupby as in here: stackoverflow.com/a/40067099/281545

              – Mr_and_Mrs_D
              May 5 '18 at 20:40











            • @Mr_and_Mrs_D you'd have to reset the index and perform a left merge on the common columns in that case to add the column back

              – EdChum
              May 5 '18 at 20:56






            • 3





              Alternatively, one can use df.groupby('Date')['Data3'].transform('sum') (which I find slightly easier to remember).

              – Cleb
              Aug 24 '18 at 11:32

















            • What happens if we have a second groupby as in here: stackoverflow.com/a/40067099/281545

              – Mr_and_Mrs_D
              May 5 '18 at 20:40











            • @Mr_and_Mrs_D you'd have to reset the index and perform a left merge on the common columns in that case to add the column back

              – EdChum
              May 5 '18 at 20:56






            • 3





              Alternatively, one can use df.groupby('Date')['Data3'].transform('sum') (which I find slightly easier to remember).

              – Cleb
              Aug 24 '18 at 11:32
















            What happens if we have a second groupby as in here: stackoverflow.com/a/40067099/281545

            – Mr_and_Mrs_D
            May 5 '18 at 20:40





            What happens if we have a second groupby as in here: stackoverflow.com/a/40067099/281545

            – Mr_and_Mrs_D
            May 5 '18 at 20:40













            @Mr_and_Mrs_D you'd have to reset the index and perform a left merge on the common columns in that case to add the column back

            – EdChum
            May 5 '18 at 20:56





            @Mr_and_Mrs_D you'd have to reset the index and perform a left merge on the common columns in that case to add the column back

            – EdChum
            May 5 '18 at 20:56




            3




            3





            Alternatively, one can use df.groupby('Date')['Data3'].transform('sum') (which I find slightly easier to remember).

            – Cleb
            Aug 24 '18 at 11:32





            Alternatively, one can use df.groupby('Date')['Data3'].transform('sum') (which I find slightly easier to remember).

            – Cleb
            Aug 24 '18 at 11:32













            2














            I stumbled upon an interesting idiosyncrasy in the API. It seems like you consistently can shave off a few milliseconds of the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map:



            df
            Date Sym Data2 Data3
            0 2015-05-08 aapl 11 5
            1 2015-05-07 aapl 8 8
            2 2015-05-06 aapl 10 6
            3 2015-05-05 aapl 15 1
            4 2015-05-08 aaww 110 50
            5 2015-05-07 aaww 60 100
            6 2015-05-06 aaww 100 60
            7 2015-05-05 aaww 40 120




            df.Date.map(df.groupby('Date')['Data3'].sum())

            0 55
            1 108
            2 66
            3 121
            4 55
            5 108
            6 66
            7 121
            Name: Date, dtype: int64


            Compare with



            df.groupby('Date')['Data3'].transform('sum')

            0 55
            1 108
            2 66
            3 121
            4 55
            5 108
            6 66
            7 121
            Name: Data3, dtype: int64



            My tests show that map is a bit faster if you can afford to use the direct GroupBy function (such as mean, min, max, first, etc). It is more or less faster for most general situations upto around ~200 thousand records. After that, the performance really depends on the data.



            enter image description here



            I would say this is a nice alternative to know, and is better if you have smaller frames with smaller numbers of groups, but I would recommend transform as a first choice. Thought this was worth sharing anyway.



            Benchmarking code, for reference:



            import perfplot

            perfplot.show(
            setup=lambda n: pd.DataFrame('A': np.random.choice(n//10, n), 'B': np.ones(n)),
            kernels=[
            lambda df: df.groupby('A')['B'].transform('sum'),
            lambda df: df.A.map(df.groupby('A')['B'].sum()),
            ],
            labels=['GroupBy.transform', 'GroupBy.sum + map'],
            n_range=[2**k for k in range(5, 20)],
            xlabel='N',
            logy=True,
            logx=True
            )





            share|improve this answer





























              2














              I stumbled upon an interesting idiosyncrasy in the API. It seems like you consistently can shave off a few milliseconds of the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map:



              df
              Date Sym Data2 Data3
              0 2015-05-08 aapl 11 5
              1 2015-05-07 aapl 8 8
              2 2015-05-06 aapl 10 6
              3 2015-05-05 aapl 15 1
              4 2015-05-08 aaww 110 50
              5 2015-05-07 aaww 60 100
              6 2015-05-06 aaww 100 60
              7 2015-05-05 aaww 40 120




              df.Date.map(df.groupby('Date')['Data3'].sum())

              0 55
              1 108
              2 66
              3 121
              4 55
              5 108
              6 66
              7 121
              Name: Date, dtype: int64


              Compare with



              df.groupby('Date')['Data3'].transform('sum')

              0 55
              1 108
              2 66
              3 121
              4 55
              5 108
              6 66
              7 121
              Name: Data3, dtype: int64



              My tests show that map is a bit faster if you can afford to use the direct GroupBy function (such as mean, min, max, first, etc). It is more or less faster for most general situations upto around ~200 thousand records. After that, the performance really depends on the data.



              enter image description here



              I would say this is a nice alternative to know, and is better if you have smaller frames with smaller numbers of groups, but I would recommend transform as a first choice. Thought this was worth sharing anyway.



              Benchmarking code, for reference:



              import perfplot

              perfplot.show(
              setup=lambda n: pd.DataFrame('A': np.random.choice(n//10, n), 'B': np.ones(n)),
              kernels=[
              lambda df: df.groupby('A')['B'].transform('sum'),
              lambda df: df.A.map(df.groupby('A')['B'].sum()),
              ],
              labels=['GroupBy.transform', 'GroupBy.sum + map'],
              n_range=[2**k for k in range(5, 20)],
              xlabel='N',
              logy=True,
              logx=True
              )





              share|improve this answer



























                2












                2








                2







                I stumbled upon an interesting idiosyncrasy in the API. It seems like you consistently can shave off a few milliseconds of the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map:



                df
                Date Sym Data2 Data3
                0 2015-05-08 aapl 11 5
                1 2015-05-07 aapl 8 8
                2 2015-05-06 aapl 10 6
                3 2015-05-05 aapl 15 1
                4 2015-05-08 aaww 110 50
                5 2015-05-07 aaww 60 100
                6 2015-05-06 aaww 100 60
                7 2015-05-05 aaww 40 120




                df.Date.map(df.groupby('Date')['Data3'].sum())

                0 55
                1 108
                2 66
                3 121
                4 55
                5 108
                6 66
                7 121
                Name: Date, dtype: int64


                Compare with



                df.groupby('Date')['Data3'].transform('sum')

                0 55
                1 108
                2 66
                3 121
                4 55
                5 108
                6 66
                7 121
                Name: Data3, dtype: int64



                My tests show that map is a bit faster if you can afford to use the direct GroupBy function (such as mean, min, max, first, etc). It is more or less faster for most general situations upto around ~200 thousand records. After that, the performance really depends on the data.



                enter image description here



                I would say this is a nice alternative to know, and is better if you have smaller frames with smaller numbers of groups, but I would recommend transform as a first choice. Thought this was worth sharing anyway.



                Benchmarking code, for reference:



                import perfplot

                perfplot.show(
                setup=lambda n: pd.DataFrame('A': np.random.choice(n//10, n), 'B': np.ones(n)),
                kernels=[
                lambda df: df.groupby('A')['B'].transform('sum'),
                lambda df: df.A.map(df.groupby('A')['B'].sum()),
                ],
                labels=['GroupBy.transform', 'GroupBy.sum + map'],
                n_range=[2**k for k in range(5, 20)],
                xlabel='N',
                logy=True,
                logx=True
                )





                share|improve this answer















                I stumbled upon an interesting idiosyncrasy in the API. It seems like you consistently can shave off a few milliseconds of the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map:



                df
                Date Sym Data2 Data3
                0 2015-05-08 aapl 11 5
                1 2015-05-07 aapl 8 8
                2 2015-05-06 aapl 10 6
                3 2015-05-05 aapl 15 1
                4 2015-05-08 aaww 110 50
                5 2015-05-07 aaww 60 100
                6 2015-05-06 aaww 100 60
                7 2015-05-05 aaww 40 120




                df.Date.map(df.groupby('Date')['Data3'].sum())

                0 55
                1 108
                2 66
                3 121
                4 55
                5 108
                6 66
                7 121
                Name: Date, dtype: int64


                Compare with



                df.groupby('Date')['Data3'].transform('sum')

                0 55
                1 108
                2 66
                3 121
                4 55
                5 108
                6 66
                7 121
                Name: Data3, dtype: int64



                My tests show that map is a bit faster if you can afford to use the direct GroupBy function (such as mean, min, max, first, etc). It is more or less faster for most general situations upto around ~200 thousand records. After that, the performance really depends on the data.



                enter image description here



                I would say this is a nice alternative to know, and is better if you have smaller frames with smaller numbers of groups, but I would recommend transform as a first choice. Thought this was worth sharing anyway.



                Benchmarking code, for reference:



                import perfplot

                perfplot.show(
                setup=lambda n: pd.DataFrame('A': np.random.choice(n//10, n), 'B': np.ones(n)),
                kernels=[
                lambda df: df.groupby('A')['B'].transform('sum'),
                lambda df: df.A.map(df.groupby('A')['B'].sum()),
                ],
                labels=['GroupBy.transform', 'GroupBy.sum + map'],
                n_range=[2**k for k in range(5, 20)],
                xlabel='N',
                logy=True,
                logx=True
                )






                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Jan 29 at 9:40

























                answered Jan 29 at 9:09









                coldspeedcoldspeed

                130k23135221




                130k23135221



























                    draft saved

                    draft discarded
















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid


                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.

                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function ()
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f30244952%2fpython-pandas-create-new-column-with-groupby-sum%23new-answer', 'question_page');

                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

                    Edmonton

                    Crossroads (UK TV series)