How to prevent pandas resample from resampling id columns









up vote
0
down vote

favorite












I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.



>>>print(df.head())
site_id type_id equipment_id timestamp value
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868


I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code



>>> df = df 
... .set_index(['timestamp'])
... .sort_values(['site_id','type_id','equipment_id','timestamp'])
... .groupby(['site_id','type_id','equipment_id'])
... .resample('15T')
... .mean()


I'm getting unexpected results, all of the id values from the index have been duplicated. It seems to be using the dtype rather than whether the column is in the index or not to perform the aggregation? Am I doning something wrong?



 site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763


Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?










share|improve this question























  • I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
    – Kris
    Nov 10 at 22:12














up vote
0
down vote

favorite












I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.



>>>print(df.head())
site_id type_id equipment_id timestamp value
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868


I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code



>>> df = df 
... .set_index(['timestamp'])
... .sort_values(['site_id','type_id','equipment_id','timestamp'])
... .groupby(['site_id','type_id','equipment_id'])
... .resample('15T')
... .mean()


I'm getting unexpected results, all of the id values from the index have been duplicated. It seems to be using the dtype rather than whether the column is in the index or not to perform the aggregation? Am I doning something wrong?



 site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763


Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?










share|improve this question























  • I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
    – Kris
    Nov 10 at 22:12












up vote
0
down vote

favorite









up vote
0
down vote

favorite











I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.



>>>print(df.head())
site_id type_id equipment_id timestamp value
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868


I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code



>>> df = df 
... .set_index(['timestamp'])
... .sort_values(['site_id','type_id','equipment_id','timestamp'])
... .groupby(['site_id','type_id','equipment_id'])
... .resample('15T')
... .mean()


I'm getting unexpected results, all of the id values from the index have been duplicated. It seems to be using the dtype rather than whether the column is in the index or not to perform the aggregation? Am I doning something wrong?



 site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763


Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?










share|improve this question















I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.



>>>print(df.head())
site_id type_id equipment_id timestamp value
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868


I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code



>>> df = df 
... .set_index(['timestamp'])
... .sort_values(['site_id','type_id','equipment_id','timestamp'])
... .groupby(['site_id','type_id','equipment_id'])
... .resample('15T')
... .mean()


I'm getting unexpected results, all of the id values from the index have been duplicated. It seems to be using the dtype rather than whether the column is in the index or not to perform the aggregation? Am I doning something wrong?



 site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763


Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?







pandas pandas-groupby






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 9 at 1:08

























asked Nov 9 at 1:01









David Waterworth

528416




528416











  • I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
    – Kris
    Nov 10 at 22:12
















  • I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
    – Kris
    Nov 10 at 22:12















I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
– Kris
Nov 10 at 22:12




I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
– Kris
Nov 10 at 22:12












1 Answer
1






active

oldest

votes

















up vote
0
down vote













This happens to a MultiIndex if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:



df.sort_index(inplace=True)


For instance,



df = pd.DataFrame(
data=np.random.rand(5, 4),
index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())


which produces:



 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
x 0.369067 0.944636 0.740877 0.751681
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228





share|improve this answer






















  • That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
    – David Waterworth
    Nov 10 at 21:53











  • Ah sorry, yes I see.
    – Kris
    Nov 10 at 22:08










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53218356%2fhow-to-prevent-pandas-resample-from-resampling-id-columns%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
0
down vote













This happens to a MultiIndex if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:



df.sort_index(inplace=True)


For instance,



df = pd.DataFrame(
data=np.random.rand(5, 4),
index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())


which produces:



 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
x 0.369067 0.944636 0.740877 0.751681
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228





share|improve this answer






















  • That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
    – David Waterworth
    Nov 10 at 21:53











  • Ah sorry, yes I see.
    – Kris
    Nov 10 at 22:08














up vote
0
down vote













This happens to a MultiIndex if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:



df.sort_index(inplace=True)


For instance,



df = pd.DataFrame(
data=np.random.rand(5, 4),
index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())


which produces:



 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
x 0.369067 0.944636 0.740877 0.751681
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228





share|improve this answer






















  • That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
    – David Waterworth
    Nov 10 at 21:53











  • Ah sorry, yes I see.
    – Kris
    Nov 10 at 22:08












up vote
0
down vote










up vote
0
down vote









This happens to a MultiIndex if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:



df.sort_index(inplace=True)


For instance,



df = pd.DataFrame(
data=np.random.rand(5, 4),
index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())


which produces:



 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
x 0.369067 0.944636 0.740877 0.751681
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228





share|improve this answer














This happens to a MultiIndex if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:



df.sort_index(inplace=True)


For instance,



df = pd.DataFrame(
data=np.random.rand(5, 4),
index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())


which produces:



 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
x 0.369067 0.944636 0.740877 0.751681
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 9 at 3:01

























answered Nov 9 at 2:50









Kris

5,30311219




5,30311219











  • That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
    – David Waterworth
    Nov 10 at 21:53











  • Ah sorry, yes I see.
    – Kris
    Nov 10 at 22:08
















  • That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
    – David Waterworth
    Nov 10 at 21:53











  • Ah sorry, yes I see.
    – Kris
    Nov 10 at 22:08















That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53





That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53













Ah sorry, yes I see.
– Kris
Nov 10 at 22:08




Ah sorry, yes I see.
– Kris
Nov 10 at 22:08

















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53218356%2fhow-to-prevent-pandas-resample-from-resampling-id-columns%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

Edmonton

Crossroads (UK TV series)