How to prevent pandas resample from resampling id columns

up vote
0
down vote

favorite

I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.

>>>print(df.head())
site_id type_id equipment_id timestamp value 
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868

I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code

>>> df = df 
... .set_index(['timestamp']) 
... .sort_values(['site_id','type_id','equipment_id','timestamp']) 
... .groupby(['site_id','type_id','equipment_id']) 
... .resample('15T') 
... .mean()

I'm getting unexpected results, all of the id values from the index have been duplicated. It seems to be using the dtype rather than whether the column is in the index or not to perform the aggregation? Am I doning something wrong?

 site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
 2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
 2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
 2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763

Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?

edited Nov 9 at 1:08

asked Nov 9 at 1:01

David Waterworth

528416

I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
– Kris
Nov 10 at 22:12

add a comment |

up vote
0
down vote

favorite

I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.

>>>print(df.head())
site_id type_id equipment_id timestamp value 
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868

I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code

>>> df = df 
... .set_index(['timestamp']) 
... .sort_values(['site_id','type_id','equipment_id','timestamp']) 
... .groupby(['site_id','type_id','equipment_id']) 
... .resample('15T') 
... .mean()

 site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
 2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
 2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
 2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763

Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?

edited Nov 9 at 1:08

asked Nov 9 at 1:01

David Waterworth

528416

I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
– Kris
Nov 10 at 22:12

add a comment |

up vote
0
down vote

favorite

I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.

>>>print(df.head())
site_id type_id equipment_id timestamp value 
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868

I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code

>>> df = df 
... .set_index(['timestamp']) 
... .sort_values(['site_id','type_id','equipment_id','timestamp']) 
... .groupby(['site_id','type_id','equipment_id']) 
... .resample('15T') 
... .mean()

 site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
 2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
 2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
 2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763

Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?

edited Nov 9 at 1:08

asked Nov 9 at 1:01

David Waterworth

528416

I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.

>>>print(df.head())
site_id type_id equipment_id timestamp value 
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868

I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code

>>> df = df 
... .set_index(['timestamp']) 
... .sort_values(['site_id','type_id','equipment_id','timestamp']) 
... .groupby(['site_id','type_id','equipment_id']) 
... .resample('15T') 
... .mean()

 site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
 2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
 2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
 2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763

Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?

pandas pandas-groupby

edited Nov 9 at 1:08

asked Nov 9 at 1:01

David Waterworth

528416

edited Nov 9 at 1:08

asked Nov 9 at 1:01

David Waterworth

528416

edited Nov 9 at 1:08

asked Nov 9 at 1:01

David Waterworth

528416

asked Nov 9 at 1:01

David Waterworth

528416

asked Nov 9 at 1:01

David Waterworth

528416

I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
– Kris
Nov 10 at 22:12

add a comment |

I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
– Kris
Nov 10 at 22:12

I'm not sure, but it might be intended behavior. The reason is that .resample(.) may yield rows that have NaNs due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull(). What do you think?
– Kris
Nov 10 at 22:12

add a comment |

1 Answer
1

active

oldest

votes

up vote
0
down vote

This happens to a MultiIndex if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:

df.sort_index(inplace=True)

For instance,

df = pd.DataFrame(
 data=np.random.rand(5, 4),
 index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())

which produces:

 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
 y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
 y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
 x 0.369067 0.944636 0.740877 0.751681
 y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
 y 0.444738 0.393664 0.877928 0.913228

edited Nov 9 at 3:01

answered Nov 9 at 2:50

Kris

5,30311219

That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53

Ah sorry, yes I see.
– Kris
Nov 10 at 22:08

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53218356%2fhow-to-prevent-pandas-resample-from-resampling-id-columns%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
0
down vote

This happens to a MultiIndex if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:

df.sort_index(inplace=True)

For instance,

df = pd.DataFrame(
 data=np.random.rand(5, 4),
 index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())

which produces:

 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
 y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
 y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
 x 0.369067 0.944636 0.740877 0.751681
 y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
 y 0.444738 0.393664 0.877928 0.913228

edited Nov 9 at 3:01

answered Nov 9 at 2:50

Kris

5,30311219

That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53

Ah sorry, yes I see.
– Kris
Nov 10 at 22:08

add a comment |

up vote
0
down vote

This happens to a MultiIndex if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:

df.sort_index(inplace=True)

For instance,

df = pd.DataFrame(
 data=np.random.rand(5, 4),
 index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())

which produces:

 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
 y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
 y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
 x 0.369067 0.944636 0.740877 0.751681
 y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
 y 0.444738 0.393664 0.877928 0.913228

edited Nov 9 at 3:01

answered Nov 9 at 2:50

Kris

5,30311219

That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53

Ah sorry, yes I see.
– Kris
Nov 10 at 22:08

add a comment |

up vote
0
down vote

This happens to a MultiIndex if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:

df.sort_index(inplace=True)

For instance,

df = pd.DataFrame(
 data=np.random.rand(5, 4),
 index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())

which produces:

 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
 y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
 y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
 x 0.369067 0.944636 0.740877 0.751681
 y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
 y 0.444738 0.393664 0.877928 0.913228

edited Nov 9 at 3:01

answered Nov 9 at 2:50

Kris

5,30311219

This happens to a MultiIndex if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:

df.sort_index(inplace=True)

For instance,

df = pd.DataFrame(
 data=np.random.rand(5, 4),
 index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())

which produces:

 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
 y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
 y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
 0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
 x 0.369067 0.944636 0.740877 0.751681
 y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
 y 0.444738 0.393664 0.877928 0.913228

edited Nov 9 at 3:01

answered Nov 9 at 2:50

Kris

5,30311219

edited Nov 9 at 3:01

answered Nov 9 at 2:50

Kris

5,30311219

answered Nov 9 at 2:50

Kris

5,30311219

answered Nov 9 at 2:50

Kris

5,30311219

That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53

Ah sorry, yes I see.
– Kris
Nov 10 at 22:08

add a comment |

That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53

Ah sorry, yes I see.
– Kris
Nov 10 at 22:08

That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53

Ah sorry, yes I see.
– Kris
Nov 10 at 22:08

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Dfyjkt