How to prevent pandas resample from resampling id columns
up vote
0
down vote
favorite
I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.
>>>print(df.head())
site_id type_id equipment_id timestamp value
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868
I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code
>>> df = df
... .set_index(['timestamp'])
... .sort_values(['site_id','type_id','equipment_id','timestamp'])
... .groupby(['site_id','type_id','equipment_id'])
... .resample('15T')
... .mean()
I'm getting unexpected results, all of the id values from the index have been duplicated. It seems to be using the dtype rather than whether the column is in the index or not to perform the aggregation? Am I doning something wrong?
site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763
Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?
pandas pandas-groupby
add a comment |
up vote
0
down vote
favorite
I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.
>>>print(df.head())
site_id type_id equipment_id timestamp value
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868
I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code
>>> df = df
... .set_index(['timestamp'])
... .sort_values(['site_id','type_id','equipment_id','timestamp'])
... .groupby(['site_id','type_id','equipment_id'])
... .resample('15T')
... .mean()
I'm getting unexpected results, all of the id values from the index have been duplicated. It seems to be using the dtype rather than whether the column is in the index or not to perform the aggregation? Am I doning something wrong?
site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763
Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?
pandas pandas-groupby
I'm not sure, but it might be intended behavior. The reason is that.resample(.)
may yield rows that haveNaN
s due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g.result.site_id.notnull()
. What do you think?
– Kris
Nov 10 at 22:12
add a comment |
up vote
0
down vote
favorite
up vote
0
down vote
favorite
I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.
>>>print(df.head())
site_id type_id equipment_id timestamp value
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868
I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code
>>> df = df
... .set_index(['timestamp'])
... .sort_values(['site_id','type_id','equipment_id','timestamp'])
... .groupby(['site_id','type_id','equipment_id'])
... .resample('15T')
... .mean()
I'm getting unexpected results, all of the id values from the index have been duplicated. It seems to be using the dtype rather than whether the column is in the index or not to perform the aggregation? Am I doning something wrong?
site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763
Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?
pandas pandas-groupby
I have a dataframe with id columns (site_id,type_id,equipment_id), a timestamp and a value as below.
>>>print(df.head())
site_id type_id equipment_id timestamp value
47 9 332859965468 2018-07-04 10:30:04.052000+10:00 23.000000
47 9 332859965468 2018-07-04 10:30:04.064000+10:00 22.050505
47 9 332859965468 2018-07-04 10:30:04.090000+10:00 26.046154
47 9 332859965468 2018-07-04 10:30:04.101000+10:00 22.000000
47 9 332859965468 2018-07-04 10:30:04.113000+10:00 191.989868
I'm trying to resample within each (site_id,type_id,equipment_id) group using the following code
>>> df = df
... .set_index(['timestamp'])
... .sort_values(['site_id','type_id','equipment_id','timestamp'])
... .groupby(['site_id','type_id','equipment_id'])
... .resample('15T')
... .mean()
I'm getting unexpected results, all of the id values from the index have been duplicated. It seems to be using the dtype rather than whether the column is in the index or not to perform the aggregation? Am I doning something wrong?
site_id type_id equipment_id value
site_id type_id equipment_id timestamp
47 9 332859965468 2018-07-04 10:30:00+10:00 47.0 9.0 3.328600e+11 58.718625
2018-07-04 10:45:00+10:00 47.0 9.0 3.328600e+11 59.175833
2018-07-04 11:00:00+10:00 47.0 9.0 3.328600e+11 59.238318
2018-07-04 11:15:00+10:00 47.0 9.0 3.328600e+11 58.982763
Edit: I've noticed adding .reset_index(drop=True) removes the duplicate columns - but the issue now is the integer id columns have been converted to floats?
pandas pandas-groupby
pandas pandas-groupby
edited Nov 9 at 1:08
asked Nov 9 at 1:01
David Waterworth
528416
528416
I'm not sure, but it might be intended behavior. The reason is that.resample(.)
may yield rows that haveNaN
s due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g.result.site_id.notnull()
. What do you think?
– Kris
Nov 10 at 22:12
add a comment |
I'm not sure, but it might be intended behavior. The reason is that.resample(.)
may yield rows that haveNaN
s due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g.result.site_id.notnull()
. What do you think?
– Kris
Nov 10 at 22:12
I'm not sure, but it might be intended behavior. The reason is that
.resample(.)
may yield rows that have NaN
s due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull()
. What do you think?– Kris
Nov 10 at 22:12
I'm not sure, but it might be intended behavior. The reason is that
.resample(.)
may yield rows that have NaN
s due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g. result.site_id.notnull()
. What do you think?– Kris
Nov 10 at 22:12
add a comment |
1 Answer
1
active
oldest
votes
up vote
0
down vote
This happens to a MultiIndex
if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:
df.sort_index(inplace=True)
For instance,
df = pd.DataFrame(
data=np.random.rand(5, 4),
index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())
which produces:
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
x 0.369067 0.944636 0.740877 0.751681
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53
Ah sorry, yes I see.
– Kris
Nov 10 at 22:08
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
This happens to a MultiIndex
if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:
df.sort_index(inplace=True)
For instance,
df = pd.DataFrame(
data=np.random.rand(5, 4),
index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())
which produces:
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
x 0.369067 0.944636 0.740877 0.751681
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53
Ah sorry, yes I see.
– Kris
Nov 10 at 22:08
add a comment |
up vote
0
down vote
This happens to a MultiIndex
if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:
df.sort_index(inplace=True)
For instance,
df = pd.DataFrame(
data=np.random.rand(5, 4),
index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())
which produces:
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
x 0.369067 0.944636 0.740877 0.751681
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53
Ah sorry, yes I see.
– Kris
Nov 10 at 22:08
add a comment |
up vote
0
down vote
up vote
0
down vote
This happens to a MultiIndex
if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:
df.sort_index(inplace=True)
For instance,
df = pd.DataFrame(
data=np.random.rand(5, 4),
index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())
which produces:
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
x 0.369067 0.944636 0.740877 0.751681
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
This happens to a MultiIndex
if the index isn't sorted. If you'd like to have the index looking "clean" again, you could do:
df.sort_index(inplace=True)
For instance,
df = pd.DataFrame(
data=np.random.rand(5, 4),
index=pd.MultiIndex.from_tuples([(i, j) for i, j in zip(np.random.choice(['a', 'b'], 5), np.random.choice(['x', 'y'], 5))])
)
print(df)
print(df.sort_index())
which produces:
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
a x 0.369067 0.944636 0.740877 0.751681
0 1 2 3
a x 0.198659 0.616800 0.438903 0.830216
x 0.369067 0.944636 0.740877 0.751681
y 0.649111 0.860940 0.440068 0.044067
b x 0.178537 0.601514 0.898179 0.140358
y 0.444738 0.393664 0.877928 0.913228
edited Nov 9 at 3:01
answered Nov 9 at 2:50
Kris
5,30311219
5,30311219
That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53
Ah sorry, yes I see.
– Kris
Nov 10 at 22:08
add a comment |
That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53
Ah sorry, yes I see.
– Kris
Nov 10 at 22:08
That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53
That's not really what I'm seeing though - in my example above the site_id column appears twice. Once in the index and once in the dataframe? If I change it from numeric to string it works as expected
– David Waterworth
Nov 10 at 21:53
Ah sorry, yes I see.
– Kris
Nov 10 at 22:08
Ah sorry, yes I see.
– Kris
Nov 10 at 22:08
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53218356%2fhow-to-prevent-pandas-resample-from-resampling-id-columns%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
I'm not sure, but it might be intended behavior. The reason is that
.resample(.)
may yield rows that haveNaN
s due to empty resampling buckets. To see this, just decrease the resampling period. Perhaps you want to be able to filter by e.g.result.site_id.notnull()
. What do you think?– Kris
Nov 10 at 22:12