Pandas - show percentage of values in one column, grouped by other column
I have a pandas DataFrame with two columns:
the first is Grade, with values 0 to 9;
the second is Criteria, with values 0 or 1.
Grade Criteria
0 0 1
1 1 0
2 2 1
3 2 0
4 5 1
5 2 1
etc
I need to compute a "Criteria rate" for each Grade value: the number of 1s in the Criteria column divided by the number of rows, counted within each Grade group.
For example, Grade = 2 has three rows, two of which have Criteria = 1, so the rate for Grade 2 is 2/3, approximately 0.67.
In my example, the answer should look like:
Grade Criteria
0 0 1.000000
1 1 0.000000
2 2 0.666667
3 5 1.000000
Any ideas on how to do this?
A follow-up question: how would I do this if the Criteria column held "yes"/"no" text values instead?
I've searched here, but found only solutions that group and then divide by the total row count, etc.
Thank you!
2 Answers
You can aggregate sum with size and then divide the columns:
df = df.groupby('Grade')['Criteria'].agg(['sum','size'])
df['new'] = df['sum'] / df['size']
print (df)
sum size new
Grade
0 1 1 1.000000
1 0 1 0.000000
2 2 3 0.666667
5 1 1 1.000000
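For anyone who wants to run this end to end, here is a minimal sketch. It assumes the six rows shown in the question are the whole dataset, and it uses the hypothetical names stats and rate (not from the original answer) so that the source df is not overwritten:

```python
import pandas as pd

# Sample data reconstructed from the six rows shown in the question.
df = pd.DataFrame({
    "Grade":    [0, 1, 2, 2, 5, 2],
    "Criteria": [1, 0, 1, 0, 1, 1],
})

# Per-Grade number of 1s (sum) and number of rows (size).
# The summary goes into its own variable, so df stays intact.
stats = df.groupby("Grade")["Criteria"].agg(["sum", "size"])
stats["rate"] = stats["sum"] / stats["size"]
print(stats)
```

Keeping the summary separate is only a convenience: it lets you try the other variants in this thread against the same df.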
Or use a custom function (pick one of the two variants):
# does not exclude NaNs: len(x) counts every row in the group
df = df.groupby('Grade')['Criteria'].agg(lambda x: x.sum() / len(x)).reset_index(name='new')
# excludes possible NaNs: x.count() counts only non-missing values
df = df.groupby('Grade')['Criteria'].agg(lambda x: x.sum() / x.count()).reset_index(name='new')
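The difference between the two variants only shows up when Criteria has missing values. A small sketch with made-up data (the frame df_nan and the result names are hypothetical, chosen just for this illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: Grade 2 has three rows, one with a missing Criteria.
df_nan = pd.DataFrame({
    "Grade":    [2, 2, 2],
    "Criteria": [1, np.nan, 0],
})

grouped = df_nan.groupby("Grade")["Criteria"]

# len(x) counts all three rows, so the rate is 1 / 3.
rate_all_rows = grouped.agg(lambda x: x.sum() / len(x))

# x.count() skips the NaN row, so the rate is 1 / 2.
rate_non_null = grouped.agg(lambda x: x.sum() / x.count())

print(rate_all_rows)
print(rate_non_null)
```

Which one you want depends on whether a missing Criteria should count as "not 1" or be left out of the denominator entirely.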
For yes/no values, work with a boolean mask; True values are processed like 1s:
print (df)
Grade Criteria
0 0 yes
1 1 no
2 2 yes
3 2 no
4 5 yes
5 2 yes
df = (df['Criteria'] == 'yes').groupby(df['Grade']).agg(lambda x: x.sum() / len(x)).reset_index(name='new')
print (df)
Grade new
0 0 1.000000
1 1 0.000000
2 2 0.666667
3 5 1.000000
If Criteria is 1 or 0, or even True or False, you can use mean.
With groupby:
df.groupby('Grade').mean()
Criteria
Grade
0 1.000000
1 0.000000
2 0.666667
5 1.000000
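With set_index and mean, grouping on the index level (the level argument of mean was removed in pandas 2.0, so group on the level explicitly):
df.set_index('Grade').groupby(level=0).mean()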
Criteria
Grade
0 1.000000
1 0.000000
2 0.666667
5 1.000000
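As a rough sketch of why mean works here: the mean of a 0/1 (or boolean) column is the share of 1s. The example below assumes the six sample rows from the question, selects the Criteria column explicitly so any other columns are ignored, and groups on the index level instead of relying on mean(level=0), which newer pandas releases no longer accept. The variable names are mine, not from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Grade":    [0, 1, 2, 2, 5, 2],
    "Criteria": [1, 0, 1, 0, 1, 1],
})

# Selecting the column first returns a Series, whatever else is in df.
rate = df.groupby("Grade")["Criteria"].mean()

# Same idea when Criteria holds booleans instead of 0/1.
rate_bool = df["Criteria"].astype(bool).groupby(df["Grade"]).mean()

# Grouping on the index level after set_index.
rate_level = df.set_index("Grade").groupby(level=0)["Criteria"].mean()

print(rate)
```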
In the case that the 'Criteria' values are 'yes' and 'no' strings:
df
Grade Criteria
0 0 yes
1 1 no
2 2 yes
3 2 no
4 5 yes
5 2 yes
You can group the result of the boolean comparison and take its mean:
df.Criteria.eq('yes').groupby(df.Grade).mean()
Grade
0 1.000000
1 0.000000
2 0.666667
5 1.000000
Name: Criteria, dtype: float64
Use reset_index on any of these results to get the desired DataFrame.
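For instance, a minimal sketch for the yes/no case, again assuming the six sample rows from the question. The column name "Criteria rate" is my choice, taken from the wording of the question rather than from any of the answers:

```python
import pandas as pd

df = pd.DataFrame({
    "Grade":    [0, 1, 2, 2, 5, 2],
    "Criteria": ["yes", "no", "yes", "no", "yes", "yes"],
})

result = (
    df["Criteria"].eq("yes")            # boolean Series: True where "yes"
      .groupby(df["Grade"])             # group the booleans by Grade
      .mean()                           # share of True values per Grade
      .reset_index(name="Criteria rate")  # Series -> two-column DataFrame
)
print(result)
```

Passing name= to reset_index names the value column, and Grade moves from the index into a regular column.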
and add answer for yes and no values.
– jezrael
Sep 5 '18 at 16:27
@Emilkindt - If my or another answer was helpful, don't forget to accept it: click the check mark beside the answer to toggle it from greyed out to filled in. Thanks.
– jezrael
Sep 13 '18 at 14:16