Pandas - Binary Matrix and String Data (how to stack string data / new line?)
Pandas - Binary Matrix and String Data (how to stack string data / new line?)
I am trying to generate a particularly structured dataframe, but i cant seem to "stack" the data. My sample raw data:
# raw data
df = pd.DataFrame('Name':['name1', 'name2', 'name3', 'name1', 'name2', 'name3', 'name1', 'name2', 'name3' ],
'Year':['freshman','sophomore','freshman', 'freshman','sophomore','freshman', 'freshman','sophomore','freshman'],
'Rotation':['ERJD','PEDI','MAM','PEDI', 'ERJD','PEDI','MAM','ERJD','ABD'],
'Week1':[1,1,1,0,0,0,0,0,0],
'Week2':[0,0,0,1,1,1,0,0,0],
'Week3':[0,0,0,0,0,0,1,1,1],
'Week4':[1,0,0,0,0,0,0,1,1]
)
df = df[['Name','Year','Rotation','Week1','Week2','Week3','Week4']]
Which looks like:
Name Year Rotation Week1 Week2 Week3 Week4
0 name1 freshman ERJD 1 0 0 1
1 name2 sophomore PEDI 1 0 0 0
2 name3 freshman MAM 1 0 0 0
3 name1 freshman PEDI 0 1 0 0
4 name2 sophomore ERJD 0 1 0 0
5 name3 freshman PEDI 0 1 0 0
6 name1 freshman MAM 0 0 1 0
7 name2 sophomore ERJD 0 0 1 1
8 name3 freshman ABD 0 0 1 1
I reshape the dataframe:
#Reshape Table + Filtering
df = pd.melt(df,
id_vars=['Name','Year','Rotation'],
value_vars=list(df.columns[3:]),
var_name='Week',
value_name='Sum of Value')
df = df.loc[df['Sum of Value'] == 1].reset_index()
df.pop('index')
Which Generates:
Name Year Rotation Week Sum of Value
0 name1 freshman ERJD Week1 1
1 name2 sophomore PEDI Week1 1
2 name3 freshman MAM Week1 1
3 name1 freshman PEDI Week2 1
4 name2 sophomore ERJD Week2 1
5 name3 freshman PEDI Week2 1
6 name1 freshman MAM Week3 1
7 name2 sophomore ERJD Week3 1
8 name3 freshman ABD Week3 1
9 name1 freshman ERJD Week4 1
10 name2 sophomore ERJD Week4 1
11 name3 freshman ABD Week4 1
I create a pivot table:
#Create Pivot
pivot = df.pivot_table(index=['Rotation','Year'], columns='Week', values='Name', aggfunc=lambda x: ' '.join(x))
pivot = pivot.reindex(weeks, axis=1) # Change order of Columns
pivot
Which generates:
Week1 Week2 Week3 Week4
Rotation Year
ABD freshman None None name3 name3
ERJD freshman name1 None None name1
sophomore None name2 name2 name2
MAM freshman name3 None name1 None
PEDI freshman None name1 name3 None None
sophomore name2 None None None
I would like to stack the names within the table, for instance Week2 PEDI has name1
and name3
side by side. How can i put the names on different lines? Is there a better way to go about this instead of using the pivot table? Is the pd.melt
step even necessary?
name1
name3
pd.melt
Desired Structure:
Week1 Week2 Week3 Week4
Rotation Year
ABD freshman None None name3 name3
ERJD freshman name1 None None name1
sophomore None name2 name2 name2
MAM freshman name3 None name1 None
PEDI freshman None name1 None None
name3
sophomore name2 None None None
Thanks in advance for the help!
SOLUTION:
After pd.melt
, do the following:
pd.melt
df['aggval'] = df['Week'].map(str) + df['Rotation']
df['aggval'] = df.groupby(['aggval']).cumcount()+1
pivot = df.pivot_table(index=['Rotation','aggval'], columns='Week', values='Name', aggfunc=lambda x: ' '.join(x)).fillna('')
pivot = pivot.reindex(weeks, axis=1)
@rahlf23 Right now this is all testing in a jupyter notebook. The dataframe generated after pd.melt will be the results that are stored in the database. The customer is requesting the results to be shared in the desired structure somehow, if possible. I just don't know if it is.
– Steve
Aug 20 at 21:22
My point here being that you are trying to add a nested index level here by having the names take up multiple rows. I doubt that querying this information in this manner is what you want.
– rahlf23
Aug 20 at 21:49
@rahlf23 it isnt what i want, its specifically what is being requested. Is it possible?
– Steve
Aug 20 at 22:02
@rahlf23 would it make a difference if index 'Year' was removed from the table, leaving only 'Rotation' in index ?
– Steve
Aug 20 at 22:25
2 Answers
2
You could loop through the weeks that you are interested in and conditionally populate your dataframe, like so:
for week in ['Week1','Week2','Week3','Week4']:
df[week] = np.where(df[week]==1, df['Name'], df[week])
This gives:
Name Year Rotation Week1 Week2 Week3 Week4
0 name1 freshman ERJD name1 0 0 name1
1 name2 sophmore PEDI name2 0 0 0
2 name3 freshman MAM name3 0 0 0
3 name1 freshman PEDI 0 name1 0 0
4 name2 sophmore ERJD 0 name2 0 0
5 name3 freshman PEDI 0 name3 0 0
6 name1 freshman MAM 0 0 name1 0
7 name2 sophmore ERJD 0 0 name2 name2
8 name3 freshman ABD 0 0 name3 name3
Then you can group your dataframe and store the string-type entries in lists:
grouped = df.drop('Name', axis=1).groupby(['Rotation','Year']).agg(lambda x: [i for i in x if type(i)==str])
Which gives:
Week1 Week2 Week3 Week4
Rotation Year
ABD freshman [name3] [name3]
ERJD freshman [name1] [name1]
sophmore [name2] [name2] [name2]
MAM freshman [name3] [name1]
PEDI freshman [name1, name3]
sophmore [name2]
Note that there is an error in your desired output in your OP. There is no ('MAM','sophmore')
group. Also note that 'sophmore'
is spelled 'sophomore'
, just for clarity.
('MAM','sophmore')
'sophmore'
'sophomore'
thanks for providing a different approach and catching my spelling error. I made updates to the post. Your proposed solution didnt stack the names. I was able to do so, providing the solution above.
– Steve
Aug 21 at 19:11
You bet. If this solution helped you I would appreciate the upvote / check mark. I would also be interested in seeing your final solution.
– rahlf23
Aug 21 at 19:28
Will do when i get enough points to do so. I apparently don't have enough points to upvote yet...
– Steve
Aug 21 at 19:43
@Steve your solution looks good. I think it would be worth posting as an answer.
– user3483203
Aug 21 at 19:54
You can do this using set_index
and mul
:
set_index
mul
df1 = df.set_index(['Rotation','Year'])
df1.filter(like='Week').mul(df1['Name'], axis=0)
.replace('',np.nan)
.sort_index()
Output:
Week1 Week2 Week3 Week4
Rotation Year
ABD freshman NaN NaN name3 name3
ERJD freshman name1 NaN NaN name1
sophomore NaN name2 NaN NaN
sophomore NaN NaN name2 name2
MAM freshman name3 NaN NaN NaN
freshman NaN NaN name1 NaN
PEDI freshman NaN name1 NaN NaN
freshman NaN name3 NaN NaN
sophomore name2 NaN NaN NaN
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
How would you expect that output to be queried?
– rahlf23
Aug 20 at 20:57