Combine Pandas columns containing list objects




My question is similar to 'Combine pandas string columns with missing values' and 'How to merge/combine columns in pandas?', but because the cells I want to combine contain lists, none of the given answers work.



Simplified, my df looks like this:


players players1 players2 players3
1 ['1','2']
2 ['2','4']
3 ['1','4']
4 ['1','5']
5 ['3','5']
6
7 ['3','4']



(So I know there will never be values in more than one of the columns for any given row.)



Now I want to add a new column combining the lists:


players players1 players2 players3 players_combine
1 ['1','2'] ['1','2']
2 ['2','4'] ['2','4']
3 ['1','4'] ['1','4']
4 ['1','5'] ['1','5']
5 ['3','5'] ['3','5']
6
7 ['3','4'] ['3','4']



I tried many things, mainly variations of the linked answers; my last idea was a cascading np.where. But it didn't work: only the values from 'players' showed up in the combined column.


df['players_combine'] = np.where(df.players.notnull(), df.players.values,
                         np.where(df.players1.notnull(), df.players1.values,
                          np.where(df.players2.notnull(), df.players2.values,
                           np.where(df.players3.notnull(), df.players3.values, np.nan))))



EDIT:
As asked in the comments: df.head(5).to_dict()


{'players': {'5b41800eaffb061b88c4beac': ['57005', '124021', '132037', '78523', '111742', '133892', '76431', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '296770'],
             '5b41800eaffb061b88c4bead': ['18929', '110183', '28401', '302853', '296768', '94912', '93671', '52060', '43282', '132364', '140646', '77861', '19787', '133790', '312666', '76336', '317219', '137849'],
             '5b41800daffb061b88c4bc7f': 'nan',
             '5b41800eaffb061b88c4bd62': 'nan',
             '5b41800eaffb061b88c4bd65': 'nan'},
 'players1': {'5b41800eaffb061b88c4beac': nan,
              '5b41800eaffb061b88c4bead': nan,
              '5b41800daffb061b88c4bc7f': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '76431'],
              '5b41800eaffb061b88c4bd62': '',
              '5b41800eaffb061b88c4bd65': ''},
 'players2': {'5b41800eaffb061b88c4beac': nan,
              '5b41800eaffb061b88c4bead': nan,
              '5b41800daffb061b88c4bc7f': nan,
              '5b41800eaffb061b88c4bd62': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '78066', '127464', '76431'],
              '5b41800eaffb061b88c4bd65': ''},
 'players3': {'5b41800eaffb061b88c4beac': nan,
              '5b41800eaffb061b88c4bead': nan,
              '5b41800daffb061b88c4bc7f': nan,
              '5b41800eaffb061b88c4bd62': nan,
              '5b41800eaffb061b88c4bd65': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '247798', '69756', '133745', '278877', '77857', '78066', '127464', '76431']}}
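A quick way to see what is actually in those cells is to rebuild the frame from this dict and inspect the cell types. This is only a sketch (assuming d holds the dict printed above); it makes the mix of real np.nan, 'nan' strings and empty strings visible:

import pandas as pd

# d is assumed to be the df.head(5).to_dict() output shown above
df_head = pd.DataFrame(d)

# applymap applies a function to every cell; printing the types shows
# which cells hold lists, which hold float NaN and which hold strings
print(df_head.applymap(type))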






can you post df.to_dict()?

– RafaelC
Sep 16 '18 at 20:08








You cannot have "no values" in Pandas cells. What exactly do you have in them? Empty strings, empty lists, nans?

– DYZ
Sep 16 '18 at 20:08






What's the empty value in those other columns? Is it an empty string, or NaN, or an empty list?

– ALollz
Sep 16 '18 at 20:08






if you post d = df.to_dict(), we can simply do pd.DataFrame(d) and reproduce the data frame exactly as is :)

– RafaelC
Sep 16 '18 at 20:12








@J_Scholz then just do df.head(5).to_dict() ;)

– RafaelC
Sep 16 '18 at 20:15






4 Answers



Since you know you will only have a value in at most a single column for each row, you can replace the values you don't care about with NaN and then use .stack (which drops NaN cells by default, leaving exactly that one value per row).





In this case it looks like you have both 'nan' and '' strings which should be replaced with np.NaN first.




import numpy as np

df['players_combine'] = df.replace({'': np.NaN, 'nan': np.NaN}, regex=True).stack().reset_index(level=1, drop=True)


Sample data used above:

import pandas as pd
df = pd.DataFrame({'players':  [['1','2'], '', '', np.NaN, ''],
                   'players1': ['', ['2','4'], '', np.NaN, ''],
                   'players2': ['', '', ['1','5'], np.NaN, ''],
                   'players3': ['', '', np.NaN, ['3', '5'], '']})


  players players1 players2 players3 players_combine
0  [1, 2]                                     [1, 2]
1           [2, 4]                            [2, 4]
2                    [1, 5]      NaN          [1, 5]
3     NaN      NaN      NaN   [3, 5]          [3, 5]
4                                                NaN
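For readers who want to run this top to bottom, here is a minimal end-to-end sketch of the same approach using the sample frame above. Note that row 4, which has no value at all, is dropped by .stack() and simply ends up as NaN in the new column thanks to index alignment:

import numpy as np
import pandas as pd

# Sample frame: at most one list per row, the rest '' or NaN
df = pd.DataFrame({'players':  [['1', '2'], '', '', np.nan, ''],
                   'players1': ['', ['2', '4'], '', np.nan, ''],
                   'players2': ['', '', ['1', '5'], np.nan, ''],
                   'players3': ['', '', np.nan, ['3', '5'], '']})

# Normalise the placeholder values to NaN, then stack: NaN cells are
# dropped, leaving one value per row, indexed by (row, column)
combined = df.replace({'': np.nan, 'nan': np.nan}).stack()

# Drop the column level so the result aligns on the row index again;
# rows without any value (row 4 here) become NaN on assignment
df['players_combine'] = combined.reset_index(level=1, drop=True)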



If any of the empty cells is an empty string (''), first make it a NaN:




df[df==''] = np.nan



Then, select the maximum of all non-NaNs in each row:


df.apply(lambda x: x[x.notnull()].max(), axis=1)
#1 [1, 2]
#2 [2, 4]
#3 [1, 4]
#4 [1, 5]
#5 [3, 5]
#6 NaN
#7 [3, 4]



Another interesting (and faster) solution is to eliminate the all-NaN rows and then find the first valid value in each row:


df.loc[df.notnull().any(axis=1)] \
  .apply(lambda x: x[x.first_valid_index()], axis=1)
#1 [1, 2]
#2 [2, 4]
#3 [1, 4]
#4 [1, 5]
#5 [3, 5]
#7 [3, 4]
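One practical note (my addition, not part of the original answer): the result can still be assigned straight back to a new column even though the all-NaN rows were filtered out, because pandas aligns on the index when assigning:

# Rows filtered out above (all-NaN, like row 6 in the question)
# are simply left as NaN in the new column via index alignment
df['players_combine'] = (
    df.loc[df.notnull().any(axis=1)]
      .apply(lambda x: x[x.first_valid_index()], axis=1)
)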






Thank you, the np.where solution is even faster than this. See my answer.

– J_Scholz
Sep 19 '18 at 20:14



Analyzing your df.to_dict(), it seems like for some cells you have the string 'nan', for others you have the actual np.nan, and for some others you even have empty strings ''.





So first clean your data set by making null values uniform:


df = df.replace({'nan': np.nan, '': np.nan})



Then you can aggregate along axis=1:




df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)
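A slightly more defensive variant (my sketch, not from the answer): s[~s.isnull()][0] relies on indexing fallback and raises on rows that are entirely empty (like row 6 in the simplified example), so under the assumption that such rows may exist, dropna/iloc with a guard may be safer:

import numpy as np

# Keep the first non-null value per row, or NaN if the whole row is empty
df['players_combine'] = df.agg(
    lambda s: s.dropna().iloc[0] if s.notnull().any() else np.nan,
    axis=1)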






So it looks like the 'nan' strings were my problem all along. I took them for real np.nan. Now my np.where line works.

– J_Scholz
Sep 17 '18 at 21:37



As badly formatted data was the problem in my code, not the question as asked, I wanted to add some information about the performance of the different options. I went with np.where because it is vectorized, whereas df.apply iterates over rows, so there is a huge performance difference.



Setting up test-df with 4*2000=8000 rows:


import pandas as pd
import numpy as np

l = [[['1','2'],np.NaN,np.NaN,np.NaN],
[np.NaN,['2','3'],np.NaN,np.NaN],
[np.NaN,np.NaN,['3','4'],np.NaN],
[np.NaN,np.NaN,np.NaN,['4','5']]]

l=l*2000

df = pd.DataFrame(l)

df.columns = ['players','players1','players2','players3']



Best option:


%timeit df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))

100 loops, best of 3: 2.18 ms per loop



Very good option:


%timeit df.loc[df.notnull().any(axis=1)].apply(lambda x: x[x.first_valid_index()], axis=1)

100 loops, best of 3: 413 ms per loop



Other option 1:


%timeit df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)

1 loop, best of 3: 4.71 s per loop



And option 2:


%timeit df['players_combine'] = df.apply(lambda x: x[x.notnull()].max(), axis=1)

1 loop, best of 3: 4.86 s per loop


