Combine Pandas columns containing list objects
My question is similar to 'Combine pandas string columns with missing values' and 'How to merge/combine columns in pandas?', but since the cells I want to combine contain lists, none of the given answers work.
Simplified, my df looks like this:
players players1 players2 players3
1 ['1','2']
2 ['2','4']
3 ['1','4']
4 ['1','5']
5 ['3','5']
6
7 ['3','4']
(So I know there will never be values in two of the columns at the same time.)
Now I want to add a new column combining the lists:
players players1 players2 players3 players_combine
1 ['1','2'] ['1','2']
2 ['2','4'] ['2','4']
3 ['1','4'] ['1','4']
4 ['1','5'] ['1','5']
5 ['3','5'] ['3','5']
6
7 ['3','4'] ['3','4']
I tried many things, mainly variations of the linked answers. My last idea was a cascading np.where, but it didn't work: only the values from 'players' showed up in the combined column.
df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))
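One thing worth checking with a cascade like this: pandas only treats a real NaN as null, so cells holding the string 'nan' or the empty string '' pass notnull() and the first np.where branch wins. A minimal sketch of that behaviour:

```python
import numpy as np
import pandas as pd

# Only a real NaN is null to pandas; the strings 'nan' and '' are not.
s = pd.Series(['nan', '', np.nan])
print(s.notnull().tolist())  # [True, True, False]
```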
EDIT:
As asked in the comments: df.head(5).to_dict()
{'players': {'5b41800eaffb061b88c4beac': ['57005', '124021', '132037', '78523', '111742', '133892', '76431', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '296770'],
             '5b41800eaffb061b88c4bead': ['18929', '110183', '28401', '302853', '296768', '94912', '93671', '52060', '43282', '132364', '140646', '77861', '19787', '133790', '312666', '76336', '317219', '137849'],
             '5b41800daffb061b88c4bc7f': 'nan',
             '5b41800eaffb061b88c4bd62': 'nan',
             '5b41800eaffb061b88c4bd65': 'nan'},
 'players1': {'5b41800eaffb061b88c4beac': nan,
              '5b41800eaffb061b88c4bead': nan,
              '5b41800daffb061b88c4bc7f': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '76431'],
              '5b41800eaffb061b88c4bd62': '',
              '5b41800eaffb061b88c4bd65': ''},
 'players2': {'5b41800eaffb061b88c4beac': nan,
              '5b41800eaffb061b88c4bead': nan,
              '5b41800daffb061b88c4bc7f': nan,
              '5b41800eaffb061b88c4bd62': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '78066', '127464', '76431'],
              '5b41800eaffb061b88c4bd65': ''},
 'players3': {'5b41800eaffb061b88c4beac': nan,
              '5b41800eaffb061b88c4bead': nan,
              '5b41800daffb061b88c4bc7f': nan,
              '5b41800eaffb061b88c4bd62': nan,
              '5b41800eaffb061b88c4bd65': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '247798', '69756', '133745', '278877', '77857', '78066', '127464', '76431']}}
You cannot have "no values" in Pandas cells. What exactly do you have in them? Empty strings, empty lists, nans?
– DYZ
Sep 16 '18 at 20:08
What's the empty value in those other columns? Is it an empty string, or NaN, or an empty list?
– ALollz
Sep 16 '18 at 20:08
if you post d = df.to_dict(), we can simply do pd.DataFrame(d) and reproduce the data frame exactly as is :)
– RafaelC
Sep 16 '18 at 20:12
@J_Scholz then just do df.head(5).to_dict() ;)
– RafaelC
Sep 16 '18 at 20:15
4 Answers
Since you know you will have a value in at most a single column for each row, you can replace the values you don't care about with NaN and then use .stack.
In this case it looks like you have both 'nan' and '' strings, which should be replaced with np.NaN first.
import numpy as np
import pandas as pd

df = pd.DataFrame({'players': [['1','2'], '', '', np.NaN, ''],
                   'players1': ['', ['2','4'], '', np.NaN, ''],
                   'players2': ['', '', ['1','5'], np.NaN, ''],
                   'players3': ['', '', np.NaN, ['3','5'], '']})

df['players_combine'] = df.replace({'': np.NaN, 'nan': np.NaN}).stack().reset_index(level=1, drop=True)
  players players1 players2 players3 players_combine
0  [1, 2]                                     [1, 2]
1           [2, 4]                            [2, 4]
2                    [1, 5]      NaN          [1, 5]
3     NaN      NaN      NaN   [3, 5]          [3, 5]
4                                                NaN
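Putting the whole approach together as one runnable sketch (the dropna() is a small defensive addition for pandas versions in which stack() keeps NaN rows):

```python
import numpy as np
import pandas as pd

# Toy frame with the three flavours of "empty" from the question.
df = pd.DataFrame({'players':  [['1', '2'], '', 'nan', np.nan],
                   'players1': ['', ['2', '4'], np.nan, ''],
                   'players2': ['nan', '', ['1', '5'], ''],
                   'players3': [np.nan, 'nan', '', ['3', '5']]})

# Normalise every placeholder to a real NaN, then stack: with at most
# one value per row, stacking leaves exactly that value.
cleaned = df.replace({'': np.nan, 'nan': np.nan})
df['players_combine'] = cleaned.stack().dropna().reset_index(level=1, drop=True)
print(df['players_combine'].tolist())  # [['1', '2'], ['2', '4'], ['1', '5'], ['3', '5']]
```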
If any of the empty cells is an empty string (''), first make it a NaN:
df[df==''] = np.nan
Then, select the maximum of all non-NaNs in each row:
df.apply(lambda x: x[x.notnull()].max(), axis=1)
#1 [1, 2]
#2 [2, 4]
#3 [1, 4]
#4 [1, 5]
#5 [3, 5]
#6 NaN
#7 [3, 4]
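As a self-contained sketch (assuming real NaN placeholders), the row-wise max simply picks the single non-null cell:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'players':  [['1', '2'], np.nan],
                   'players1': [np.nan, ['2', '4']]})

# Each row has exactly one non-null value, so max() over the
# remaining cells just returns it.
combined = df.apply(lambda x: x[x.notnull()].max(), axis=1)
print(combined.tolist())  # [['1', '2'], ['2', '4']]
```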
Another interesting (and faster) solution is to eliminate the all-NaN rows and then find the first valid value in each row:
df.loc[df.notnull().any(axis=1)].apply(lambda x: x[x.first_valid_index()], axis=1)
#1 [1, 2]
#2 [2, 4]
#3 [1, 4]
#4 [1, 5]
#5 [3, 5]
#7 [3, 4]
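A self-contained version of this, again assuming real NaN in the empty cells:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'players':  [['1', '2'], np.nan, np.nan],
                   'players1': [np.nan, ['2', '4'], np.nan],
                   'players2': [np.nan, np.nan, np.nan]})

# Drop rows where every column is NaN, then pick each remaining
# row's first valid value by its column label.
valid = df.loc[df.notnull().any(axis=1)]
combined = valid.apply(lambda x: x[x.first_valid_index()], axis=1)
print(combined.tolist())  # [['1', '2'], ['2', '4']]
```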
Thank you, the np.where solution is even faster than this. See my answer.
– J_Scholz
Sep 19 '18 at 20:14
Analyzing your df.to_dict(), it seems like, for some cells, you have the string 'nan', for others you have the actual np.nan, and, for some others, you even have empty strings ''.
So first clean your data set by making the null values uniform:
df = df.replace({'nan': np.nan, '': np.nan})
Then you can aggregate through axis=1:
df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)
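A self-contained sketch of the cleanup plus aggregation, spelled with .iloc[0] since positional access via s[...][0] on a label-indexed Series is deprecated in recent pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'players':  [['1', '2'], 'nan', ''],
                   'players1': ['', ['2', '4'], np.nan],
                   'players2': ['nan', '', ['1', '4']]})

# Make the null values uniform, then keep each row's single non-null cell.
df = df.replace({'nan': np.nan, '': np.nan})
df['players_combine'] = df.agg(lambda s: s[~s.isnull()].iloc[0], axis=1)
print(df['players_combine'].tolist())  # [['1', '2'], ['2', '4'], ['1', '4']]
```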
So it looks like the 'nan' strings were my problem all the time. I took them for real np.nan values. Now my np.where line works.
– J_Scholz
Sep 17 '18 at 21:37
As badly formatted data was the problem in my code, and not the approach itself, I wanted to add some information about the performance of all the different options. I used np.where because it uses vectorization, whereas apply iterates through the rows, which makes for a huge performance difference.
Setting up test-df with 4*2000=8000 rows:
import pandas as pd
import numpy as np
l = [[['1','2'],np.NaN,np.NaN,np.NaN],
[np.NaN,['2','3'],np.NaN,np.NaN],
[np.NaN,np.NaN,['3','4'],np.NaN],
[np.NaN,np.NaN,np.NaN,['4','5']]]
l=l*2000
df = pd.DataFrame(l)
df.columns = ['players','players1','players2','players3']
Best option:
%timeit df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))
100 loops, best of 3: 2.18 ms per loop
Very good option:
%timeit df.loc[df.notnull().any(axis=1)].apply(lambda x: x[x.first_valid_index()], axis=1)
100 loops, best of 3: 413 ms per loop
Other option 1:
%timeit df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)
1 loop, best of 3: 4.71 s per loop
And 2:
%timeit df['players_combine'] = df.apply(lambda x: x[x.notnull()].max(), axis=1)
1 loop, best of 3: 4.86 s per loop
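To reproduce timings like these outside IPython, the standard-library timeit module works as well; a sketch timing the np.where cascade on the same 8000-row frame:

```python
import timeit
import numpy as np
import pandas as pd

l = [[['1', '2'], np.nan, np.nan, np.nan],
     [np.nan, ['2', '3'], np.nan, np.nan],
     [np.nan, np.nan, ['3', '4'], np.nan],
     [np.nan, np.nan, np.nan, ['4', '5']]] * 2000
df = pd.DataFrame(l, columns=['players', 'players1', 'players2', 'players3'])

def combine_np_where():
    # Cascading np.where: take the first non-null column's value per row.
    return np.where(df.players.notnull(), df.players.values,
           np.where(df.players1.notnull(), df.players1.values,
           np.where(df.players2.notnull(), df.players2.values,
           np.where(df.players3.notnull(), df.players3.values, np.nan))))

print(timeit.timeit(combine_np_where, number=100))  # total seconds for 100 runs
```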
can you post df.to_dict()?
– RafaelC
Sep 16 '18 at 20:08