Combine Pandas columns containing list objects
My question is similar to 'Combine pandas string columns with missing values' and 'How to merge/combine columns in pandas?', but since the cells I want to combine contain lists, none of the given answers work.
Simplified, my df looks like this:
players players1 players2 players3
1 ['1','2']
2 ['2','4']
3 ['1','4']
4 ['1','5']
5 ['3','5']
6
7 ['3','4']
(So I know there will never be values in two of the columns at the same time.)
Now I want to add a new column combining the lists:
players players1 players2 players3 players_combine
1 ['1','2'] ['1','2']
2 ['2','4'] ['2','4']
3 ['1','4'] ['1','4']
4 ['1','5'] ['1','5']
5 ['3','5'] ['3','5']
6
7 ['3','4'] ['3','4']
I tried many things, mainly variations of the linked answers. My last idea was a cascading np.where, but it didn't work: only the values from 'players' showed up in the combined column.
df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))
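One thing worth checking with a cascade like this: pandas only treats a real NaN as null, so cells holding the string 'nan' or the empty string '' pass notnull() and the first np.where branch wins. A minimal sketch of that behaviour:

```python
import numpy as np
import pandas as pd

# Only a real NaN is null to pandas; the strings 'nan' and '' are not.
s = pd.Series(['nan', '', np.nan])
print(s.notnull().tolist())  # [True, True, False]
```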
EDIT:
As asked in the comments: df.head(5).to_dict()
{'players': {'5b41800eaffb061b88c4beac': ['57005', '124021', '132037', '78523', '111742', '133892', '76431', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '296770'],
             '5b41800eaffb061b88c4bead': ['18929', '110183', '28401', '302853', '296768', '94912', '93671', '52060', '43282', '132364', '140646', '77861', '19787', '133790', '312666', '76336', '317219', '137849'],
             '5b41800daffb061b88c4bc7f': 'nan',
             '5b41800eaffb061b88c4bd62': 'nan',
             '5b41800eaffb061b88c4bd65': 'nan'},
 'players1': {'5b41800eaffb061b88c4beac': nan,
              '5b41800eaffb061b88c4bead': nan,
              '5b41800daffb061b88c4bc7f': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '76431'],
              '5b41800eaffb061b88c4bd62': '',
              '5b41800eaffb061b88c4bd65': ''},
 'players2': {'5b41800eaffb061b88c4beac': nan,
              '5b41800eaffb061b88c4bead': nan,
              '5b41800daffb061b88c4bc7f': nan,
              '5b41800eaffb061b88c4bd62': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '78066', '127464', '76431'],
              '5b41800eaffb061b88c4bd65': ''},
 'players3': {'5b41800eaffb061b88c4beac': nan,
              '5b41800eaffb061b88c4bead': nan,
              '5b41800daffb061b88c4bc7f': nan,
              '5b41800eaffb061b88c4bd62': nan,
              '5b41800eaffb061b88c4bd65': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '247798', '69756', '133745', '278877', '77857', '78066', '127464', '76431']}}
You cannot have "no values" in Pandas cells. What exactly do you have in them? Empty strings, empty lists, nans?
– DYZ
Sep 16 '18 at 20:08
What's the empty value in those other columns? Is it an empty string, or NaN, or an empty list?
– ALollz
Sep 16 '18 at 20:08
if you post d = df.to_dict(), we can simply do pd.DataFrame(d) and reproduce the data frame exactly as is :)
– RafaelC
Sep 16 '18 at 20:12
@J_Scholz then just do df.head(5).to_dict() ;)
– RafaelC
Sep 16 '18 at 20:15
4 Answers
Since you know you will have a value in at most a single column for each row, you can replace the values you don't care about with NaN and then use .stack.
In this case it looks like you have both 'nan' and '' strings, which should be replaced with np.NaN first.
import numpy as np
import pandas as pd

df = pd.DataFrame({'players': [['1','2'], '', '', np.NaN, ''],
                   'players1': ['', ['2','4'], '', np.NaN, ''],
                   'players2': ['', '', ['1','5'], np.NaN, ''],
                   'players3': ['', '', np.NaN, ['3','5'], '']})

df['players_combine'] = df.replace({'': np.NaN, 'nan': np.NaN}).stack().reset_index(level=1, drop=True)
  players players1 players2 players3 players_combine
0  [1, 2]                                     [1, 2]
1           [2, 4]                            [2, 4]
2                    [1, 5]      NaN          [1, 5]
3     NaN      NaN      NaN   [3, 5]          [3, 5]
4                                                NaN
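Putting the whole approach together as one runnable sketch (the dropna() is a small defensive addition for pandas versions in which stack() keeps NaN rows):

```python
import numpy as np
import pandas as pd

# Toy frame with the three flavours of "empty" from the question.
df = pd.DataFrame({'players':  [['1', '2'], '', 'nan', np.nan],
                   'players1': ['', ['2', '4'], np.nan, ''],
                   'players2': ['nan', '', ['1', '5'], ''],
                   'players3': [np.nan, 'nan', '', ['3', '5']]})

# Normalise every placeholder to a real NaN, then stack: with at most
# one value per row, stacking leaves exactly that value.
cleaned = df.replace({'': np.nan, 'nan': np.nan})
df['players_combine'] = cleaned.stack().dropna().reset_index(level=1, drop=True)
print(df['players_combine'].tolist())  # [['1', '2'], ['2', '4'], ['1', '5'], ['3', '5']]
```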
If any of the empty cells is an empty string (''), first make it a NaN:
df[df==''] = np.nan
Then, select the maximum of all non-NaNs in each row:
df.apply(lambda x: x[x.notnull()].max(), axis=1)
#1 [1, 2]
#2 [2, 4]
#3 [1, 4]
#4 [1, 5]
#5 [3, 5]
#6 NaN
#7 [3, 4]
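As a self-contained sketch (assuming real NaN placeholders), the row-wise max simply picks the single non-null cell:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'players':  [['1', '2'], np.nan],
                   'players1': [np.nan, ['2', '4']]})

# Each row has exactly one non-null value, so max() over the
# remaining cells just returns it.
combined = df.apply(lambda x: x[x.notnull()].max(), axis=1)
print(combined.tolist())  # [['1', '2'], ['2', '4']]
```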
Another interesting (and faster) solution is to eliminate the all-NaN rows and then find the first valid value in each row:
df.loc[df.notnull().any(axis=1)].apply(lambda x: x[x.first_valid_index()], axis=1)
#1 [1, 2]
#2 [2, 4]
#3 [1, 4]
#4 [1, 5]
#5 [3, 5]
#7 [3, 4]
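A self-contained version of this, again assuming real NaN in the empty cells:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'players':  [['1', '2'], np.nan, np.nan],
                   'players1': [np.nan, ['2', '4'], np.nan],
                   'players2': [np.nan, np.nan, np.nan]})

# Drop rows where every column is NaN, then pick each remaining
# row's first valid value by its column label.
valid = df.loc[df.notnull().any(axis=1)]
combined = valid.apply(lambda x: x[x.first_valid_index()], axis=1)
print(combined.tolist())  # [['1', '2'], ['2', '4']]
```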
Thank you, the np.where solution is even faster than this. See my answer.
– J_Scholz
Sep 19 '18 at 20:14
Analyzing your df.to_dict(), it seems like, for some cells, you have the string 'nan', for others you have the actual np.nan, and, for some others, you even have empty strings ''.
So first clean your data set by making the null values uniform:
df = df.replace({'nan': np.nan, '': np.nan})
Then you can aggregate through axis=1:
df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)
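A self-contained sketch of the cleanup plus aggregation, spelled with .iloc[0] since positional access via s[...][0] on a label-indexed Series is deprecated in recent pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'players':  [['1', '2'], 'nan', ''],
                   'players1': ['', ['2', '4'], np.nan],
                   'players2': ['nan', '', ['1', '4']]})

# Make the null values uniform, then keep each row's single non-null cell.
df = df.replace({'nan': np.nan, '': np.nan})
df['players_combine'] = df.agg(lambda s: s[~s.isnull()].iloc[0], axis=1)
print(df['players_combine'].tolist())  # [['1', '2'], ['2', '4'], ['1', '4']]
```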
So it looks like the 'nan' strings were my problem all the time. I took them for real np.nan values. Now my np.where line works.
– J_Scholz
Sep 17 '18 at 21:37
As badly formatted data was the problem in my code, and not the approach itself, I wanted to add some information about the performance of all the different options. I used np.where because it uses vectorization, whereas apply iterates through the rows, which makes for a huge performance difference.
Setting up test-df with 4*2000=8000 rows:
import pandas as pd
import numpy as np
l = [[['1','2'],np.NaN,np.NaN,np.NaN],
[np.NaN,['2','3'],np.NaN,np.NaN],
[np.NaN,np.NaN,['3','4'],np.NaN],
[np.NaN,np.NaN,np.NaN,['4','5']]]
l=l*2000
df = pd.DataFrame(l)
df.columns = ['players','players1','players2','players3']
Best option:
%timeit df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))
100 loops, best of 3: 2.18 ms per loop
Very good option:
%timeit df.loc[df.notnull().any(axis=1)].apply(lambda x: x[x.first_valid_index()], axis=1)
100 loops, best of 3: 413 ms per loop
Other option 1:
%timeit df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)
1 loop, best of 3: 4.71 s per loop
And 2:
%timeit df['players_combine'] = df.apply(lambda x: x[x.notnull()].max(), axis=1)
1 loop, best of 3: 4.86 s per loop
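To reproduce timings like these outside IPython, the standard-library timeit module works as well; a sketch timing the np.where cascade on the same 8000-row frame:

```python
import timeit
import numpy as np
import pandas as pd

l = [[['1', '2'], np.nan, np.nan, np.nan],
     [np.nan, ['2', '3'], np.nan, np.nan],
     [np.nan, np.nan, ['3', '4'], np.nan],
     [np.nan, np.nan, np.nan, ['4', '5']]] * 2000
df = pd.DataFrame(l, columns=['players', 'players1', 'players2', 'players3'])

def combine_np_where():
    # Cascading np.where: take the first non-null column's value per row.
    return np.where(df.players.notnull(), df.players.values,
           np.where(df.players1.notnull(), df.players1.values,
           np.where(df.players2.notnull(), df.players2.values,
           np.where(df.players3.notnull(), df.players3.values, np.nan))))

print(timeit.timeit(combine_np_where, number=100))  # total seconds for 100 runs
```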
can you post df.to_dict()?
– RafaelC
Sep 16 '18 at 20:08