Remove specific values in a dataframe based on whether or not they're in an array




I have a dataframe with two columns, one of which holds a comma-separated list of values:


A 1,2,3,4,6

B 1,5,6,7

C 1,3,2,8,9,7

D 1,3,6,8



I also have an array: [2,3,9]





I'd like to end up with the same dataframe, but with every value that is not in the array filtered out. E.g.:


A 2,3

B

C 3,2,9

D 3



Could anyone point me in the right direction? I've had a look around but have hit a bit of a wall.






What's a "comma separated list"? A string?

– Denziloe
Sep 17 '18 at 16:52




3 Answers



Setup


import re
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D'],
    'col2': ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8']
})

# Allowed values as strings, because col2 holds strings
good = [str(i) for i in [2,3,9]]



We can use a regular expression and re.findall to extract all acceptable values. We just have to assert that a match neither directly follows nor directly precedes a digit, so that we don't match digits in the middle of another number:


rgx = r'(?<!\d)({})(?!\d)'.format('|'.join(good))

df.assign(out=[','.join(re.findall(rgx, row)) for row in df.col2])




  col1         col2    out
0    A    1,2,3,4,6    2,3
1    B      1,5,6,7
2    C  1,3,2,8,9,7  3,2,9
3    D      1,3,6,8      3



Regex Explanation


(?<!        # Negative lookbehind
  \d        # Asserts the previous character is *not* a digit
)
(           # Matching group
  2|3|9     # Matches either 2 or 3 or 9
)
(?!         # Negative lookahead
  \d        # Asserts the following character is *not* a digit
)
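To see why the two lookarounds matter, here is a small check using only the standard re module. The sample string is made up for illustration (it is not from the question) and contains multi-digit values that a plain alternation would wrongly pick apart:

```python
import re

good = ['2', '3', '9']
rgx = r'(?<!\d)({})(?!\d)'.format('|'.join(good))
naive = '({})'.format('|'.join(good))  # same alternation, no lookarounds

sample = '12,2,23,9'  # hypothetical row with multi-digit values

with_lookarounds = re.findall(rgx, sample)
without_lookarounds = re.findall(naive, sample)

print(with_lookarounds)     # ['2', '9']
print(without_lookarounds)  # ['2', '2', '2', '3', '9']
```

Without the assertions, the 2 inside 12 and both digits of 23 are reported as matches.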



In my opinion, the apply method provides a very readable solution.


allowed = [2, 3, 9]
# Compare as strings, because splitting the column yields strings
allowed_string = [str(x) for x in allowed]
# Assumes the columns are labelled 0 and 1
df[1] = df[1].str.split(',')
df[1] = df[1].apply(lambda x: [y for y in x if y in allowed_string])



Output:


   0          1
0  A     [2, 3]
1  B         []
2  C  [3, 2, 9]
3  D        [3]
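The output above holds lists, whereas the question asks for comma-separated strings. A minimal sketch of the same split-and-filter idea that joins the result back into a string; the helper name filter_csv is hypothetical, and it is written as a plain function so it can be checked without a dataframe:

```python
def filter_csv(value, allowed):
    """Keep only the comma-separated items of `value` that appear in `allowed`."""
    allowed_strings = {str(x) for x in allowed}
    return ','.join(item for item in value.split(',') if item in allowed_strings)

# The four rows from the question
rows = ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8']
filtered = [filter_csv(row, [2, 3, 9]) for row in rows]
print(filtered)  # ['2,3', '', '3,2,9', '3']
```

Applied to the original, unsplit column, something like df[1].apply(filter_csv, allowed=[2, 3, 9]) should give the same strings, assuming the column still holds the raw text.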



You can filter the column values using the isin() method. See the example below.


import pandas as pd

data = {'A':[1,2,3,4,6],
        'B':[1,5,6,7],
        'C':[1,3,2,8,9,7],
        'D':[1,3,6,8]}

allow_list = [2,3,9] #list of allowed elements

df = pd.concat([pd.Series(val, name=key) for key, val in data.items()], axis=1)

df1=df[df[df.columns].isin(allow_list)] #provide list of allowed elements as parameter in isin method
df1.dropna(how='all',inplace=True) #remove rows which are all NaN
print(df1)



Output:


A C B D
1 2.0 3.0 NaN 3.0
2 3.0 2.0 NaN NaN
4 NaN 9.0 NaN NaN
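The example above starts from lists of integers rather than the two-column frame in the question. A sketch of how isin() might be applied to the original string column instead, assuming pandas 0.25 or later (for Series.explode) and the column names col1 and col2 from the first answer:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D'],
    'col2': ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8'],
})
allow_list = ['2', '3', '9']  # strings, to match the split values

# One row per value; the original row index is repeated for each value
values = df['col2'].str.split(',').explode()
kept = values[values.isin(allow_list)]

# Rejoin per original row; rows with no surviving values become ''
df['out'] = kept.groupby(level=0).agg(','.join).reindex(df.index, fill_value='')
print(df)
```

This keeps one row per label, so row B ends up with an empty string rather than being dropped.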


