Remove specific values in a dataframe based on whether or not they're in an array




I have a dataframe with 2 columns, one of which is a comma separated list of values:


A 1,2,3,4,6

B 1,5,6,7

C 1,3,2,8,9,7

D 1,3,6,8



I also have an array: [2,3,9]





And I'd like to end up with the same dataframe, transformed so that values not in the array are filtered out, e.g.:


A 2,3

B

C 3,2,9

D 3



Could anyone point me in the right direction? I've had a look around but have hit a bit of a wall.






What's a "comma separated list"? A string?

– Denziloe
Sep 17 '18 at 16:52




3 Answers



Setup


import re
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D'],
    'col2': ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8']
})

# Allowed values as strings, since col2 holds strings
good = [str(i) for i in [2,3,9]]



We can use a regular expression and re.findall to extract all acceptable values. We just have to assert that a match doesn't directly follow or precede a digit, so that we don't match digits in the middle of another number:




rgx = r'(?<!\d)({})(?!\d)'.format('|'.join(good))

df.assign(out=[','.join(re.findall(rgx, row)) for row in df.col2])




  col1         col2    out
0    A    1,2,3,4,6    2,3
1    B      1,5,6,7
2    C  1,3,2,8,9,7  3,2,9
3    D      1,3,6,8      3



Regex Explanation


(?<!        # Negative lookbehind
  \d        # Asserts previous character is *not* a digit
)
(           # Matching group
  2|3|9     # Matches either 2 or 3 or 9
)
(?!         # Negative lookahead
  \d        # Asserts the following character is *not* a digit
)
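To see why the lookarounds matter, here is a small sketch that runs the pattern with and without them. The row string is a made-up example with multi-digit values (the question's data only has single digits), and re.escape is an extra precaution that the answer above does not use.

```python
import re

allowed = ["2", "3", "9"]  # the allowed values, as strings
alternation = "|".join(re.escape(v) for v in allowed)

# Without lookarounds, digits inside longer numbers also match.
naive = re.compile("({})".format(alternation))
# With lookarounds, only whole comma-separated values match.
guarded = re.compile(r"(?<!\d)({})(?!\d)".format(alternation))

row = "12,23,3,39,9"  # hypothetical row containing multi-digit values

naive_hits = naive.findall(row)
guarded_hits = guarded.findall(row)

print(naive_hits)    # ['2', '2', '3', '3', '3', '9', '9']
print(guarded_hits)  # ['3', '9']
```

The naive pattern picks up the 2 in 12 and both digits of 23 and 39, while the guarded pattern keeps only the standalone 3 and 9.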



The apply method provides a very readable solution, in my opinion.


allowed = [2, 3, 9]
allowed_string = [str(x) for x in allowed]
df[1] = df[1].str.split(',')
df[1] = df[1].apply(lambda x: [y for y in x if y in allowed_string])



Output:


   0          1
0  A     [2, 3]
1  B         []
2  C  [3, 2, 9]
3  D        [3]
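The snippet above leaves lists in the column, whereas the question asks for comma-separated strings. A possible variation, sketched here on the assumption that the dataframe has integer column labels 0 and 1 as in this answer, joins the kept values back together:

```python
import pandas as pd

# Same data as in the question, with integer column labels 0 and 1.
df = pd.DataFrame([
    ["A", "1,2,3,4,6"],
    ["B", "1,5,6,7"],
    ["C", "1,3,2,8,9,7"],
    ["D", "1,3,6,8"],
])

allowed = {str(x) for x in [2, 3, 9]}  # a set gives fast membership tests

# Split, keep only allowed values (original order preserved), and join back.
df[1] = df[1].apply(
    lambda s: ",".join(v for v in s.split(",") if v in allowed)
)

result = df[1].tolist()
print(result)  # ['2,3', '', '3,2,9', '3']
```

Row B ends up as an empty string because none of its values are in the allowed set.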



You can filter the column values using the isin() method. See the example below.


import pandas as pd

data = {'A':[1,2,3,4,6],
        'B':[1,5,6,7],
        'C':[1,3,2,8,9,7],
        'D':[1,3,6,8]}

allow_list = [2,3,9] #list of allowed elements

df = pd.concat([pd.Series(val, name=key) for key, val in data.items()], axis=1)

df1=df[df[df.columns].isin(allow_list)] #provide list of allowed elements as parameter in isin method
df1.dropna(how='all',inplace=True) #remove rows which are all NaN
print(df1)



Output:


     A    C   B    D
1  2.0  3.0 NaN  3.0
2  3.0  2.0 NaN  NaN
4  NaN  9.0 NaN  NaN
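The isin() example above works on a reshaped table with one column per label. If the data stays in the question's layout (one comma-separated string per row), something along these lines might work instead. This is only a sketch: it assumes pandas 0.25 or later for Series.explode, and the column names col1, col2 and out are borrowed from the first answer rather than from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["A", "B", "C", "D"],
    "col2": ["1,2,3,4,6", "1,5,6,7", "1,3,2,8,9,7", "1,3,6,8"],
})
allow_list = ["2", "3", "9"]  # strings, because col2 holds strings

# One row per value; explode keeps the original row index on each piece.
values = df["col2"].str.split(",").explode()

# Keep allowed values, then rebuild one string per original row.
kept = values[values.isin(allow_list)].groupby(level=0).agg(",".join)

# Rows with no allowed value vanish in the groupby; reindex restores them.
df["out"] = kept.reindex(df.index, fill_value="")

out = df["out"].tolist()
print(out)  # ['2,3', '', '3,2,9', '3']
```

The reindex step is what keeps row B in the result with an empty string instead of dropping it.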


