Remove specific values in a dataframe based on whether or not they're in an array
I have a dataframe with 2 columns, one of which is a comma separated list of values:
A 1,2,3,4,6
B 1,5,6,7
C 1,3,2,8,9,7
D 1,3,6,8
I also have an array: [2,3,9]
And I'd like to end up with the same dataframe, transformed in such a way that values not in the array are filtered out. Eg:
A 2,3
B
C 3,2,9
D 3
Could anyone point me in the right direction? I've had a look around but have hit a bit of a wall.
3 Answers
Setup
import re
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'C', 'D'],
    'col2': ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8']
})
good = [str(i) for i in [2,3,9]]
We can use a regular expression and re.findall to extract all acceptable values. We just have to assert that the matches don't directly follow or precede a digit, so that we don't match digits in the middle of another number:
rgx = r'(?<!\d)({})(?!\d)'.format('|'.join(good))
df.assign(out=[','.join(re.findall(rgx, row)) for row in df.col2])
col1 col2 out
0 A 1,2,3,4,6 2,3
1 B 1,5,6,7
2 C 1,3,2,8,9,7 3,2,9
3 D 1,3,6,8 3
Regex Explanation
(?<!        # Negative lookbehind
  \d        # Asserts previous character is *not* a digit
)
(           # Matching group
  2|3|9     # Matches either 2 or 3 or 9
)
(?!         # Negative lookahead
  \d        # Asserts the following character is *not* a digit
)
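To see why the lookarounds matter, here is a minimal sketch using only the standard library. The helper name build_pattern and the multi-digit sample row are my own additions for illustration, not part of the question's data, and the re.escape call is a defensive extra that the pattern above does not need for plain integers.

```python
import re

def build_pattern(allowed):
    """Build a regex that matches any allowed value as a whole number."""
    # re.escape is a defensive assumption; plain integers need no escaping
    alternatives = '|'.join(re.escape(str(v)) for v in allowed)
    # (?<!\d) and (?!\d) reject matches that touch another digit
    return re.compile(r'(?<!\d)(?:{})(?!\d)'.format(alternatives))

pattern = build_pattern([2, 3, 9])

# Hypothetical row containing multi-digit values
row = '12,2,23,3,39,9'

naive = re.findall('2|3|9', row)   # no lookarounds: matches inside 12, 23, 39
guarded = pattern.findall(row)     # lookarounds: whole numbers only

print(naive)    # ['2', '2', '2', '3', '3', '3', '9', '9']
print(guarded)  # ['2', '3', '9']
```

The non-capturing group (?:...) makes findall return the whole match, so the result can be joined with ','.join(...) exactly as in the answer above.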
The apply-method provides a very readable solution in my opinion.
allowed = [2, 3, 9]
allowed_string = [str(x) for x in allowed]
df[1] = df[1].str.split(',')
df[1] = df[1].apply(lambda x: [y for y in x if y in allowed_string])
Output:
0 1
0 A [2, 3]
1 B []
2 C [3, 2, 9]
3 D [3]
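Note that this leaves lists in the column, while the question asks for comma-separated strings. One way to get back to strings is sketched below as a plain function, assuming the values contain no stray characters other than optional spaces around the commas. The name keep_allowed is hypothetical.

```python
def keep_allowed(csv_values, allowed):
    """Return the comma-separated values from csv_values that are in allowed."""
    allowed_strings = {str(v) for v in allowed}  # set for fast membership tests
    tokens = [v.strip() for v in csv_values.split(',')]
    # keep the original order of the values that pass the filter
    return ','.join(t for t in tokens if t in allowed_strings)

rows = ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8']
filtered = [keep_allowed(r, [2, 3, 9]) for r in rows]
print(filtered)  # ['2,3', '', '3,2,9', '3']
```

On a dataframe, the same function could be passed to apply on the string column, for example df[1].apply(keep_allowed, allowed=[2, 3, 9]), since Series.apply forwards extra keyword arguments to the function.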
You can filter the column values using the isin() method. See the example below.
import pandas as pd
data = {'A': [1,2,3,4,6],
        'B': [1,5,6,7],
        'C': [1,3,2,8,9,7],
        'D': [1,3,6,8]}
allow_list = [2,3,9]  # list of allowed elements
# build one column per key; shorter lists are padded with NaN
df = pd.concat([pd.Series(val, name=key) for key, val in data.items()], axis=1)
df1 = df[df.isin(allow_list)]  # values not in allow_list become NaN
df1 = df1.dropna(how='all')  # remove rows which are all NaN
print(df1)
Output:
A C B D
1 2.0 3.0 NaN 3.0
2 3.0 2.0 NaN NaN
4 NaN 9.0 NaN NaN
What's a "comma separated list"? A string?
– Denziloe
Sep 17 '18 at 16:52