Remove specific values in a dataframe based on whether or not they're in an array

I have a dataframe with 2 columns, one of which is a comma separated list of values:

    A    1,2,3,4,6
    B    1,5,6,7
    C    1,3,2,8,9,7
    D    1,3,6,8

I also have an array: [2,3,9]

And I'd like to end up with the same dataframe, transformed so that values not in the array are filtered out, e.g.:

    A    2,3
    B
    C    3,2,9
    D    3

Could anyone point me in the right direction? I've had a look around but have hit a bit of a wall.

What's a "comma separated list"? A string?

– Denziloe
Sep 17 '18 at 16:52

3 Answers
Setup

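    import re
    import pandas as pd

    df = pd.DataFrame({
        'col1': ['A', 'B', 'C', 'D'],
        'col2': ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8']
    })

    # The cells hold strings, so compare against string versions of the allowed values
    good = [str(i) for i in [2, 3, 9]]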

We can use a regular expression and re.findall to extract all acceptable values. We just have to assert that a match doesn't directly follow or precede a digit, so that we don't match digits in the middle of another number:

    rgx = r'(?<!\d)({})(?!\d)'.format('|'.join(good))
    df.assign(out=[','.join(re.findall(rgx, row)) for row in df.col2])

      col1         col2    out
    0    A    1,2,3,4,6    2,3
    1    B      1,5,6,7
    2    C  1,3,2,8,9,7  3,2,9
    3    D      1,3,6,8      3

Regex Explanation

    (?<!        # Negative lookbehind
      \d        # Asserts previous character is *not* a digit
    )
    (           # Matching group
      2|3|9     # Matches either 2 or 3 or 9
    )
    (?!         # Negative lookahead
      \d        # Asserts the following character is *not* a digit
    )
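To see why the lookarounds matter, here is a minimal sketch using only the standard library. The sample string '12,23,3,29' is made up for illustration and is not taken from the question's data, and re.escape is added purely as a precaution (it changes nothing for plain digits).

```python
import re

good = ['2', '3', '9']
alternation = '|'.join(re.escape(g) for g in good)

# Without lookarounds, the pattern also matches digits inside larger numbers
naive = re.compile('({})'.format(alternation))
# With lookarounds, a match may not touch another digit on either side
guarded = re.compile(r'(?<!\d)({})(?!\d)'.format(alternation))

sample = '12,23,3,29'  # hypothetical input, chosen to contain multi-digit numbers

naive_hits = naive.findall(sample)
guarded_hits = guarded.findall(sample)

print(naive_hits)    # ['2', '2', '3', '3', '2', '9']
print(guarded_hits)  # ['3']
```

Only the standalone 3 survives with the guarded pattern, which is the behaviour wanted when the column may contain values such as 12 or 23.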

In my opinion, the apply method provides a very readable solution.

    allowed = [2, 3, 9]
    allowed_string = [str(x) for x in allowed]
    df[1] = df[1].str.split(',')
    df[1] = df[1].apply(lambda x: [y for y in x if y in allowed_string])

Output:

       0          1
    0  A     [2, 3]
    1  B         []
    2  C  [3, 2, 9]
    3  D        [3]
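This approach leaves lists in the column, whereas the question asks for comma-separated strings. A minimal sketch of a helper that filters and re-joins in one step follows; keep_allowed is a hypothetical name introduced here, and the block is written in plain Python so it runs without pandas.

```python
def keep_allowed(cell, allowed):
    """Return the comma-separated values of `cell` that appear in `allowed`."""
    # Cells hold strings, so compare against string versions of the allowed values
    allowed_strings = {str(x) for x in allowed}
    values = [v.strip() for v in cell.split(',')]
    return ','.join(v for v in values if v in allowed_strings)


# The four cells from the question
rows = ['1,2,3,4,6', '1,5,6,7', '1,3,2,8,9,7', '1,3,6,8']
filtered = [keep_allowed(r, [2, 3, 9]) for r in rows]
print(filtered)  # ['2,3', '', '3,2,9', '3']
```

On the original, unsplit string column this could be applied with something like df[1].apply(lambda s: keep_allowed(s, allowed)), assuming the column is labelled 1 as in the code above.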

You can filter the column values using the isin() method. See the example below.

    import pandas as pd

    data = {'A': [1, 2, 3, 4, 6], 'B': [1, 5, 6, 7], 'C': [1, 3, 2, 8, 9, 7], 'D': [1, 3, 6, 8]}
    allow_list = [2, 3, 9]  # list of allowed elements

    # Build one column per key; shorter lists are padded with NaN
    df = pd.concat([pd.Series(val, name=key) for key, val in data.items()], axis=1)
    # Keep allowed values; everything else becomes NaN
    df1 = df[df[df.columns].isin(allow_list)]
    df1 = df1.dropna(how='all')  # remove rows which are all NaN
    print(df1)

Output:

         A    C    B    D
    1  2.0  3.0  NaN  3.0
    2  3.0  2.0  NaN  NaN
    4  NaN  9.0  NaN  NaN
