Extracting only names [closed]

I have the data shown below and need to extract only the names from it. How can this be done? I am using spaCy and filtering entities with label_ == "PERSON", but this approach fails when a line contains only a single name, such as:

Ordered by: Potter

The data is shown below

Data = """
Ordered by: Jacob Green
Ordered by: nurse
Ordered by: doctor
Ordered by: Potter
Ordered by: MD
Ordered by: Doctor
Ordered by Morgan Olivia
Ordered by a physician
Ordered by: Dr. Ali Zafar
"""

Expected Output:

Jacob Green
Potter
Morgan Olivia
Ali Zafar


Maybe you could store the data in a Python dictionary instead? Just a thought. Up vote love is good...
– NewbieWanKenobi
Aug 22 at 4:29

3 Answers

I guess the best you can do is to assume that a person is whatever appears at the end of a line as one or more words (separated by white space), each starting with a capital letter followed only by lower-case letters:

import re

lines = Data.split('\n')
# You may want to play with the definition of a "name"
is_name = re.compile(r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s*$')
[is_name.findall(line) for line in lines]
#[['Jacob Green'],
# [],
# [],
# ['Potter'],
# ['Morgan Olivia'],
# [],
# ['Ali Zafar']]

it could be: Ordered by: MD
– Slickmind
Aug 22 at 13:51

I do not understand your comment.
– DYZ
Aug 22 at 16:10

I meant that a line may contain data such as Ordered by: MD. I have edited my original question.
– Slickmind
Aug 22 at 16:21

The regex that I suggested ignored MD. Should it not?
– DYZ
Aug 22 at 17:34

I really appreciate your efforts, and the regex you have written is very helpful. However, the data is very unpredictable. One of the lines looks like Ordered by: Doctor, and the regex fails on it.
– Slickmind
Aug 22 at 18:09

I assume there are always "by" prefixes, and separators are spaces or tabs:

re.findall(r"(?:\bby:?[ \t]*(?:Dr\.?[ \t]*)?)([A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)*)", Data)
Out: ['Jacob Green', 'Potter', 'Morgan Olivia', 'Ali Zafar']

The group ([A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)*) captures only the name.
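A line such as Ordered by: Doctor still matches a pattern of this shape, because Doctor is a capitalized word. One possible way around that is to filter the captured words against a small stop list. This is only a sketch: NON_NAMES and extract_names are names made up for illustration, and the set of role words is an assumption that would need tuning to the real data.

```python
import re

# Hypothetical stop list of role words; extend it to fit your data.
NON_NAMES = {"Doctor", "Nurse", "Physician"}

# Same idea as the pattern above: capitalized words after "by", optional "Dr."
NAME_AFTER_BY = re.compile(
    r"\bby:?[ \t]*(?:Dr\.?[ \t]*)?([A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)*)"
)

def extract_names(text):
    """Return capitalized names that follow 'by', skipping known role words."""
    names = []
    for candidate in NAME_AFTER_BY.findall(text):
        # Drop role words from the candidate and keep whatever is left.
        words = [w for w in candidate.split() if w not in NON_NAMES]
        if words:
            names.append(" ".join(words))
    return names

data = """
Ordered by: Jacob Green
Ordered by: nurse
Ordered by: doctor
Ordered by: Potter
Ordered by: MD
Ordered by: Doctor
Ordered by Morgan Olivia
Ordered by a physician
Ordered by: Dr. Ali Zafar
"""
print(extract_names(data))
# ['Jacob Green', 'Potter', 'Morgan Olivia', 'Ali Zafar']
```

A stop list cannot tell a role word from a person whose surname happens to be the same word, so it trades a few missed names for fewer false positives.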

it could be: Ordered by: MD
– Slickmind
Aug 22 at 13:50

Although NER fails in this case, the POS tagger still tags Potter as PROPN (proper noun). Perhaps you could use that as a double check on any lines that don't yield a named entity?

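import spacy

# Assumption: the answer does not say which model was loaded; en_core_web_sm is used here.
nlp = spacy.load("en_core_web_sm")

Data = """
Ordered by: Potter
"""
doc = nlp(Data)
for token in doc:
    print(token.text + ' ' + token.pos_)

Output:

 SPACE
Ordered VERB
by ADP
: PUNCT
Potter PROPN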
