Extracting only names [closed]
I have the data shown below and need to extract only the names from it. How can this be done? I am using spaCy and filtering entities with label_ == "PERSON", but this approach fails when a line contains only a single name, for example:
Ordered by: Potter
The full data is shown below:
Data="""
Ordered by: Jacob Green
Ordered by: nurse
Ordered by: doctor
Ordered by: Potter
Ordered by: MD
Ordered by: Doctor
Ordered by Morgan Olivia
Ordered by a physician
Ordered by: Dr. Ali Zafar
"""
Expected Output:
Jacob Green
Potter
Morgan Olivia
Ali Zafar
3 Answers
The best you can probably do is assume that a person's name sits at the end of a line and consists of one or more words, separated by whitespace, each starting with a capital letter followed only by lower-case letters:
import re
lines = Data.split('\n')
# You may want to play with the definition of a "name"
is_name = re.compile(r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s*$')
[is_name.findall(line) for line in lines]
#[['Jacob Green'],
# [],
# [],
# ['Potter'],
# ['Morgan Olivia'],
# [],
# ['Ali Zafar']]
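One gap in this heuristic is a capitalized role word such as "Doctor", which looks exactly like a single name. A minimal sketch of one way around it, assuming you can list the role words that occur in your data: keep the same regex and drop any matched word found in a stop-list. The names ROLE_WORDS and extract_name are hypothetical, and the stop-list is a guess based on the sample lines, not a complete solution.

```python
import re

# Assumption: a hand-made stop-list of role words seen in the sample data.
# Extend it to cover whatever titles appear in your own records.
ROLE_WORDS = {"doctor", "nurse", "physician", "md", "dr"}

# Same idea as above: one or more capitalized words at the end of the line.
NAME_AT_END = re.compile(r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s*$")


def extract_name(line):
    """Return the name at the end of the line, or None if there is none."""
    match = NAME_AT_END.search(line)
    if match is None:
        return None
    # Drop role words so that a lone "Doctor" is rejected.
    words = [w for w in match.group(1).split() if w.lower() not in ROLE_WORDS]
    return " ".join(words) or None


data = """
Ordered by: Jacob Green
Ordered by: nurse
Ordered by: doctor
Ordered by: Potter
Ordered by: MD
Ordered by: Doctor
Ordered by Morgan Olivia
Ordered by a physician
Ordered by: Dr. Ali Zafar
"""

names = [name for name in map(extract_name, data.splitlines()) if name]
print(names)
# ['Jacob Green', 'Potter', 'Morgan Olivia', 'Ali Zafar']
```

This still cannot tell a real surname from an unlisted role word, so treat the stop-list as something to grow as new cases turn up.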
it could be: Ordered by: MD
– Slickmind
Aug 22 at 13:51
I do not understand your comment.
– DYZ
Aug 22 at 16:10
I meant that a line may contain data such as Ordered by: MD. I have edited my original question.
– Slickmind
Aug 22 at 16:21
The regex that I suggested ignored MD. Should it not?
– DYZ
Aug 22 at 17:34
I really appreciate your efforts, and the regex you have written is very helpful. However, the data is very unpredictable. One of the lines reads Ordered by: Doctor, and the regex fails on it.
– Slickmind
Aug 22 at 18:09
I assume there are always "by" prefixes, and separators are spaces or tabs:
import re
re.findall(r"(?:\bby:?[ \t]*(?:Dr\.?[ \t]*)?)([A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)*)", Data)
Out: ['Jacob Green', 'Potter', 'Morgan Olivia', 'Ali Zafar']
The group ([A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)*) captures only the name; the "by" prefix and the optional "Dr." title are matched but not captured.
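To make the pieces of this pattern easier to follow, here is the same regex written with re.VERBOSE and applied one line at a time. This is only an illustrative sketch: ORDERED_BY and find_name are hypothetical names, and the sample lines are taken from the question.

```python
import re

ORDERED_BY = re.compile(
    r"""
    \bby:?[ \t]*                  # the word "by", an optional colon, spaces/tabs
    (?:Dr\.?[ \t]*)?              # an optional "Dr" or "Dr." title, not captured
    (                             # group 1: the name itself
        [A-Z][a-z]+               #   first capitalized word
        (?:[ \t]+[A-Z][a-z]+)*    #   further capitalized words on the same line
    )
    """,
    re.VERBOSE,
)


def find_name(line):
    """Return the captured name for one line, or None if nothing matches."""
    match = ORDERED_BY.search(line)
    return match.group(1) if match else None


for line in [
    "Ordered by: Dr. Ali Zafar",
    "Ordered by Morgan Olivia",
    "Ordered by: MD",
    "Ordered by a physician",
]:
    print(repr(line), "->", find_name(line))
```

Because separators are limited to spaces and tabs, a name can never run across a line break. Note that a capitalized role word such as "Doctor" still satisfies the name group, so it would be returned like any other name.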
it could be: Ordered by: MD
– Slickmind
Aug 22 at 13:50
Although NER fails in this case, the POS tagger still tags the word as PROPN (proper noun). Perhaps you could use that as a second check on any lines that don't yield a named entity?
Data="""
Ordered by: Potter
"""
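import spacy
# Assumption: any English pipeline with a tagger works; en_core_web_sm is used here
nlp = spacy.load("en_core_web_sm")
doc = nlp(Data)
for token in doc:
    print(token.text + ' ' + token.pos_)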
SPACE
Ordered VERB
by ADP
: PUNCT
Potter PROPN
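The double check suggested above could be sketched roughly as follows: take the PERSON entities when NER finds any, otherwise fall back to runs of consecutive PROPN tokens. The helper names names_from_doc and extract_names_with_spacy are hypothetical, and the en_core_web_sm model is an assumption (any English pipeline with a tagger and NER should do). spaCy is imported inside the second function only so the grouping logic can be read and tried on its own.

```python
def names_from_doc(doc):
    """Return PERSON entities, or runs of PROPN tokens if NER found none."""
    people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    if people:
        return people

    # Fallback: join consecutive proper-noun tokens into one candidate name.
    names, current = [], []
    for token in doc:
        if token.pos_ == "PROPN":
            current.append(token.text)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


def extract_names_with_spacy(lines, model="en_core_web_sm"):
    """Run each line through spaCy and collect candidate names per line."""
    import spacy  # lazy import; assumes spaCy and the model are installed

    nlp = spacy.load(model)
    return [names_from_doc(nlp(line)) for line in lines]
```

Titles and abbreviations such as "Dr." or "MD" may also be tagged PROPN, so the fallback probably still needs a filter for role words before its output can be trusted.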
Maybe you could store the data in a Python dictionary instead? Just a thought. Up vote love is good...
– NewbieWanKenobi
Aug 22 at 4:29