Extracting only names [closed]




I have the data shown below and need to extract only the names from it. How can this be done? I am using spaCy and filtering entities with label_ == "PERSON", but this approach fails when a line contains only a single name, such as:


Ordered by: Potter



The data is shown below


Data="""

Ordered by: Jacob Green
Ordered by: nurse
Ordered by: doctor
Ordered by: Potter
Ordered by: MD
Ordered by: Doctor
Ordered by Morgan Olivia
Ordered by a physician
Ordered by: Dr. Ali Zafar
"""



Expected Output:


Jacob Green
Potter
Morgan Olivia
Ali Zafar








Maybe you could store the data in a Python dictionary instead? Just a thought. Up vote love is good...
– NewbieWanKenobi
Aug 22 at 4:29




3 Answers



I guess the best you can do is to assume that a person's name appears at the end of a line and consists of one or more words (separated by whitespace), each starting with a capital letter followed only by lower-case letters:


import re
lines = Data.split('\n')

# You may want to play with the definition of a "name"
is_name = re.compile(r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s*$')

[is_name.findall(line) for line in lines]
#[['Jacob Green'],
# [],
# [],
# ['Potter'],
# ['Morgan Olivia'],
# [],
# ['Ali Zafar']]





it could be: Ordered by: MD
– Slickmind
Aug 22 at 13:51





I do not understand your comment.
– DYZ
Aug 22 at 16:10





I meant that a line may contain data such as "Ordered by: MD". I have edited my original question.
– Slickmind
Aug 22 at 16:21






The regex that I suggested ignored MD. Should it not?
– DYZ
Aug 22 at 17:34





I really appreciate your efforts, and the regex you have written is very helpful. However, the data is very unpredictable. One of the lines is "Ordered by: Doctor", and the regex fails on it.
– Slickmind
Aug 22 at 18:09



I assume there are always "by" prefixes, and separators are spaces or tabs:


re.findall(r"(?:\bby:?[ \t]*(?:Dr\.?[ \t]*)?)([A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)*)",Data)
Out: ['Jacob Green', 'Potter', 'Morgan Olivia', 'Ali Zafar']



([A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)*) We capture only the name.





it could be: Ordered by: MD
– Slickmind
Aug 22 at 13:50
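One possible way to handle lines such as "Ordered by: Doctor" is to keep the regex approach and then discard candidates that consist only of role words. The sketch below is an illustration, not a definitive solution: the ROLE_WORDS set and the extract_names helper are hypothetical names introduced here, and the set would need to be extended for real data.

```python
import re

# Hypothetical exclusion list: words that look like names but are roles.
ROLE_WORDS = {"doctor", "nurse", "physician", "md", "dr"}

# "by", an optional colon, an optional "Dr."/"Dr" title, then one or
# more capitalized words on the same line.
NAME_AFTER_BY = re.compile(
    r"\bby:?[ \t]*(?:Dr\.?[ \t]*)?([A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)*)"
)

def extract_names(text):
    """Return capitalized candidates after 'by' that are not role words."""
    names = []
    for match in NAME_AFTER_BY.finditer(text):
        candidate = match.group(1)
        # Skip candidates made up entirely of role words, e.g. "Doctor".
        if all(word.lower() in ROLE_WORDS for word in candidate.split()):
            continue
        names.append(candidate)
    return names

data = """
Ordered by: Jacob Green
Ordered by: nurse
Ordered by: doctor
Ordered by: Potter
Ordered by: MD
Ordered by: Doctor
Ordered by Morgan Olivia
Ordered by a physician
Ordered by: Dr. Ali Zafar
"""

print(extract_names(data))
# ['Jacob Green', 'Potter', 'Morgan Olivia', 'Ali Zafar']
```

A fixed exclusion list is a trade-off: it is easy to audit, but any role word missing from it will still be returned as a name.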



Although the NER is failing in this case, the POS tagging is still tagging this as a PROPN (proper noun). Perhaps you could use that feature as a double check on any lines that don't yield a named entity?


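import spacy

# Assumes the small English model is installed; any English pipeline with a tagger should work
nlp = spacy.load("en_core_web_sm")

Data="""
Ordered by: Potter
"""

doc = nlp(Data)

for token in doc:
    print(token.text + ' ' + token.pos_)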

SPACE
Ordered VERB
by ADP
: PUNCT
Potter PROPN
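The fallback idea could be sketched roughly as follows. To keep the example runnable without a model, it works on plain (text, pos) pairs, which with spaCy would presumably be built as [(t.text, t.pos_) for t in nlp(line)]. The propn_name helper and the ROLE_WORDS set are hypothetical names introduced for illustration; the role filter is an added assumption, since a capitalized word such as "Doctor" may also be tagged as a proper noun.

```python
# Hypothetical exclusion list of titles and roles to ignore.
ROLE_WORDS = {"doctor", "nurse", "physician", "md", "dr", "dr."}

def propn_name(tagged_tokens):
    """Join the PROPN tokens of one line into a name, skipping role words.

    tagged_tokens is an iterable of (text, pos) pairs. Returns None when
    no proper noun is left after filtering.
    """
    parts = [
        text
        for text, pos in tagged_tokens
        if pos == "PROPN" and text.lower() not in ROLE_WORDS
    ]
    return " ".join(parts) or None

# The pairs below mirror the tagger output shown above.
tagged = [
    ("Ordered", "VERB"),
    ("by", "ADP"),
    (":", "PUNCT"),
    ("Potter", "PROPN"),
]

print(propn_name(tagged))  # Potter
```

This would only be called for lines where doc.ents yields no PERSON entity, so the NER result is still preferred when it exists.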
