Missing last word in a sentence when using regular expression




Code:


import re

def main():
    a = ['the mississippi is well worth reading about',
         ' it is not a commonplace river, but on the contrary is in all ways remarkable']
    b = word_find(a)
    print(b)

def word_find(sentence_list):
    word_list = []
    # optional leading delimiter, lazy capture group for the word, required trailing delimiter
    word_reg = re.compile(r"[(|)|,|\'|\"|:|\[|\]|| |\-\-+|\t|;]?(.+?)[(|)|,|\'|\"|:|\[|\]|| |\-\-+|\t|;]")
    for i in range(len(sentence_list)):
        words = re.findall(word_reg, sentence_list[i])
        word_list.append(words)
    return word_list

main()



What I need is to break every word into a single element of a list.



Now the output looks like this:


[['the', 'mississippi', 'is', 'well', 'worth', 'reading'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways']]



I found that the last word of the first sentence ('about') and the last word of the second sentence ('remarkable') are missing.





It might be some problem in my regular expression:


word_reg = re.compile(r"[(|)|,|\'|\"|:|\[|\]|| |\-\-+|\t|;]?(.+?)[(|)|,|\'|\"|:|\[|\]|| |\-\-+|\t|;]")



But if I add a question mark to the last part of this regular expression, like this:


[(|)|,|\'|\"|:|\[|\]|| |\-\-+|\t|;]?")



the result becomes many single letters instead of words. What can I do about it?



Edit:



The reason I didn't use string.split is that there might be many ways for people to separate words.



For example, when people input a--b, there is no space, but we still have to break it into 'a', 'b'.








Is there a reason why you do not want to split the string on whitespace, like string.split(' ')?

– Moritz
Sep 16 '18 at 17:59








I edited the question to explain why not string.split(' ').

– Yiling Liu
Sep 16 '18 at 18:00





5 Answers



Using the right tools is always the winning strategy. In your case, the right tool is the NLTK word tokenizer, because it was designed to do just that: break sentences into words.


import nltk
a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
nltk.word_tokenize(a[1])
#['it', 'is', 'not', 'a', 'commonplace', 'river', ',', 'but',
# 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
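
If you want the same nested list of words as your word_find, you can map the tokenizer over the whole list. A minimal sketch (it assumes the 'punkt' tokenizer data is installed, e.g. via nltk.download('punkt')):


import nltk
# nltk.download('punkt')  # one-time download of the tokenizer data, if it is missing

a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
word_list = [nltk.word_tokenize(s) for s in a]
print(word_list)
# each sentence becomes its own list of tokens; punctuation such as ',' is kept as a token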



I suggest a simpler solution:


b = [re.split(r"[\W_]", s) for s in a]



The regex [\W_] matches any single non-word character (anything that is not a letter, digit, or underscore) plus the underscore itself, which is practically enough.





Your current regex requires that the word be followed by one of the characters in your list, but never by "end of line", which can be matched with $.


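
For example, a minimal sketch of that fix applied to a simplified version of your pattern (the delimiter class below is illustrative, not your full one):


import re

# the word may now be terminated either by a delimiter or by the end of the string ($)
word_reg = re.compile(r"[(),'\":\[\] \t;]?(.+?)(?:[(),'\":\[\] \t;]|$)")
print(word_reg.findall('the mississippi is well worth reading about'))
# ['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about']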



You can use re.split and filter:




filter(None, re.split("[, -!?:]+", a))



Where I have put the string "[, -!?:]+", you should put whatever characters your delimiters are. filter will just remove any empty strings caused by leading/trailing separators.


"[, -!?:]+"


filter
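
Applied to your list, a sketch might look like this (the delimiter set is illustrative, and list() is needed in Python 3 because filter returns an iterator):


import re

a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
# '-' is placed last in the class so it is a literal hyphen, which also splits "a--b"
b = [list(filter(None, re.split(r"[ ,!?:;-]+", s))) for s in a]
print(b)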



You can either find what you don't want and split on that:


>>> a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
>>> [re.split(r'\W+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['', 'it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]



(You may need to filter out the '' elements produced by re.split.)





Or capture what you do want with re.findall and keep those elements:




>>> [re.findall(r'\b\w+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
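
For what it's worth, this also covers the a--b case from the question's edit, since the dashes are non-word characters:


>>> re.findall(r'\b\w+', 'a--b')
['a', 'b']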



Thanks, everybody!



From the other answers, the solution is to use re.split(),



and there is a SUPER STAR, NLTK, in the uppermost answer.


def word_find(sentence_list):
    word_list = []
    for i in range(len(sentence_list)):
        # split on any of the delimiters; "--+" covers runs of two or more dashes (e.g. "a--b")
        word_list.append(re.split(r"\(|\)|,|'|\"|:|\[|\]| |--+|\t|;", sentence_list[i]))
    return word_list
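
A possible usage sketch (reusing the word_find above, with import re assumed): re.split leaves empty strings where delimiters are adjacent, so they may need to be filtered out:


a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
# drop the empty strings produced around adjacent delimiters (e.g. the ", " after "river")
word_list = [[w for w in words if w] for words in word_find(a)]
print(word_list)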






No need to use that many | (or); try this instead: [(),'":\[\] \t;-]+

– Srdjan M.
Sep 16 '18 at 18:16





