Missing last word in a sentence when using regular expression
Code:
import re

def main():
    a = ['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
    b = word_find(a)
    print(b)

def word_find(sentence_list):
    word_list = []
    word_reg = re.compile(r"[\(|\)|,|'|\"|:|\[|\]| |\-\-+|\t|;]?(.+?)[\(|\)|,|'|\"|:|\[|\]| |\-\-+|\t|;]")
    for i in range(len(sentence_list)):
        words = re.findall(word_reg, sentence_list[i])
        word_list.append(words)
    return word_list

main()
What I need is to break every sentence into a list of its individual words.
Right now the output looks like this:
[['the', 'mississippi', 'is', 'well', 'worth', 'reading'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways']]
The last word of the first sentence, 'about', and the last word of the second sentence, 'remarkable', are missing.
It might be a problem in my regular expression:
word_reg = re.compile(r"[\(|\)|,|'|\"|:|\[|\]| |\-\-+|\t|;]?(.+?)[\(|\)|,|'|\"|:|\[|\]| |\-\-+|\t|;]")
But if I add a question mark after the last part of this regular expression, like this:
[\(|\)|,|'|\"|:|\[|\]| |\-\-+|\t|;]?")
the result becomes many single letters instead of words. What can I do about it?
Edit:
The reason I didn't use string.split(' ') is that there may be many ways for people to separate words.
For example, when people input a--b there is no space, but we have to break it into 'a', 'b'.
I edited the question to explain why not string.split(" ")
– Yiling Liu
Sep 16 '18 at 18:00
5 Answers
Using the right tools is always the winning strategy. In your case, the right tool is the NLTK word tokenizer, because it was designed to do just that: break sentences into words.
import nltk
a = ['the mississippi is well worth reading about',
' it is not a commonplace river, but on the contrary is in all ways remarkable']
nltk.word_tokenize(a[1])
#['it', 'is', 'not', 'a', 'commonplace', 'river', ',', 'but',
# 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
I suggest a simpler solution:
b = [re.split(r"[\W_]", s) for s in a]
The regex [\W_] matches any single non-word character (anything other than a letter, digit, or underscore) plus the underscore, which is enough in practice.
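One detail worth checking before relying on this: because the class matches a single character, two delimiters in a row (such as a comma followed by a space) produce an empty string in the result. The sketch below compares that with a `+` quantifier on the class; the sample sentence is made up for illustration and is not from the question.

```python
import re

s = "a commonplace river, but remarkable"

# One delimiter per split: ", " is two delimiters, so an empty string appears.
single = re.split(r"[\W_]", s)
print(single)

# "+" treats a run of delimiters as one separator, so no empty string in the middle.
runs = re.split(r"[\W_]+", s)
print(runs)
```

Leading or trailing delimiters can still leave an empty string at either end, even with the `+`.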
Your current regex requires that each word be followed by one of the characters in your list, so the last word, which is followed only by the end of the line, never matches. The end of the line can be matched with $.
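A minimal sketch of that idea, assuming a simplified delimiter class (the name DELIMS and its contents are illustrative, not the asker's exact pattern): the trailing delimiter is replaced by "delimiter or end of string".

```python
import re

# Simplified delimiter class, assumed for illustration only.
DELIMS = r"[(),'\":\[\] \t;-]"

# Original shape: a word must be followed by a delimiter, so the last word is lost.
needs_delim = re.compile(DELIMS + r"?(.+?)" + DELIMS)

# Same shape, but the word may also be followed by the end of the string.
delim_or_end = re.compile(DELIMS + r"?(.+?)(?:" + DELIMS + r"|$)")

sentence = "the mississippi is well worth reading about"
print(needs_delim.findall(sentence))
print(delim_or_end.findall(sentence))
print(delim_or_end.findall("a--b"))
```

Making the trailing class optional with `?` does not help, because the lazy `(.+?)` is then satisfied after a single character, which is where the single letters come from.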
You can use re.split and filter:
filter(None, re.split(r"[, \-!?:]+", a))
Where I have put the string "[, \-!?:]+", you should put whatever characters are your delimiters. filter will just remove any empty strings caused by leading or trailing separators.
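Applied to the list from the question, this could look like the sketch below. The helper name split_words and the delimiter set are assumptions chosen for illustration. In Python 3, filter returns an iterator, so it is wrapped in list() here.

```python
import re

# Delimiter set chosen for illustration; extend it as needed.
SPLIT_RE = re.compile(r"[\s,!?:;-]+")

def split_words(sentence):
    """Split one sentence on runs of delimiters and drop empty strings."""
    return list(filter(None, SPLIT_RE.split(sentence)))

a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
b = [split_words(s) for s in a]
print(b)
print(split_words("a--b"))
```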
You can either find what you don't want and split on that:
>>> a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
>>> [re.split(r'\W+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['', 'it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
(You may need to filter out the '' elements produced by re.split.)
Or capture what you do want with re.findall and keep those elements:
>>> [re.findall(r'\b\w+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
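A few edge cases may be worth trying with the findall approach. `\w` matches letters, digits, and the underscore, so the a--b case from the question splits as wanted, but underscores stay inside words and apostrophes break them. The sample strings and the apostrophe variant at the end are illustrative assumptions, not part of the original answer.

```python
import re

samples = ["a--b", "snake_case name", "don't stop"]

# "_" counts as a word character; "-" and "'" act as separators.
results = [re.findall(r"\w+", s) for s in samples]
for s, words in zip(samples, results):
    print(s, "->", words)

# Possible variant (an assumption): allow one inner apostrophe per word.
with_apostrophe = re.findall(r"\w+(?:'\w+)?", "don't stop")
print(with_apostrophe)
```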
Thanks, everybody.
From the other answers, the solution is to use re.split(), and there is a SUPER STAR, NLTK, in the top answer.
def word_find(sentence_list):
    word_list = []
    for i in range(len(sentence_list)):
        word_list.append(re.split(r'\(|\)|,|\'|"|:|\[|\]| |\-\-+|\t|;', sentence_list[i]))
    return word_list
No need to use that many | (or); try this instead: [(),'":\[\] \t;-]+
– Srdjan M.
Sep 16 '18 at 18:16
Is there a reason why you do not want to split the string on whitespace, like string.split(' ')?
– Moritz
Sep 16 '18 at 17:59