Missing last word in a sentence when using regular expression




Code:


import re

def main():
    a = ['the mississippi is well worth reading about',
         ' it is not a commonplace river, but on the contrary is in all ways remarkable']
    b = word_find(a)
    print(b)

def word_find(sentence_list):
    word_list = []
    # optional leading delimiter, lazy capture group for the word, required trailing delimiter
    word_reg = re.compile(r"[(|)|,|\'|\"|:|\[|\]|| |\-\-+|\t|;]?(.+?)[(|)|,|\'|\"|:|\[|\]|| |\-\-+|\t|;]")
    for i in range(len(sentence_list)):
        words = re.findall(word_reg, sentence_list[i])
        word_list.append(words)
    return word_list

main()



What I need is to break every word into a single element of a list.



Now the output looks like this:


[['the', 'mississippi', 'is', 'well', 'worth', 'reading'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways']]



I found that the last word of the first sentence ('about') and the last word of the second sentence ('remarkable') are missing.





It might be some problem in my regular expression:


word_reg = re.compile(r"[(|)|,|\'|\"|:|\[|\]|| |\-\-+|\t|;]?(.+?)[(|)|,|\'|\"|:|\[|\]|| |\-\-+|\t|;]")



But if I add a question mark to the last part of this regular expression, like this:


[(|)|,|\'|\"|:|\[|\]|| |\-\-+|\t|;]?")



the result becomes many single letters instead of words. What can I do about it?



Edit:



The reason I didn't use string.split is that there might be many ways for people to separate words.



For example, when people input a--b, there is no space, but we still have to break it into 'a', 'b'.








Is there a reason why you do not want to split the string on whitespace, like string.split(' ')?

– Moritz
Sep 16 '18 at 17:59








I edited the question to explain why not string.split(' ').

– Yiling Liu
Sep 16 '18 at 18:00





5 Answers



Using the right tools is always the winning strategy. In your case, the right tool is the NLTK word tokenizer, because it was designed to do just that: break sentences into words.


import nltk
a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
nltk.word_tokenize(a[1])
#['it', 'is', 'not', 'a', 'commonplace', 'river', ',', 'but',
# 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
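
If you want the same nested list of words as your word_find, you can map the tokenizer over the whole list. A minimal sketch (it assumes the 'punkt' tokenizer data is installed, e.g. via nltk.download('punkt')):


import nltk
# nltk.download('punkt')  # one-time download of the tokenizer data, if it is missing

a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
word_list = [nltk.word_tokenize(s) for s in a]
print(word_list)
# each sentence becomes its own list of tokens; punctuation such as ',' is kept as a token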



I suggest a simpler solution:


b = [re.split(r"[\W_]", s) for s in a]



The regex [\W_] matches any single non-word character (anything that is not a letter, digit, or underscore) plus the underscore itself, which is practically enough.





Your current regex requires that the word be followed by one of the characters in your list, but never by "end of line", which can be matched with $.


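
For example, a minimal sketch of that fix applied to a simplified version of your pattern (the delimiter class below is illustrative, not your full one):


import re

# the word may now be terminated either by a delimiter or by the end of the string ($)
word_reg = re.compile(r"[(),'\":\[\] \t;]?(.+?)(?:[(),'\":\[\] \t;]|$)")
print(word_reg.findall('the mississippi is well worth reading about'))
# ['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about']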



You can use re.split and filter:




filter(None, re.split("[, -!?:]+", a))



Where I have put the string "[, -!?:]+", you should put whatever characters your delimiters are. filter will just remove any empty strings caused by leading/trailing separators.


"[, -!?:]+"


filter
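
Applied to your list, a sketch might look like this (the delimiter set is illustrative, and list() is needed in Python 3 because filter returns an iterator):


import re

a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
# '-' is placed last in the class so it is a literal hyphen, which also splits "a--b"
b = [list(filter(None, re.split(r"[ ,!?:;-]+", s))) for s in a]
print(b)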



You can either find what you don't want and split on that:


>>> a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
>>> [re.split(r'\W+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['', 'it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]



(You may need to filter out the '' elements produced by re.split.)





Or capture what you do want with re.findall and keep those elements:




>>> [re.findall(r'\b\w+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
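
For what it's worth, this also covers the a--b case from the question's edit, since the dashes are non-word characters:


>>> re.findall(r'\b\w+', 'a--b')
['a', 'b']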



Thanks, everybody!



From the other answers, the solution is to use re.split(),



and there is a SUPER STAR, NLTK, in the uppermost answer.


def word_find(sentence_list):
    word_list = []
    for i in range(len(sentence_list)):
        # split on any of the delimiters; "--+" covers runs of two or more dashes (e.g. "a--b")
        word_list.append(re.split(r"\(|\)|,|'|\"|:|\[|\]| |--+|\t|;", sentence_list[i]))
    return word_list
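
A possible usage sketch (reusing the word_find above, with import re assumed): re.split leaves empty strings where delimiters are adjacent, so they may need to be filtered out:


a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
# drop the empty strings produced around adjacent delimiters (e.g. the ", " after "river")
word_list = [[w for w in words if w] for words in word_find(a)]
print(word_list)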






No need to use that many | (or); try this instead: [(),'":\[\] \t;-]+

– Srdjan M.
Sep 16 '18 at 18:16





