Creating a regex to match the following scenario
I am a student working on a small research project where I need to scrape web pages that match the following requirement:
If word X, say "abc", is found anywhere in the text, check whether pattern Y, say "pqr", occurs within a 25-character window on either side of the occurrence of X.
E.g.
pqrxyz is valid.
xyz is invalid.
xyzpqr is valid.
pqr123456789123456789123456789xyz is invalid.
I can't figure this out. Any help will be greatly appreciated.
((?=pqr).20,abc) | (pqr20,(?!abc))
This is my attempt so far. I don't know how to incorporate the 20 character window constraint.
You may use r'pqr.{0,25}xyz|xyz.{0,25}pqr'
– anubhava
Aug 27 at 2:19
2 Answers
A dot (.) is the regular expression for "any single character."
{n,m} is the regular expression for "at least n, and no more than m, repetitions of the previous regular expression."
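As a small illustration of these two building blocks, here is a sketch using Python's re module. The sample strings and variable names are made up for this example. One caveat that may matter for scraped web pages: by default the dot does not match a newline unless re.DOTALL is passed.

```python
import re

# The dot matches any single character, except a newline by default.
dot_letter = bool(re.fullmatch(r"a.c", "abc"))
dot_newline = bool(re.fullmatch(r"a.c", "a\nc"))
dot_newline_dotall = bool(re.fullmatch(r"a.c", "a\nc", flags=re.DOTALL))

# {n,m} repeats the previous element at least n and at most m times.
two_x = bool(re.fullmatch(r"x{2,3}", "xx"))
three_x = bool(re.fullmatch(r"x{2,3}", "xxx"))
one_x = bool(re.fullmatch(r"x{2,3}", "x"))
four_x = bool(re.fullmatch(r"x{2,3}", "xxxx"))

print(dot_letter, dot_newline, dot_newline_dotall)
print(two_x, three_x, one_x, four_x)
```

If the text being searched can contain line breaks between the two words, adding flags=re.DOTALL to the search is probably what you want.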
So, the regular expression xyz.{0,25}pqr means "xyz, followed by up to twenty-five characters, followed by pqr".
So, accounting for the possibility of pqr occurring before or after xyz, we get this line of Python code:
if re.search('pqr.{0,25}xyz', line) or re.search('xyz.{0,25}pqr', line):
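To check this against the four examples from the question, the two searches can be wrapped in a small helper. This is only a sketch: the function name near and the constant WINDOW are hypothetical, and it assumes "within a 25 character window" means at most 25 characters between the end of one word and the start of the other.

```python
import re

WINDOW = 25  # assumed meaning: max characters allowed between the two words


def near(line, x="xyz", y="pqr", window=WINDOW):
    """Return True if x and y occur within `window` characters of each other."""
    # Escape the words so characters like "." in them are taken literally.
    x, y = re.escape(x), re.escape(y)
    gap = ".{0,%d}" % window
    return bool(re.search(y + gap + x, line) or re.search(x + gap + y, line))


examples = [
    "pqrxyz",
    "xyz",
    "xyzpqr",
    "pqr123456789123456789123456789xyz",
]
results = {text: near(text) for text in examples}

for text, ok in results.items():
    print(text, "is valid" if ok else "is invalid")
```

The last example fails because 27 characters separate pqr from xyz, which is more than the window allows.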
Something like this should work, handling both cases: pqr.{,25}?xyz|xyz.{,25}?pqr
I used Debuggex to test, and I think it's an easy way to show how the regex is working.
John's answer gives more details on what the specific elements in the regex mean.
[Image: regular expression visualization]
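For a plain yes/no test, the lazy form .{,25}? and the greedy form .{0,25} should give the same answer. They differ only in which stretch of text is matched, as the sketch below suggests. The sample string is made up for illustration. Note that the {,25} shorthand (no lower bound) is accepted by Python's re module, but other regex engines may not treat it the same way.

```python
import re

text = "pqr--xyz--xyz"

greedy = re.search(r"pqr.{0,25}xyz", text)
lazy = re.search(r"pqr.{,25}?xyz", text)

greedy_match = greedy.group()  # extends to the last xyz within the window
lazy_match = lazy.group()      # stops at the first xyz

print(greedy_match)
print(lazy_match)
```

If you want to extract the matched text rather than just test for it, the lazy version gives the shortest span starting from each pqr.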
Have you made any attempt to write such a regular expression yourself yet? Please post the code you've tried
– CertainPerformance
Aug 27 at 1:50