Creating a regex to match the following scenario




I am a student working on a small research project where I need to scrape web pages that match the following requirement:
If word X (say "abc") is found anywhere in the text, check whether pattern Y (say "pqr") occurs within a 25-character window on either side of the occurrence of X.
E.g.

pqrxyz is valid.

xyz is invalid.

xyzpqr is valid.

pqr123456789123456789123456789xyz is invalid.

I can't figure this out. Any help will be greatly appreciated.


((?=pqr).{20,}abc) | (pqr{20,}(?!abc))



This is my attempt so far. I don't know how to incorporate the 20 character window constraint.





Have you made any attempt to write such a regular expression yourself yet? Please post the code you've tried
– CertainPerformance
Aug 27 at 1:50





You may use r'pqr.{0,25}xyz|xyz.{0,25}pqr'
– anubhava
Aug 27 at 2:19






2 Answers



. is the regular expression for "any single character" (other than a newline, by default).



{n,m} is the regular expression for "at least n, and no more than m, repetitions of the previous regular expression."



So, the regular expression xyz.{0,25}pqr means "xyz, followed by up to twenty-five characters, followed by pqr".



So, accounting for the possibility of pqr occurring before or after xyz, we get this line of Python code:

if re.search(r'pqr.{0,25}xyz', line) or re.search(r'xyz.{0,25}pqr', line):
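
To check this against the four examples from the question, here is a minimal sketch. The helper name near and its parameters are made up for illustration, and it assumes "within a 25 character window" means at most 25 characters between the two strings.

```python
import re

# Window size taken from the question; change it if your requirement differs.
WINDOW = 25

def near(line, x="xyz", y="pqr", window=WINDOW):
    """Return True if y occurs with at most `window` characters
    between it and x, on either side of x."""
    gap = ".{0,%d}" % window
    # re.escape keeps x and y literal even if they contain regex metacharacters.
    y_then_x = re.escape(y) + gap + re.escape(x)
    x_then_y = re.escape(x) + gap + re.escape(y)
    return bool(re.search(y_then_x, line) or re.search(x_then_y, line))

samples = ["pqrxyz", "xyz", "xyzpqr", "pqr123456789123456789123456789xyz"]
results = {s: near(s) for s in samples}
for s in samples:
    print(s, "valid" if results[s] else "invalid")
```

The last sample has 27 characters between pqr and xyz, so it falls outside the window. Because . does not match a newline by default, text spanning several lines would need re.DOTALL passed to re.search.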



Something like this should work, handling both cases:

pqr.{,25}?xyz|xyz.{,25}?pqr
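
The trailing ? makes the quantifier lazy. As a rough illustration (the sample text is made up, and {0,25} is written out in place of the equivalent {,25}), the lazy and greedy versions agree on whether there is a match but can differ in how much text the match covers:

```python
import re

# Greedy: .{0,25} grabs as many characters as it can before backing off.
greedy = re.compile(r"pqr.{0,25}xyz|xyz.{0,25}pqr")
# Lazy: .{0,25}? grabs as few characters as it can.
lazy = re.compile(r"pqr.{0,25}?xyz|xyz.{0,25}?pqr")

text = "pqr--xyz--xyz"  # hypothetical sample with two xyz occurrences

greedy_match = greedy.search(text).group(0)
lazy_match = lazy.search(text).group(0)

print(greedy_match)  # pqr--xyz--xyz
print(lazy_match)    # pqr--xyz
```

If you only need a yes/no answer, either form should do; the lazy one matters when you want the shortest matching span.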



I used Debuggex to test, and I think it's an easy way to show how the regex works.



John's answer gives more details on what the specific elements in the regex do.



Regular expression visualization





