Creating a regex to match the following scenario



I am a student working on a small research project where I need to scrape web pages that match the following requirement:
If a word X, say "abc", is found anywhere in the text, check whether a pattern Y, say "pqr", occurs within a 25-character window on either side of that occurrence of X.
E.g.:

pqrxyz is valid.
xyz is invalid.
xyzpqr is valid.
pqr123456789123456789123456789xyz is invalid.



I can't figure this out. Any help will be greatly appreciated.


((?=pqr).20,abc) | (pqr20,(?!abc))



This is my attempt so far. I don't know how to incorporate the 20 character window constraint.





Have you made any attempt to write such a regular expression yourself yet? Please post the code you've tried
– CertainPerformance
Aug 27 at 1:50





You may use r'pqr.{0,25}xyz|xyz.{0,25}pqr'
– anubhava
Aug 27 at 2:19






2 Answers



. is the regular expression for "any single character" (by default, any character except a newline).



{n,m} is the regular expression for "at least n, and no more than m, repetitions of the previous regular expression."
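As a quick check of these two building blocks, here is a minimal sketch using Python's re module. The test strings and the helper name repeats_ok are made up for illustration; only re.fullmatch and the re.DOTALL flag come from the standard library.

```python
import re

# "." matches one character, but not a newline unless re.DOTALL is set.
dot_plain = re.fullmatch(r".", "a") is not None
dot_newline = re.fullmatch(r".", "\n") is not None
dot_newline_dotall = re.fullmatch(r".", "\n", flags=re.DOTALL) is not None

def repeats_ok(count):
    """Return True if 'a' repeated `count` times satisfies a{2,3}."""
    return re.fullmatch(r"a{2,3}", "a" * count) is not None

print(dot_plain, dot_newline, dot_newline_dotall)  # True False True
print([repeats_ok(n) for n in range(1, 5)])        # [False, True, True, False]
```

The newline behaviour matters for scraped pages: if X and Y may sit on different lines, compiling the pattern with re.DOTALL is probably what you want.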



So, the regular expression xyz.{0,25}pqr means "xyz, followed by up to twenty-five characters, followed by pqr".
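To see where the 25-character limit kicks in, the following sketch builds strings with a known number of filler characters between xyz and pqr and tests them against xyz.{0,25}pqr. The helper gap_matches and the "-" filler are illustrative choices, not part of the question.

```python
import re

# xyz, then 0 to 25 of any (non-newline) character, then pqr.
pattern = re.compile(r"xyz.{0,25}pqr")

def gap_matches(gap_length):
    """Return True if xyz and pqr separated by gap_length fillers match."""
    text = "xyz" + "-" * gap_length + "pqr"
    return pattern.search(text) is not None

for n in (0, 1, 25, 26):
    print(n, gap_matches(n))
# 0 True
# 1 True
# 25 True
# 26 False
```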



So, accounting for the possibility of pqr occurring before or after xyz, we get this line of Python code:


if re.search(r'pqr.{0,25}xyz', line) or re.search(r'xyz.{0,25}pqr', line):
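For completeness, here is a small runnable sketch that applies this check to the four examples from the question. It folds the two searches into one alternation, as suggested in the comments; the names WINDOW and is_valid are hypothetical.

```python
import re

# Either order: pqr then xyz, or xyz then pqr, at most 25 characters apart.
WINDOW = re.compile(r"pqr.{0,25}xyz|xyz.{0,25}pqr")

def is_valid(line):
    """Return True if pqr occurs within 25 characters of xyz in `line`."""
    return WINDOW.search(line) is not None

examples = [
    "pqrxyz",
    "xyz",
    "xyzpqr",
    "pqr123456789123456789123456789xyz",  # 27 characters in between
]

for text in examples:
    print(text, "->", "valid" if is_valid(text) else "invalid")
```

This prints valid, invalid, valid, invalid, in that order, which agrees with the expectations stated in the question.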



Something like this should work, handling both cases:
pqr.{,25}?xyz|xyz.{,25}?pqr
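The trailing ? makes the quantifier lazy, and {,25} is Python shorthand for {0,25}. Whether a string matches at all is the same with or without the ?, but the matched text can differ. A minimal sketch with a made-up sample string:

```python
import re

text = "pqr--xyz--xyz"

# Greedy: take as many characters as possible before the final xyz.
greedy = re.search(r"pqr.{0,25}xyz", text)
# Lazy: stop at the first xyz that satisfies the pattern.
lazy = re.search(r"pqr.{,25}?xyz", text)

print(greedy.group())  # pqr--xyz--xyz
print(lazy.group())    # pqr--xyz
```

If you only need a yes/no answer per page, either form should do; the lazy form matters mainly when you want to extract the shortest matching span.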



I used Debuggex to test, and I think it's an easy way to show how the regex works.



John's answer gives more detail on what the specific elements in the regex mean.



Regular expression visualization





