Python Regex for XML like data string [duplicate]

This question already has an answer here:

(The links provided in the "asked before" notice do not provide answers applicable to this question, so it has been incorrectly flagged.)

I have a string like so:

s = '"<w s=\'<v1>\'><m><v2>v3</m><v name=\'x\'><v4></v></w>"'

The single quotes around attribute s of node w have to be escaped for Python. I know 'normal' XML would escape chevrons with &lt; and &gt;, but I am dealing with a data conversion, so I have to deal with the data I am given.

I have escaped the single quotes like so \' just so I can try out some test cases.

What I am trying to achieve is to find the bits in chevrons that don't have a paired closing tag (</ >), i.e. I want v1, v2 and v4 (v3 isn't required, since it's not in chevrons; it's the inner text).

I have to embellish those by putting extra characters around them, leaving out the chevrons.

So I'd have $$v1 etc.

I need to leave the original XML parts (paired opening/closing tags) intact.

The final converted string should look like so:

'"<w s=\'$$v1\'><m>$$v2v3</m><v name=\'x\'>$$v4</v></w>"'

Python's XML ElementTree couldn't deal with the chevron in element attributes.

lxml, although better, would also give me issues.

Either way I'd end up with malformed XML or mismatched tags.

So I have resorted to trying a number of things, like using regexes or parsing the string. Things quickly descend into a mess.

One of the obstacles I am facing is that if the closing chevron is within a quoted string, I have to ignore it and go on to the next chevron to get to the end of the w XML node.
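A minimal sketch of that quote-skipping scan, assuming attribute values are delimited by single quotes only and that quotes are never nested or escaped inside the data. find_tag_end is a hypothetical helper name used for illustration, not part of any library:

```python
def find_tag_end(text, start):
    """Return the index of the '>' that closes the tag opened at text[start].

    Any '>' found inside a single-quoted attribute value is skipped.
    Returns -1 if no closing chevron is found.
    """
    in_quote = False
    for i in range(start + 1, len(text)):
        ch = text[i]
        if ch == "'":
            # Toggle on every single quote (assumes quotes are balanced).
            in_quote = not in_quote
        elif ch == ">" and not in_quote:
            return i
    return -1


sample = "<w s='<v1>'><m>"
end = find_tag_end(sample, 0)
print(sample[: end + 1])  # prints <w s='<v1>'>
```

The '>' that belongs to <v1> sits inside the quoted attribute, so the scan passes over it and stops at the chevron that really ends the w opening tag.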

Overall, I don't actually care if Regex is not used.

But I've been using regex101.com to try out some regexes:

This didn't seem to work:

<(?P<tag>\S)(.*?)>(.*?)<\/(P=tag)>

This seems better:

<\/?(?P<tag>\S)(.*?)>

But, the quoted chevron should be ignored, and the regex isn't quite greedy enough...

~~Is there a way to do this with Regex?~~

EDIT: I will continue to add my findings here, and I will post a complete custom parser solution for others following in similar footsteps. I have had some time to research the regexes further over the course of a weekend.

This Regex for example also gives left chevrons that aren't preceded by a single quote:

(Negative Lookbehind)

((?<!['])<)

This Regex gives right chevrons that aren't followed by single quote:

(Negative Lookahead)

(>(?![']))
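As a quick illustration of those two lookarounds in Python's re module, here is a small sketch. The function names left_chevrons and right_chevrons are made up for this example, and it assumes only single quotes mark a "quoted" chevron:

```python
import re

# '<' not preceded by a single quote (negative lookbehind)
LEFT = re.compile(r"(?<!['])<")
# '>' not followed by a single quote (negative lookahead)
RIGHT = re.compile(r">(?!['])")


def left_chevrons(text):
    """Positions of '<' that are not directly after a single quote."""
    return [m.start() for m in LEFT.finditer(text)]


def right_chevrons(text):
    """Positions of '>' that are not directly before a single quote."""
    return [m.start() for m in RIGHT.finditer(text)]


sample = "<w s='<v1>'><m>"
print(left_chevrons(sample))   # prints [0, 12]
print(right_chevrons(sample))  # prints [11, 14]
```

The chevrons belonging to the quoted <v1> (positions 6 and 9) are left out of both lists.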

Here is a pattern that can be applied recursively when creating a parser and accommodates the quoting rules. You can check it out here:

(?P<opentag>((?P<oc>(?<!['])<)(?P<tag>\S)\s*(?P<attr>.*?)(?P<cc>>(?![']))))(?P<inner>.*?)(?P<closetag>(?P=oc)\/(?P=tag)(?P=cc))

(?P<name>...) names a capture group, which you can later reference with the syntax (?P=name).

You have to escape the / symbol in the regex, so you get \/? where the ? makes it optional. (Python's re module accepts \/ but does not require the escape.)

\S means a non-whitespace character.

.*? means any character, zero or more times. The ? on the end prevents greedy expansion.
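To see what the named groups of the recursive pattern capture, here is a sketch that runs it against the sample string. It assumes the backslashes (\S, \s, \/) belong in the pattern as shown in this snippet; outer_pair is a hypothetical helper name:

```python
import re

# The pattern from above, split over three lines for readability.
TAG_PAIR = re.compile(
    r"(?P<opentag>((?P<oc>(?<!['])<)(?P<tag>\S)\s*(?P<attr>.*?)(?P<cc>>(?![']))))"
    r"(?P<inner>.*?)"
    r"(?P<closetag>(?P=oc)\/(?P=tag)(?P=cc))"
)


def outer_pair(text):
    """Return the named groups of the first paired tag, or None if no pair is found."""
    m = TAG_PAIR.search(text)
    return m.groupdict() if m else None


s = "\"<w s='<v1>'><m><v2>v3</m><v name='x'><v4></v></w>\""
parts = outer_pair(s)
print(parts["tag"])    # prints w
print(parts["attr"])   # prints s='<v1>'
print(parts["inner"])  # prints <m><v2>v3</m><v name='x'><v4></v>
```

A parser could then call outer_pair again on parts["inner"] to work its way inward. Note that \S matches a single character, so as written the pattern only pairs one-letter tag names.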

These are other types of scenario you have to deal with:

p2 = '<a><c></c></a>'
p3 = '<a><c></c></a>'
p4 = '<a name=\'<var>\'></a>'
p5 = '<a><var><var></a>'
p6 = '<a></var><var></a>'
p7 = '<a></a>'

For p6, it should leave it as <a></var>$$var</a>.

Here's a link to a Folder on Google Drive, with 4 other versions I've tried


Better use a real XML parser
– Gilles Quenot
Aug 23 at 11:23

@GillesQuenot As I've already mentioned ... They give malformed XML... So that's a no go and I am working on a data conversion with large quantities of this stuff. Plus sometimes string starts off with double quotes etc etc..
– JGFMK
Aug 23 at 11:24

Is it normal that in your example ` name='x'` was not transformed to ` name='$$x'`?
– Erwan
Aug 23 at 12:12

Yes. The $$var name represents the syntax of a variable substitution in the system I am converting to. In the original system variables substitution was done via <var name> - except there are scenarios where you have xml nodes that have to 'stay behind'. The matching sequenced pairs of opening/closing tags are the way to resolve that (i.e. p7 1st opening b tag should match first closing - not last etc)
– JGFMK
Aug 23 at 13:17

BeautifulSoup can parse malformed XML, but I'm afraid it would "fix" it for you. Still, maybe an option to consider.
– Aaron
Aug 23 at 13:35

2 Answers
Try the following regex:

<(?P<tag>[^<\s/>]+)>(?!.*</(?P=tag)>)
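A sketch of how this regex could be applied with re.sub to produce the $$ form, assuming the s inside the character class is meant to be \s (whitespace). The function name convert is made up for this example:

```python
import re

# A bare <name> (no attributes, no slash) with no </name> later on the same line.
UNPAIRED = re.compile(r"<(?P<tag>[^<\s/>]+)>(?!.*</(?P=tag)>)")


def convert(text):
    """Replace each unpaired <name> with $$name, leaving paired tags alone."""
    # '$' has no special meaning in a Python replacement template,
    # so '$$' is inserted literally; \g<tag> inserts the captured name.
    return UNPAIRED.sub(r"$$\g<tag>", text)


s = "\"<w s='<v1>'><m><v2>v3</m><v name='x'><v4></v></w>\""
print(convert(s))
# prints "<w s='$$v1'><m>$$v2v3</m><v name='x'>$$v4</v></w>"
print(convert("<a></var><var></a>"))
# prints <a></var>$$var</a>
```

This only checks whether a closing tag of the same name appears anywhere later in the string; it does not pair tags in sequence, so repeated or nested tags of the same name may not be handled the way the question requires. The . in the lookahead also does not cross newlines unless re.DOTALL is passed.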

To try the regex online and get an explanation, please click here.

Regex is not recommended for XML or "XML like" data:

Instead, consider any of the following alternative approaches:

As I have said, this is a data conversion. I can't suggest that the company I am doing this for change their existing data. It works in the tool they are using. I have to map it to something compatible on my end that uses a different syntactic sugar. The only thing I can think of at the moment is visually/manually fixing stuff on a one-by-one basis.
– JGFMK
Aug 23 at 13:58

I wrote this answer knowing that you were doing data conversion, and I didn't say to tell your client to change their data. My intent was to point out the futility of trying to parse an unspecified format. Either carefully specify the input data, or use the tools I list in How to parse invalid (bad / not well-formed) XML? as an initial clean-up pass, and then use XML tools to transform the data to the targeted output form.
– kjhughes
Aug 23 at 14:26

As @Aaron has already mentioned, there are XML parsers that attempt to fix up XML for things like mismatched tags. In my scenario that would only make matters worse! The link is useful, but I don't see how I can apply it to the problem short of writing my own custom parser, in which case it doesn't help me at all.
– JGFMK
Aug 23 at 15:50

This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review
– Neowizard
Aug 23 at 21:27

You're right that sometimes the correct answer is "there's no solution", but this isn't what you wrote. Your answer doesn't really answer the question. It just says that regex is not recommended, and then suggests 3 options, all of which are not explained.
– Neowizard
Aug 24 at 6:19
