Deserializing XML from a TCP port

**NickFitz** · 11 June 2010, 03:04

Off the top of my head...

1. To identify boundaries between objects, you could look for <?xml version="1.0" encoding="utf-8"?>: the position immediately before its first character marks both the end of one object and the start of the next.
2. To identify incomplete object representations, split the input into chunks on that basis and process every chunk except the last. Then check whether the last chunk's opening tag has been closed: grab the first bit (<SomeObject xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">), use it to construct a string representing the appropriate end tag (</SomeObject>), and check whether that end tag is present before the end of the input. If not, wait for further input, append it to that chunk, and go back to step 1.
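
A rough sketch of steps 1 and 2, in Python purely for illustration (the thread is about .NET). The names `split_documents` and `looks_complete` are made up for this example, and it assumes every object starts with exactly the declaration shown above. Anything that arrives before the first declaration is dropped.

```python
import re

XML_DECL = '<?xml version="1.0" encoding="utf-8"?>'


def split_documents(buffer):
    """Step 1: split buffered text on the XML declaration.

    Returns (complete, pending). Every chunk except the last is treated as
    complete because another declaration followed it; the last chunk may
    still be waiting for more input.
    """
    starts = []
    pos = buffer.find(XML_DECL)
    while pos != -1:
        starts.append(pos)
        pos = buffer.find(XML_DECL, pos + len(XML_DECL))
    if not starts:
        # No declaration seen yet: keep everything and wait for more input.
        return [], buffer
    ends = starts[1:] + [len(buffer)]
    chunks = [buffer[a:b] for a, b in zip(starts, ends)]
    return chunks[:-1], chunks[-1]


def looks_complete(chunk):
    """Step 2 (naive): is the end tag for the first element present?"""
    match = re.search(r"<([A-Za-z_][\w.:\-]*)", chunk)
    if match is None:
        return False
    return f"</{match.group(1)}>" in chunk
```

The caller would hold on to `pending`, append whatever arrives next, and run both functions again.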

Of course a simple string match or RegExp test for step 2 will fail if there are any circumstances in which root objects of some type can contain objects of the same type at some level of nesting (e.g. <foo><bar><baz><foo></foo>[EOF] would fail).
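
To make that failure concrete, here is the example above run through the naive test (illustrative Python; the tag counting is itself only a crude check, since it ignores attributes and self-closing tags):

```python
chunk = "<foo><bar><baz><foo></foo>"  # input ends here; the outer <foo> is still open

# The step 2 test: is the root's end tag anywhere in the chunk?
naive_says_complete = "</foo>" in chunk

# Counting start and end tags for the root name exposes the mismatch.
opened = chunk.count("<foo>")
closed = chunk.count("</foo>")

print(naive_says_complete, opened, closed)  # True 2 1
```

The string match reports a complete object even though only the inner foo has been closed.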

There could be a more elegant solution to step 2 that involves piping what you have to a SAX parser, which will happily wait for the remaining chunk of the document to arrive on your input stream. Trying to do it with RegExp definitely falls into the "now you have two problems" category, as you'll have signed up to write an XML parser using RegExp, which isn't possible in the first place (XML being a Chomsky Type 2 (context-free) language, and regular expressions only being able to handle Chomsky Type 3 (regular) languages). That said, if your case is suitably constrained - and will remain so in the future - then you might be able to get away with RegExp.

In your place, I'd be looking at step 1 to kick things off and then hoping that a suitably well-behaved SAX parser would take care of step 2.
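
A minimal sketch of the SAX idea, using Python's standard `xml.sax` incremental interface rather than any .NET library. `RootTracker` and `feed_fragments` are hypothetical names, and the sketch assumes one parser per document (step 1 still has to find where each document starts, because the parser will reject a second declaration after the root closes).

```python
import xml.sax


class RootTracker(xml.sax.ContentHandler):
    """Tracks element depth so we can tell when the root element closes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.root_closed = False

    def startElement(self, name, attrs):
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        if self.depth == 0:
            self.root_closed = True


def feed_fragments(fragments):
    """Feed fragments as they 'arrive'; record after each one whether the
    root element has been closed yet."""
    handler = RootTracker()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    states = []
    for fragment in fragments:
        parser.feed(fragment)  # the parser keeps its state between calls
        states.append(handler.root_closed)
    if handler.root_closed:
        parser.close()  # close() would raise on an unfinished document
    return states
```

Because the handler counts depth instead of matching tag names, a nested element with the same name as the root does not fool it. One caveat: recent Expat releases can postpone work on a token that stays incomplete across several feeds, so callbacks may arrive a little later than the bytes that complete them.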

**ASB** · 11 June 2010, 08:04

Thanks Nick,

I think I had pretty much come to the same conclusion. I am in the position where I can parse it using the sort of methods you describe. However, it is inherently a bit fragile and is constrained by the actual data. In any event, it's probably the best I can do.

Cheers.

**ASB** · 12 June 2010, 13:49

And the answer is...

They can send it with a packet prefix and suffix, so the parsing code is simple (but it does still rely on the prefix and suffix not appearing in any of the data).
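
For illustration only, a sketch of what such framing code might look like in Python. The thread does not say which prefix and suffix were chosen; the single-byte STX/ETX markers and the name `extract_frames` are assumptions made for this example.

```python
PREFIX = b"\x02"  # hypothetical marker (ASCII STX); not stated in the thread
SUFFIX = b"\x03"  # hypothetical marker (ASCII ETX); not stated in the thread


def extract_frames(buffer):
    """Return (payloads, remainder) for the bytes received so far.

    Each payload is the data between one PREFIX and the next SUFFIX.
    The remainder is an unfinished frame to keep until more bytes arrive.
    Bytes outside any frame are discarded.
    """
    payloads = []
    while True:
        start = buffer.find(PREFIX)
        if start == -1:
            return payloads, b""
        end = buffer.find(SUFFIX, start + len(PREFIX))
        if end == -1:
            return payloads, buffer[start:]
        payloads.append(buffer[start + len(PREFIX):end])
        buffer = buffer[end + len(SUFFIX):]
```

Each payload can then be handed to an ordinary XML deserializer. As noted, this only works while neither marker can occur inside the XML itself.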

**NickFitz** · 13 June 2010, 02:23

Originally posted by ASB:

They can send it with a packet prefix and suffix, so the parsing code is simple (but it does still rely on the prefix and suffix not appearing in any of the data).

An algorithm for generating multipart boundaries for MIME ought to take care of that. This chap suggests using an MD5 hash of the timestamp of the message included in some boilerplate text, which sounds like a viable approach.
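
A hedged sketch of that idea in Python. The boilerplate text `----=_Boundary_`, the function names, and the retry counter are all invented for this example; the linked suggestion may differ in its details.

```python
import hashlib
import time


def make_boundary(seed):
    """Wrap an MD5 hex digest of the seed in some boilerplate text."""
    digest = hashlib.md5(seed.encode("utf-8")).hexdigest()
    return f"----=_Boundary_{digest}"


def choose_boundary(payload, timestamp=None):
    """Pick a boundary derived from the message timestamp that does not
    occur in the payload, retrying with a counter if it ever does."""
    if timestamp is None:
        timestamp = time.time()
    attempt = 0
    while True:
        boundary = make_boundary(f"{timestamp!r}:{attempt}")
        if boundary not in payload:
            return boundary
        attempt += 1
```

The explicit check against the payload is what removes the reliance on the marker never appearing in the data; the hash only makes a collision unlikely in the first place.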

**VectraMan** · 13 June 2010, 14:01

I implemented something just like this (but not .NET) with a SAX parser. As Nick says, the parser just waits for the input. I can't remember the name of the open source C++ library I used, but there seems to be a sax .NET on SourceForge which presumably works much the same.

But if you're determined to parse it: XML is pretty simple. Just looking for <'s, </'s and >'s ought to do it.

**NickFitz** · 13 June 2010, 16:03

Originally posted by VectraMan:

But if you're determined to parse it: XML is pretty simple. Just looking for <'s, </'s and >'s ought to do it.

Code:

<element attribute="hello>/>>">
    <![CDATA[
        The expression "1 < 2" is true.
    ]]>
</element>
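
To spell out what the sample demonstrates, here is an illustrative Python comparison of a naive angle-bracket scan with a real parser (the standard library's `xml.etree.ElementTree`; the regular expression is just one plausible naive approach, not anything proposed in the thread):

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = '''<element attribute="hello>/>>">
    <![CDATA[
        The expression "1 < 2" is true.
    ]]>
</element>'''

# Naive approach: treat everything from a "<" to the next ">" as a tag.
naive_tags = re.findall(r"<[^>]*>", SAMPLE)

# A real parser knows about quoted attribute values and CDATA sections.
root = ET.fromstring(SAMPLE)

print(naive_tags[0])          # <element attribute="hello>
print(root.get("attribute"))  # hello>/>>
```

The naive scan cuts the start tag off at the first ">" inside the attribute value, while the parser returns the whole attribute and the CDATA text intact.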

**VectraMan** · 13 June 2010, 20:00

Originally posted by NickFitz:

Code:

<element attribute="hello>/>>">
    <![CDATA[
        The expression "1 < 2" is true.
    ]]>
</element>

I didn't add "excluding anything within quotes" because I felt that was so obvious as to be insulting the intelligence of the reader.

Are <'s and >'s allowed inside quotes, or is it just CDATA and attributes? I thought that all <'s and >'s were escaped to &lt; and &gt; in XML.

**NickFitz** · 13 June 2010, 21:31

Originally posted by VectraMan:

Are <'s and >'s allowed inside quotes, or is it just CDATA and attributes? I thought that all <'s and >'s were escaped to &lt; and &gt; in XML.

The only characters that have to be escaped (other than within a comment, a processing instruction, a CDATA section, or an internal entity declaration) are "<" and "&". ">" only has to be escaped when used in the string "]]>" other than for the purpose of marking the end of a CDATA section. Most tools escape ">" anyway though.

XML 1.0 section 2.4 Character Data and Markup
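
These rules can be checked against a parser. A small illustrative Python sketch using the standard library (`is_well_formed` is a helper made up for this example):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(text):
    """True if the standard library parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a>1 > 0</a>"))  # True: a literal ">" is fine in content
print(is_well_formed("<a>1 < 2</a>"))  # False: "<" must be escaped
print(is_well_formed("<a>a & b</a>"))  # False: "&" must be escaped

# Like most tools, escape() also escapes ">" even though it rarely has to.
print(escape("1 < 2 & 3 > 2"))  # 1 &lt; 2 &amp; 3 &gt; 2
```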
