Sean McGrath: 01/06/2008

Thursday, January 10, 2008

Binary XML solves the wrong problem

Jimmy Zhang hits the nail on the head. The real issue is object allocation.

The real culprit for all the object allocation, ultimately, is XML's variable-width-everything : element types, attributes & text. This results in a boatload of discrete malloc() operations and a whole lot of "pointer pointing" to link parents to children, nodes to siblings etc. when the graph structure is read into memory. Memory managers hate lots-and-lots of little objects.

I'm fine with this. XML's variable-width-everything is the key feature as far as I am concerned. Not a bug. The 3 biggies to keep in mind in my experience are:

1 - Zip the XML for transmission and zip it for storage too if you like. Modern compression algorithms are so beastly good that they can do better than you could yourself if you hand-crafted your own "efficient" storage/transmission format. Besides, you have better things to be doing than writing compressors/de-compressors and cunningly devilishly-difficult-to-debug-and-maintain custom notations. (Note that HTTP groks gzip. I frequently encounter developers who don't know that.)

2 - Bend over backwards to avoid repeated loading of the XML from scratch, with all the malloc operations it entails. Don't fall for the common[1] misconception that saving/transmitting a binary/marshaled/pickled form will lead to fast re-loading. (These things end up calling malloc() too you know :-) Memory-based caches of "cooked" data structures are your friend.

3 - If you know for sure that every bit of every byte is precious bandwidth on the wire or on disk, and if you are happy that this truly is the bottleneck in your application, then perhaps XML is not right for you. But beware that CSV or JSON or any other format with unpredictable variabilities in "record" length will have the same malloc issue at the end of the day.
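
To make points 1 and 2 a bit more concrete, here is a minimal Python sketch. It gzips a repetitive XML document and then keeps the parsed ("cooked") tree in an in-memory cache so the parse only happens once. The helper names `pack_xml` and `load_cooked` and the sample document are made up for illustration, and the cache size is arbitrary.

```python
import gzip
import xml.etree.ElementTree as ET
from functools import lru_cache


def pack_xml(xml_text: str) -> bytes:
    """Compress XML text with gzip for storage or transmission."""
    return gzip.compress(xml_text.encode("utf-8"))


@lru_cache(maxsize=32)
def load_cooked(blob: bytes) -> ET.Element:
    """Decompress and parse once; later calls with the same bytes reuse the tree."""
    return ET.fromstring(gzip.decompress(blob))


# A repetitive sample document (hypothetical data, built just for this sketch).
doc = "<orders>" + "".join(
    f'<order id="{i}"><status>shipped</status></order>' for i in range(1000)
) + "</orders>"

blob = pack_xml(doc)
print(len(doc.encode("utf-8")), "bytes raw,", len(blob), "bytes gzipped")

first = load_cooked(blob)
second = load_cooked(blob)  # served from the cache: no second parse
```

The cached tree is shared between callers, so this only works if everybody treats it as read-only.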

In a perfect world what would I do? I'd introduce a "profile" of XML 1.0 that allowed XML data to signal to XML parsers/processors key stats about the data such as maximum required #PCDATA node size, that sort of thing. It could be done with a PI or an attribute or an element. In a webby way, it could be signalled out of band in an HTTP header.

Armed with that, a processor could pre-alloc a whole bunch of fixed-width blocks of RAM for nodes in one fell swoop. Apps doing read-only work with the XML would have the added benefit of not having to worry about in-memory mods to the tree : a key thorn in the side of APIs like the DOM. Just allocate a big slab of RAM and start pouring nodes into it as you need them.
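
Here is a rough Python sketch of the "big slab" idea, assuming the parser has been handed two stats up front: a maximum node count and a maximum total text size. The `NodeArena` class, the `hints` dictionary and its keys are all hypothetical; no such XML profile exists. Python hides the real allocator, so this only shows the layout: every node is a fixed-width record in preallocated arrays, and parent/child/sibling links are integer indexes rather than pointers. In C the same layout would be a handful of up-front malloc() calls.

```python
from array import array


class NodeArena:
    """Fixed-width node records stored in preallocated parallel arrays."""

    NONE = -1  # marks "no node" in the link columns

    def __init__(self, max_nodes: int, max_text_bytes: int):
        # One allocation per column, sized up front from the (hypothetical) hints.
        def column(fill):
            return array("i", [fill]) * max_nodes

        self.tag = column(self.NONE)
        self.parent = column(self.NONE)
        self.first_child = column(self.NONE)
        self.last_child = column(self.NONE)
        self.next_sibling = column(self.NONE)
        self.text_start = column(0)
        self.text_len = column(0)
        self.text = bytearray(max_text_bytes)  # one slab for all character data
        self.tag_names = []  # interned tag names; the tag column stores indexes
        self.tag_ids = {}
        self.count = 0
        self.text_used = 0

    def add(self, tag, parent=NONE, text=""):
        """Pour one node into the next free slot and link it under its parent."""
        data = text.encode("utf-8")
        if self.count >= len(self.tag) or self.text_used + len(data) > len(self.text):
            raise MemoryError("document exceeded the stats it advertised")
        tid = self.tag_ids.get(tag)
        if tid is None:
            tid = len(self.tag_names)
            self.tag_ids[tag] = tid
            self.tag_names.append(tag)
        i = self.count
        self.count += 1
        self.tag[i] = tid
        self.parent[i] = parent
        self.text_start[i] = self.text_used
        self.text_len[i] = len(data)
        self.text[self.text_used:self.text_used + len(data)] = data
        self.text_used += len(data)
        if parent != self.NONE:
            last = self.last_child[parent]
            if last == self.NONE:
                self.first_child[parent] = i
            else:
                self.next_sibling[last] = i
            self.last_child[parent] = i
        return i

    def children(self, i):
        """Walk the sibling chain of node i's children."""
        child = self.first_child[i]
        while child != self.NONE:
            yield child
            child = self.next_sibling[child]

    def text_of(self, i):
        start = self.text_start[i]
        return self.text[start:start + self.text_len[i]].decode("utf-8")


hints = {"max_nodes": 4, "max_text_bytes": 64}  # hypothetical out-of-band stats
arena = NodeArena(**hints)
root = arena.add("order")
arena.add("item", root, "widget")
arena.add("item", root, "gadget")
print([arena.text_of(c) for c in arena.children(root)])
```

Because nothing is ever moved or freed individually, a read-only app can throw the whole arena away in one go when it is finished with the document.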

That would, I think, address the real issue without throwing the oft-vaunted-and-thoroughly-justified benefits of XML out the window.

In a perfect world I would have the time to go gather real performance data and write up a conference paper with the results. I don't live in that ideal world unfortunately. If anybody fancies it, I'd be happy to collaborate by sharing experiences of what I have seen happen in real world XML applications that leads me to believe that this hypothesis has legs.

On related notes, how weird is it that we have not moved on from the DOM and SAX in terms of "standard" APIs for XML processing? I'd love to see a read-only DOM (lots of apps use DOM but only need read, not read/write, access to the tree.) Knowing that the game is read-only would allow a DOM implementation to do a lot of interesting things from a performance perspective. It has been known for ages that a forward-only XPath is a very useful thing. Maybe it is being worked on. Maybe these things exist and I'm just out of the loop a bit at the moment?
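
For a flavour of what forward-only matching looks like, here is a small Python sketch built on the standard library's `xml.etree.ElementTree.iterparse`. It is nowhere near real XPath: the hypothetical `forward_match` helper only understands a plain chain of child steps starting at the root, such as `orders/order/status`. It does make a single pass over the input and clears each element once it has been seen.

```python
import io
import xml.etree.ElementTree as ET


def forward_match(source, path):
    """Yield the text of elements whose chain of tags from the root equals `path`."""
    want = path.split("/")
    stack = []  # tags of the currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == want:
                yield elem.text
            stack.pop()
            elem.clear()  # discard the finished element's text, attributes and children


# Hypothetical sample input for the sketch.
sample = (
    b"<orders>"
    b"<order><status>shipped</status></order>"
    b"<order><status>pending</status></order>"
    b"</orders>"
)
matches = list(forward_match(io.BytesIO(sample), "orders/order/status"))
print(matches)
```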

[1] I fell for it. To my embarrassment, I fell for it twice!

Sean McGrath
