Thursday, January 10, 2008

Binary XML solves the wrong problem

Jimmy Zhang hits the nail on the head. The real issue is object allocation.

The real culprit for all the object allocation, ultimately, is XML's variable-width-everything : element types, attributes & text. This results in a boatload of discrete malloc() operations and a whole lot of "pointer pointing" to link parents to children, nodes to siblings etc. when the graph structure is read into memory. Memory managers hate lots-and-lots of little objects.

I'm fine with this. XML's variable-width-everything is the key feature as far as I am concerned. Not a bug. The 3 biggies to keep in mind in my experience are:

1 - Zip the XML for transmission and zip it for storage too if you like. Modern compressions algorithms are so beastly good that the can do better than you could yourself if you hand-crafted your own "efficient" storage/transmission format. Besides, you have better things to be doing than writing compressors/de-compressors and cunningly devilishy-difficult-to-debug-and-maintain custom notations. (Note that http groks gzip. I frequently encounter developers who don't know that.)

2 - Bend over backwards to avoid repeated loading of the XML from scratch -with all the malloc operations it entails. Don't fall for the common[1] misconception that saving/transmitting a binary/marshaled/pickeled form will lead to fast re-loading. (These things end up calling malloc() too you know :-) Memory-based caches of "cooked" data structures are your friend.

3 - if you know for sure that every bit of every byte is precious bandwidth on the wire or on disk; and if you are a happy that this truly is the bottlekneck in your application, then perhaps XML is not right for you. But beware that CSV or JSON or any other format with unpredictable variabilities in "record" length will have the same malloc issue at the end of the day.

In a perfect world what would I do? I'd introduce a "profile" of XML 1.0 that allowed XML data to signal to XML parsers/processors key stats about the data such as maximum required #PCDATA node size, that sort of thing. It could be done with a PI or an attribute or a
element. In a webby way, it could be signalled out of band in an HTTP header.

Armed with that, a processor could pre-alloc a whole bunch of fixed-width blocks of RAM for nodes in one fell swoop. Apps doing read-only work with the XML would have the added benefit of not having to worry about in-memory mods to the tree : a key thorn in the side of APIs like the DOM. Just allocate a big slab of RAM and start pouring nodes into it as you need them.

That would, I think, address the real issue without throwing the oft-vaunted-and-thoroughly-justified benefits of XML out the window.

In a perfect world I would have the time to go gather real performance data and write up a conference paper with the results. I don't live in that ideal world unfortunately. If anybody fancies it, I'd be happy to collaborate by sharing experiences of what I have seen happen in real world XML applications that leads me to believe that this hypothesis has legs.

On related notes, how weird is it that we have not moved on from the DOM and SAX in terms of "standard" APIs for XML processing? I'd love to see a read-only DOM (lots of apps use DOM but only need read - not read/write access to the tree.) Knowing that the game is read-only would allow a DOM implementation to do a lot of interesting things from a performance perspective. It has been kwown for ages that a forward-only XPath is a very useful thing. Maybe it is being worked on. Maybe thes things exist and I'm just out of the loop a bit at the moment?

[1] I fell for it. To my embarrassment, I fell for it twice!

6 comments:

richmoore said...

Every tag in XML will occur at least twice (once for open and once for closing) and commonly will occur many times in a document. The same applies to attribute names. This means that by storing these in a table and simply storing the index in your data structure you should be able to get a significant win.

Ian Bicking said...

If you want to avoid malloc's, maybe you could load the document into memory entirely and then create a node structure that points to the document.

Of course you change your object model in the process which might be too much of a change. But it seems like it should be a fast way to do parsing if you are willing to commit to a fully in-memory document (not a streamed parsing where you might throw away pieces before you are done parsing).

And anyway, are allocations really that expensive? You probably know the total document length, and you can allocate a bunch based on that, then use a more efficient allocator to fill in that memory with individual pieces.

richmoore: creating a table of strings and pointing to them is almost exactly what gzip does.

richmoore said...

@ian
Yes, my point was that you should use a lookup table for your in memory data structure to avoid allocating an extra string for each tag.

M. David Peterson said...

     "Maybe it is being worked on. Maybe thes things exist and I'm just out of the loop a bit at the moment?"

Do you mean from a standards driven perspective or from a pure implementation perspective? Speaking purely from an implementation perspective, there's plenty of read-only and write-only implementations (e.g. System.Xml.XmlReader and System.Xml.XmlWriter on .NET for read only and write only respectively.)

     "Memory-based caches of "cooked" data structures are your friend."

Absolutely! There's nothing better than the combination of ETag's using file system attributes (e.g. last modified, file size, inode's on Unix, etc.) and in memory hashtables for storing and retrieving pre-cooked XmlReader's. For application performance, I live and die by this method > http://nuxleus.googlecode.com/svn/trunk/nuxleus/Source/Nuxleus.Web/XmlServiceOperation/XmlServiceOperationManager.cs

kll said...

@richmoore: gzip perfectly removes this redundancy. With todays CPUs, even on low-end machine, gzip decompression can be done on-the-fly while reading from disk, with no noticeable penalty.

I don't quite agree with the post. If you open XML as read-only (or allow for memory inefficiency) you can pre-allocate large chunks of memory and use them as a pool for strings (just advance pointer to end of allocated area, don't track individual fragments).

Martin Probst said...

If you know that you're doing forward only XPath you can omit some pointers from the DOM tree structure. Except for that, it doesn't really give you a speed bonus (assuming that you have a competent XPath implementation), because the implementation will choose a proper evaluation strategy for your forward only XPath expression.

At least in Java world, malloc() really isn't the problem. With a compacting garbage collector, malloc() boils down to increasing a pointer, and that's it. That's one reason why some Java implementations outperform the equivalent C++ libraries for XML stuff.

You can save a lot of work with a proper binary XML format, chosen to fit your particular implementation needs. But that's really difficult, and it's much more easy to shoot yourself into your foot and actually end up with something slower. Plus, it's going to be buggy.

I really don't buy that the XML parsing/processing is the major bottleneck in many applications, in particular if you avoid serialization/deserialization as much as possible, as you rightfully recommend. As soon as you do anything meaningful with the XML, parsing will stop being your bottleneck. Additionally, you can easily spread out the parsing of incoming messages over many machines, which is much harder for the rest of the "business logic" and the database.