Saturday, September 18, 2004

Of XML tree models and document loading times in Python and Java

Many moons ago, before SAX and DOM, I wrote a dual event/tree-based
library for SGML processing in Python called Pyxie. It has gone through a bunch of iterations: an XML version, about 3 main mutations that have popped up in various Propylon projects, and a Pyxie2 that works on both Jython and Python and came about as part of XPipe...

Whatever Pyxie variant I am using, I am in the habit of serializing Pyxie trees to disk along with XML files. I treat these serialized Pyxie trees as "pre-compiled" artefacts that I use solely as a performance trick that I can slot
into my apps without contorting the design.

At load time, I check for the existence of a Pyxie serialization. If
there is one and if the datestamps indicate that it is younger than
the corresponding XML file, then I load it. Otherwise, I load the XML
file directly. At save time, I always save out the Pyxie tree as well
as the native XML.
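
The load/save rule above can be sketched roughly as follows. This is only an illustration, not Pyxie itself: it assumes `xml.etree.ElementTree` as a stand-in tree model, `pickle` as the serialization format, and a made-up `.pickle` suffix for the "pre-compiled" file. The function names are hypothetical.

```python
import os
import pickle
import xml.etree.ElementTree as ET


def cache_path_for(xml_path):
    # Hypothetical naming convention: keep the serialized tree next to the XML.
    return xml_path + ".pickle"


def load_tree(xml_path):
    """Return the root element, preferring a serialized copy that is still fresh."""
    cache_path = cache_path_for(xml_path)
    # Use the serialized tree only if it exists and is younger than the XML.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) > os.path.getmtime(xml_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # No serialized tree, or the XML was edited after it was written: parse the XML.
    return ET.parse(xml_path).getroot()


def save_tree(root, xml_path):
    """Write the XML first, then the serialized tree, so the latter is the younger file."""
    ET.ElementTree(root).write(xml_path, encoding="utf-8")
    with open(cache_path_for(xml_path), "wb") as f:
        pickle.dump(root, f, protocol=pickle.HIGHEST_PROTOCOL)
```

Because the freshness check lives entirely inside `load_tree`, callers never need to know whether they got a parsed or a "pre-compiled" tree, which is what lets the trick slot in without touching the rest of the design.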

It works a treat, dramatically improving the performance of IO-bound applications without sacrificing any of the beefy goodness of the XML underneath. I can even modify XML files while apps are running, safe in the knowledge that at load time the XML version will be parsed and the stale "compiled" version thrown away.

Recently, I have had cause to look into similar trickery in Java. I
read the following about Xerces:
    "In addition, some rough measurements have shown that XML
    serialization performs better than Java object serialization, and that
    XML instance documents require less storage space than
    object-serialized DOMs."


Intrigued, I wrote some test applications to save/load serialized
Xerces DOMs. Unfortunately, my investigations have just run into a
road block. It appears that the Xerces serialization recurses on a
node by node basis.

Translated into English, it blows the stack after 1000 nodes or so.
So, I'm a bit stuck. In order to get decent timings to compare XML
load time with serialized DOM load time, I need big docs. If my doc
has any more than 1000 nodes (roughly speaking), Xerces cannot persist it.
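
To show why node-by-node recursion caps the document size, here is a small Python analogy. It is not the Xerces code: the `Node` class and both serializers are hypothetical, and Python's recursion limit stands in for the Java stack. The recursive walker spends one call frame per node, siblings included, so even a flat document with a few thousand children fails. The iterative walker keeps its pending nodes in an ordinary list and does not care.

```python
import sys


class Node:
    """Hypothetical minimal node: first child plus next sibling, DOM-style."""

    def __init__(self, name):
        self.name = name
        self.first_child = None
        self.next_sibling = None


def serialize_recursive(node, out):
    # One call frame per node: siblings are reached by recursion too.
    if node is None:
        return
    out.append(node.name)
    serialize_recursive(node.first_child, out)
    serialize_recursive(node.next_sibling, out)


def serialize_iterative(root):
    # Same pre-order walk, but pending nodes live in a heap-allocated list.
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        out.append(node.name)
        stack.append(node.next_sibling)  # popped after the child subtree
        stack.append(node.first_child)
    return out


def make_flat_doc(n_children):
    """Build a root element with n_children sibling elements under it."""
    root = Node("root")
    prev = None
    for i in range(n_children):
        child = Node("item%d" % i)
        if prev is None:
            root.first_child = child
        else:
            prev.next_sibling = child
        prev = child
    return root


if __name__ == "__main__":
    doc = make_flat_doc(sys.getrecursionlimit() * 3)
    try:
        serialize_recursive(doc, [])
    except RecursionError:
        print("recursive walk ran out of stack")
    print("iterative walk visited", len(serialize_iterative(doc)), "nodes")
```

If the Xerces serialization really does recurse this way, raising the JVM thread stack size would only move the ceiling. A serializer that walks the tree with an explicit stack would remove it.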

Perhaps I'm missing something?
