library for SGML processing in Python called Pyxie. It has gone through a bunch of iterations: an XML version (http://pyxie.sourceforge.net), about three main mutations that have popped up in various Propylon projects, and a Pyxie2 that works on both Jython and Python and came about as part of XPipe...
Whatever Pyxie variant I am using, I am in the habit of serializing Pyxie trees to disk alongside the XML files. I treat these serialized Pyxie trees as "pre-compiled" artefacts: purely a performance trick that I can slot
into my apps without contorting the design.
At load time, I check for the existence of a Pyxie serialization. If
there is one and if the datestamps indicate that it is newer than
the corresponding XML file, then I load it. Otherwise, I load the XML
file directly. At save time, I always save out the Pyxie tree as well
as the native XML.
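
The load/save rule above can be sketched in a few lines of Python. This is only an illustration, not Pyxie code: it assumes the standard library's xml.etree.ElementTree stands in for a Pyxie tree and pickle for the serialization, and the helper names (cache_path, load_tree, save_tree) and the ".pickle" suffix are made up for the example.

```python
import os
import pickle
import xml.etree.ElementTree as ET


def cache_path(xml_path):
    # Hypothetical naming convention: keep the serialized tree next to the XML.
    return xml_path + ".pickle"


def load_tree(xml_path):
    """Load the cached tree if it is newer than the XML, else parse the XML."""
    cached = cache_path(xml_path)
    if (os.path.exists(cached)
            and os.path.getmtime(cached) > os.path.getmtime(xml_path)):
        with open(cached, "rb") as f:
            return pickle.load(f)
    # No cache, or the XML was edited after the cache was written:
    # the stale cache is simply ignored.
    return ET.parse(xml_path).getroot()


def save_tree(root, xml_path):
    """Always write the native XML first, then the serialized tree."""
    ET.ElementTree(root).write(xml_path, encoding="utf-8",
                               xml_declaration=True)
    with open(cache_path(xml_path), "wb") as f:
        pickle.dump(root, f)
```

Writing the XML before the cache means the cache normally ends up with the later timestamp. If the two timestamps come out equal on a coarse-grained filesystem, the strict comparison just falls back to parsing the XML, which is slower but still correct.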
It works a treat, dramatically improving the performance of IO-bound applications without sacrificing any of the beefy goodness of the XML underneath. I can even modify XML files in running apps, safe in the knowledge that at load time, the XML version will be parsed and the stale "compiled" version thrown away.
Recently, I have had cause to look into similar trickery in Java. I
read the following about Xerces:
- "In addition, some rough measurements have shown that XML
serialization performs better than Java object serialization, and that
XML instance documents require less storage space than
object-serialized DOMs."
Eh?
Intrigued, I wrote some test applications to save/load serialized
Xerces DOMs. Unfortunately, my investigations have just run into a
road block. It appears that the Xerces serialization recurses on a
node by node basis.
Translated into English, it blows the stack after 1000 nodes or so.
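
To show the general failure mode (this is a Python analogy, not Xerces code), here is a minimal sketch in which a serializer that makes one call per node runs out of stack on a long enough chain of nodes, while a version with an explicit stack does not. The Node class and the function names are hypothetical. In Java the equivalent symptom would be a StackOverflowError rather than Python's RecursionError.

```python
import sys


class Node:
    """Hypothetical minimal tree node, used only for this illustration."""

    def __init__(self, name):
        self.name = name
        self.children = []


def serialize_recursive(node, out):
    # One stack frame per level of nesting: deep chains exhaust the stack.
    out.append("<%s>" % node.name)
    for child in node.children:
        serialize_recursive(child, out)
    out.append("</%s>" % node.name)


def serialize_iterative(root):
    # Same output, but pending work lives in a list on the heap.
    out = []
    stack = [(root, False)]
    while stack:
        node, closing = stack.pop()
        if closing:
            out.append("</%s>" % node.name)
            continue
        out.append("<%s>" % node.name)
        stack.append((node, True))
        for child in reversed(node.children):
            stack.append((child, False))
    return "".join(out)


def deep_chain(depth):
    # Build a chain in which each node has exactly one child.
    root = Node("n")
    node = root
    for _ in range(depth - 1):
        child = Node("n")
        node.children.append(child)
        node = child
    return root


if __name__ == "__main__":
    tree = deep_chain(sys.getrecursionlimit() * 5)
    try:
        serialize_recursive(tree, [])
    except RecursionError:
        print("recursive serializer blew the stack")
    print(len(serialize_iterative(tree)), "characters written iteratively")
```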
So, I'm a bit stuck. In order to get decent timings to compare XML
load time with serialized DOM load time, I need big docs. If my doc
has any more than 1000 nodes (roughly speaking) Xerces cannot persist
it.
Perhaps I'm missing something?
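
For what it's worth, the kind of comparison I am after looks roughly like this in Python. It is a sketch under stated assumptions: xml.etree.ElementTree and pickle stand in for the DOM and the object serialization, the document is synthetic and deliberately flat, and make_flat_doc and time_loads are names invented for the example. It says nothing about what Xerces itself would show.

```python
import pickle
import time
import xml.etree.ElementTree as ET


def make_flat_doc(n):
    # Synthetic document: one root with n children, so nesting stays shallow.
    items = "".join("<item id='%d'>text %d</item>" % (i, i) for i in range(n))
    return "<root>" + items + "</root>"


def time_loads(n, repeats=5):
    """Return average parse time, average unpickle time, and both sizes."""
    xml_text = make_flat_doc(n)
    blob = pickle.dumps(ET.fromstring(xml_text),
                        protocol=pickle.HIGHEST_PROTOCOL)

    start = time.perf_counter()
    for _ in range(repeats):
        ET.fromstring(xml_text)
    parse_seconds = (time.perf_counter() - start) / repeats

    start = time.perf_counter()
    for _ in range(repeats):
        pickle.loads(blob)
    unpickle_seconds = (time.perf_counter() - start) / repeats

    return parse_seconds, unpickle_seconds, len(xml_text.encode("utf-8")), len(blob)


if __name__ == "__main__":
    for n in (1000, 10000, 100000):
        parse_s, unpickle_s, xml_bytes, blob_bytes = time_loads(n)
        print(n, "nodes: parse %.4fs, unpickle %.4fs, xml %d bytes, pickle %d bytes"
              % (parse_s, unpickle_s, xml_bytes, blob_bytes))
```

Reporting the sizes alongside the timings lets the storage half of the quoted claim be checked at the same time as the speed half.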