Saturday, September 18, 2004

Of XML tree models and document loading times in Python and Java

Many moons ago, before SAX and DOM, I wrote a dual event/tree-based
library for SGML processing in Python called Pyxie. It has gone through a bunch of iterations: an XML version (http://pyxie.sourceforge.net), about three main mutations that have popped up in various Propylon projects, and a Pyxie2 that works on both Jython and Python and came about as part of XPipe...

Whatever Pyxie variant I am using, I am in the habit of serializing Pyxie trees to disk along with XML files. I treat these serialized Pyxie trees as "pre-compiled" artefacts that I use solely as a performance trick that I can slot
into my apps without contorting the design.

At load time, I check for the existence of a Pyxie serialization. If
there is one and if the datestamps indicate that it is younger than
the corresponding XML file, then I load it. Otherwise, I load the XML
file directly. At save time, I always save out the Pyxie tree as well
as the native XML.
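
For anyone who wants to try the same trick, here is a minimal sketch of the load/save logic described above. It is not Pyxie code: it assumes xml.etree.ElementTree as a stand-in for the Pyxie tree and pickle as a stand-in for the Pyxie serialization, and the names cache_path, load_tree and save_tree are made up for illustration.

```python
import os
import pickle
import xml.etree.ElementTree as ET


def cache_path(xml_path):
    # Hypothetical naming convention: the cache sits next to the XML file.
    return xml_path + ".pickle"


def load_tree(xml_path):
    """Load the cached tree if it is newer than the XML, else parse the XML."""
    cached = cache_path(xml_path)
    if (os.path.exists(cached)
            and os.path.getmtime(cached) > os.path.getmtime(xml_path)):
        with open(cached, "rb") as f:
            return pickle.load(f)
    # No cache, or the XML was modified after the cache was written.
    return ET.parse(xml_path).getroot()


def save_tree(root, xml_path):
    """Always write the native XML first, then the serialized tree."""
    ET.ElementTree(root).write(xml_path, encoding="utf-8",
                               xml_declaration=True)
    with open(cache_path(xml_path), "wb") as f:
        pickle.dump(root, f, protocol=pickle.HIGHEST_PROTOCOL)
```

Writing the XML before the cache keeps the cache from ending up older than the XML it was built from. The comparison is strict, so if both files carry the same timestamp the XML is simply parsed again, which is slower but never stale.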

It works a treat, dramatically improving the performance of IO-bound applications without sacrificing any of the beefy goodness of the XML underneath. I can even modify XML files in running apps, safe in the knowledge that at load time, the XML version will be parsed and the stale "compiled" version thrown away.

Recently, I have had cause to look into similar trickery in Java. I
read the following about Xerces:
    "In addition, some rough measurements have shown that XML
    serialization performs better than Java object serialization, and that
    XML instance documents require less storage space than
    object-serialized DOMs."

Eh?

Intrigued, I wrote some test applications to save/load serialized
Xerces DOMs. Unfortunately, my investigations have just run into a
roadblock. It appears that the Xerces serialization recurses on a
node-by-node basis.

Translated into English, it blows the stack after 1000 nodes or so.
So, I'm a bit stuck. In order to get decent timings to compare XML
load time with serialized DOM load time, I need big docs. If my doc
has any more than 1000 nodes (roughly speaking), Xerces cannot persist
it.
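
To show the kind of failure I mean, here is a toy sketch in Python rather than Java. It is not Xerces code, and I am only assuming that the serializer follows sibling links recursively, which would explain why the limit tracks node count rather than nesting depth. The Node class and both functions are hypothetical.

```python
class Node:
    """Hypothetical minimal tree node; not a Xerces or Pyxie class."""

    def __init__(self, name):
        self.name = name
        self.first_child = None
        self.next_sibling = None


def build_flat_doc(n):
    """Build a root element with n children chained via next_sibling."""
    root = Node("root")
    prev = None
    for _ in range(n):
        child = Node("item")
        if prev is None:
            root.first_child = child
        else:
            prev.next_sibling = child
        prev = child
    return root


def serialize_recursive(node, out):
    """Recursive walk: one stack frame for every link that is followed."""
    if node is None:
        return
    out.append(node.name)
    serialize_recursive(node.first_child, out)
    serialize_recursive(node.next_sibling, out)


def serialize_iterative(root):
    """Same document-order walk, with an explicit stack on the heap."""
    out = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        out.append(node.name)
        # Push the sibling first so that the child is visited first.
        stack.append(node.next_sibling)
        stack.append(node.first_child)
    return out
```

With a completely flat document of a couple of thousand children, serialize_recursive raises RecursionError under Python's default recursion limit, while serialize_iterative handles the same tree without trouble.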

Perhaps I'm missing something?




Getting organised for XML Open 2004

The XML Open Conference 2004 takes place next week in Cambridge, England. I'm doing a presentation on XML pipelining and also the closing keynote. I'm looking forward to it. I'm working on my keynote presentation this weekend.

My last keynote in Cambridge used a lot of visuals. Many pictures, few words, few bullet points. I'm adopting a similar style this time around.

Boy, does it take time to create a highly visual PowerPoint! And hey, let's hear it for http://images.google.com without which... etc.

Thursday, September 16, 2004

Hacking XML

Recently I contributed a couple of XML hacks to an O'Reilly book of the same name. One of these hacks - using SGML to help auto-tag XML - is covered in a new article by Mike Fitzgerald: http://www.xml.com/pub/a/2004/09/15/XMLHacks.html

Wednesday, September 15, 2004

RFID and the re-birth of M-Commerce

Join RFID technology with the concept of M-Commerce and something interesting happens.

Firefox and Thunderbird - get them. No questions. Just do it.

The Firefox/Thunderbird Browser/E-mail client combination is wonderful, keenly priced [:-)] and getting better with every release.
Just get them. Okay?

Monday, September 13, 2004

Python/XML talent required

Propylon are looking for Python/XML programmers. There are opportunities in both the Dublin and Sligo offices. Contact me if interested.