Friday, February 09, 2018

What is a document? - part 6



By the late Nineties, I was knee-deep in the world of XML and the world of Python, loving the way that these two amazing tools allowed tremendous amounts of automation to be brought to traditionally labor-intensive document processing/publishing tasks. This was boom time in electronic publishing and every new year brought with it a new output format to target: Microsoft Multimedia Viewer, Windows Help, Folio Views, Lotus Notes and a whole host of proprietary formats we worked on for clients. Back then, HTML was just another output for us to target. Little did we know that it would eclipse all the others.

Just about twenty years ago now, in the fall of 1998, I co-presented a tutorial on XML at the International Python Conference in Houston, Texas [1]. At that same conference, I presented a paper on high volume XML processing with Python [2]. Back in those days, we had some of the biggest corpora of XML anywhere in the world, here in Ireland. Up to the early/mid 2000s, I did a lot of conference presentations and became associated with the concept of XML processing pipelines [3].

Then a very interesting thing happened. We began to find ourselves working more and more in environments where domain experts – not data taggers or software developers – needed to create and update XML documents. Around this time I was also writing books on markup languages for Prentice Hall [4] and had the opportunity to put “the shoe on the other foot”, so to speak, and see things from an author's perspective.

It was then that I experienced what I now consider to be a profound truth about the vast majority of documents in the world, something that gets to the heart of what a document actually is and what distinguishes it from other forms of digital information. Namely, that documents are typically very “structured” when they are finished but are highly unstructured when they are being created or in the midst of update cycles.

I increasingly found myself frustrated with XML authoring tools that would force me to work on my document contents in a certain order and beep at me unless my documents were “structured” at all times. I confess there were many times when I abandoned structured editors for my own author/edit work with XML and worked in the free-flowing world of the Emacs text editor or in word processors with the tags plainly visible as raw text.

I began to appreciate that, in most cases, the ability to easily create/update content is a requirement that must be met if the value propositions of structured documents are to be realized. There is little value in a beautifully structured, immensely powerful back-end system for processing terabytes of documents coming in from domain experts unless said domain experts are happy to work with the author/edit tools.

For a while, I believed it was possible to get to something that authors would like, by customizing the XML editing front-ends. However, I found that over and over again, two things started happening, often in parallel. Firstly, the document schemas became less and less structured so as to accommodate the variations in real-world documents and also to avoid “beeping” at authors where possible. Secondly, no amount of GUI customization seemed to be enough for the authors to feel comfortable with the XML editors.

“Why can't it work like Word?” was a phrase that began to pop up more and more in conversations with authors. For quite some time, while Word's file format was not XML-based, I would look for alternatives that would be Word-like in terms of the end-user experience, but with file formats I could process with custom code on the back end.

For quite a few years, StarOffice/OpenOffice/LibreOffice fitted the bill and we had a lot of success with it. Moreover, it allowed for levels of customization and degrees of business-rule validation that XML schema-based approaches cannot touch. We learned many techniques and tricks over the years to guide authors in the creation of structured content without being obtrusive and interrupting their authoring flow. In particular, we learned to think about document validation as a function that the authors themselves have control over. They get to decide when their content should be checked for structural and business rules – not the software.
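To make the idea of author-controlled validation a little more concrete, here is a minimal Python sketch. It is purely illustrative and not the tooling we actually built: the `Draft` and `Paragraph` classes, the style names ("Title", "Heading", "Body") and the three rules are all hypothetical. The point it tries to show is simply that editing never triggers a check, and that rules run only when the author asks for them.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    style: str  # hypothetical style names: "Title", "Heading", "Body"
    text: str


@dataclass
class Draft:
    paragraphs: list = field(default_factory=list)

    def add(self, style, text):
        # Editing never validates: any order and any intermediate state is allowed.
        self.paragraphs.append(Paragraph(style, text))

    def check(self):
        """Run the (hypothetical) structural/business rules on request only."""
        problems = []
        styles = [p.style for p in self.paragraphs]
        if styles.count("Title") != 1:
            problems.append("Expected exactly one Title paragraph.")
        for i, p in enumerate(self.paragraphs):
            has_next = i + 1 < len(self.paragraphs)
            next_style = self.paragraphs[i + 1].style if has_next else None
            if p.style == "Heading" and next_style != "Body":
                problems.append(f"Heading '{p.text}' has no body text yet.")
            if p.style == "Body" and not p.text.strip():
                problems.append(f"Paragraph {i + 1} is empty.")
        return problems


draft = Draft()
draft.add("Heading", "Scope")           # heading before the title: no beep while drafting
draft.add("Title", "Supply Agreement")
report = draft.check()                  # the author decides when to validate
```

Here `report` is a plain list of messages that the author can read and act on, or ignore until later; nothing stops them from continuing to write in the meantime.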

Fast forward to today. Sun Microsystems is no more. OpenOffice/LibreOffice do not appear to be gaining the traction in the enterprise that I suspected they would a decade ago. Google's office suite – ditto. Native, browser-based document editing (such as W3C's Amaya [5]) does not appear to be getting traction either...

All the while, the familiar domain expert/author's mantra rings in my ears: “Why can't it work like Word?”

As of 2018, this is, in my opinion, a more interesting question than it has ever been. That is where we will turn next.