Previously: What is a document? - Part 5.)
By the late Nineties, I
was knee deep in the world of XML and the world of Python, loving the
way that these two amazing tools allowed tremendous amounts of
automation to be brought to traditionally labor intensive document
processing/publishing tasks. This was boom time in electronic publishing and every new year brought with it a new output format to target: Microsoft Multimedia Viewer, Windows Help, Folio Views, Lotus Notes and a whole host of proprietary formats we worked on for clients. Back then, HTML was just another output for us to target. Little did we know back then that it would eclipse all the others.
Just about twenty years
ago now - in the fall of 1998 - , I co-presented a tutorial on XML at the International
Python Conference in Houston, Texas. [1]. At that same conference, I
presented a paper on high volume XML processing with Python [2]. Back
in those days, we had some of the biggest corpora of XML anywhere in
the world, here in Ireland. Up to the early/mid oozies, I did a lot of conference
presentations and become associated with the concept of XML
processing pipelines[3].
Then a very interesting
thing happened. We began to find ourselves working more and more in
environments where domain experts –not data taggers or software
developers – needed to create and update XML documents. Around this
time I was also writing books on markup languages for Prentice
Hall[4] and had the opportunity to put “the shoe on the other foot”
so-to-speak, and see things from an authors perspective.
It was then that I
experienced what I now consider to be a profound truth of the vast
majority of documents in the world - something that gets to the heart of what a document actually is which distinguishes it from other forms of digital information. Namely, that documents are typically
very “structured” when they are finished but are highly
unstructured when then are being created or in the midst of update
cycles.
I increasingly found
myself frustrated with XML authoring tools that would force me to
work on my document contents in a certain order and beep at me unless my documents were
“structured” at all times. I confess there were
many times when I abandoned structured editors for my own author/edit
work with XML and worked in the free-flowing world of the Emacs text editor or
in word processors with the tags plainly visible as raw text.
I began to appreciate that the ability to easily
create/update content is a requirement that must be met if the value
propositions of structured documents are to be realized, in most
cases. There is little value in a beautifully structured, immensely
powerful back-end system for processing terabytes of documents coming
in from domain experts unless said domain experts are happy to work
with the author/edit tools.
For a while, I believed
it was possible to get to something that authors would like, by customizing
the XML editing front-ends. However, I found that over and over
again, two things started happening, often in parallel. Firstly, the
document schemas became less and less structured so as to accommodate
the variations in real-world documents and also to avoid “beeping”
at authors where possible. Secondly, no amount of GUI customization seemed to be enough for the authors to feel comfortable with the XML
editors.
“Why can't it work
like Word?” was a phrase that began to pop up more and more in conversations with authors. For
quite some time, while Word's file format was not XML-based, I would
look for alternatives that would be Word-like in terms of the
end-user experience, but with file formats I could process with custom
code on the back end.
For quite a few years,
StarOffice/OpenOffice/LibreOffice fitted the bill and we have had a
lot of success with it. Moreover, it allowed for levels of
customization and degrees of business-rule validation that XML
schema-based approaches cannot touch. We learned may techniques and
tricks over the years to guide authors in the creation of structured
content without being obtrusive and interrupting their authoring flow. In
particular, we learned to think about document validation as a
function that the authors themselves have control over. They get to
decide when their content should be checked for structural and
business rules – not the software.
Fast forward to today.
Sun Microsystems is no more. OpenOffice/LibreOffice do not appear to
be gaining the traction in the enterprise that I suspected they would
a decade ago. Googles office suite – ditto. Native, browser based
document editing (such as W3C's Amaya [5])
does not appear to be getting traction either....
All
the while, the familiar domain expert/author's mantra rings in my ears
“Why can't it work like Word?”
As
of 2018, this is a more interesting question than it has ever been in
my opinion. That is where we will turn next.