Sean McGrath: What is a document?

Friday, January 26, 2018

What is a document? - Part 5

Previously: What is a document? - part 4.

In the early Nineties, I found myself tasked with the development of a digital guide to third level education in Ireland. The digital product was to be an add-on to a book-based product, created in conjunction with the author of the book. The organization of the book was very regular. Each third level course had a set of attributes such as entry level qualifications, duration, accrediting institution, physical location of the campus, fees and so on. All neatly laid out, one page per course, with some free-flowing narrative at the bottom of each page. The goal of the digital product was to allow prospective students to search based on different criteria such as cost ranges, course duration and so on.

Step number one was getting the information from the paper book into a computer and it is in this innocuous sounding step that things got very interesting. The most obvious approach - it seemed to me at the time - was to create a programmable database – in something like Clipper (a database programming language that was very popular with PC developers at the time). Tabular databases were perfect for 90% of the data – the “structured” parts such as dates, numbers, short strings of text. However, the tabular databases had no good way of dealing with the free-flowing narrative text that accompanied each course in the book. It had paragraphs, bulleted lists, bold/italics and underline...

An alternative approach would be to start with a word-processor – as opposed to a database – as it would make handling the free-flowing text (and associated formatting, bold/italic, bulleted lists etc.) easy. However, the word processor approach did not make it at all easy to process the “structured” parts in the way I wanted to (in many cases, the word processors of the day stored information in encrypted formats too).

My target output was a free viewer that came with Windows 3.1 known as Windows Help. If I could make the content programmable, I reasoned, I could automatically generate all sorts of different views of the data as Windows Help files and ship the floppy disk without needing to write my own viewer. (I know this sounds bizarre now but remember this work predated the concept of a generic web browser by a few years!)

I felt I was facing a major fork in the road in the project. By going with a database, some things were going to be very easy but some very awkward. By going with a document instead...same thing. Some things easy, some very awkward. I trawled around in my head for something that might have the attributes of a database AND of a document at the same time.

As luck would have it, I had a Byte Magazine from 1992 on a shelf. It had an article by Jon Udell that talked about SGML - Standard Generalized Markup Language. It triggered memories of a brief encounter I had had with SGML back in Trinity College when Dr. David Abrahamson had referenced it in his compiler design course, back in 1986. Back then, SGML was not yet an ISO standard (it became one in 1987). I remember in those days hearing about “tagging” and how an SGML parser could enforce structure – any structure you liked – on text – in much the same way that programming language parsers enforced structure on, say, Pascal source code.

I remember thinking “surely if SGML can deal with the hierarchical structures like you typically find in programming languages, it can deal with the simpler, flatter structures you get in tabular databases?”. If it could, I reasoned, then surely I could get the best of both worlds: my own data format that had what I needed from database approaches but also what I needed from document approaches to data modelling.

I found – somehow (this is all pre-internet, remember. No Googling for me in those days.) – an address in Switzerland that I could send some money to in the form of a money order, to get a 3.5 inch floppy back by return post, with an SGML parser on it called ArcSGML. I also found out about an upcoming gathering in Switzerland of SGML enthusiasts. A colleague, Neville Bagnall, went over and came back with all sorts of invaluable information about this new thing (to us) called generalized markup.

We set to work in earnest. We created our first ever SGML data model and used ArcSGML to ensure we were getting the structure and consistency we wanted in our source data. We set about inventing tags for things like “paragraph”, “bold”, “cross-reference” as well as the simpler field-like tags such as “location”, “duration” etc. We then set about looking at ways to process the resultant SGML file. The output from ArcSGML was not very useful for processing, but we soon found out about another SGML parser called SGMLS by Englishman James Clark. We got our hands on it and having taken one look at the ESIS format it produced, we fell in love with it. Now we had a tool that could validate the structure of our document/database and feed us a clean stream of data to process downstream in our own software.
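To give a flavour of why that stream was so pleasant to work with, here is a minimal sketch in Python (an illustration only, not the C++ code we wrote at the time). It assumes a simplified subset of the line-oriented ESIS output: a line starting with "(" opens an element, ")" closes one, and "-" carries character data. Attribute lines, escape sequences and the other event types are ignored. The element names, the sample data and the function names are all hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical sample in the style of ESIS output: one event per line,
# with the first character identifying the kind of event.
SAMPLE_ESIS = """\
(COURSE
(LOCATION
-Dublin
)LOCATION
(DURATION
-4 years
)DURATION
(PARA
-Includes a
(BOLD
-mandatory
)BOLD
-work placement.
)PARA
)COURSE
C
"""


def esis_events(stream: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, value) pairs for the three event types handled here."""
    for line in stream.splitlines():
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "(":
            yield ("start", rest)
        elif code == ")":
            yield ("end", rest)
        elif code == "-":
            yield ("data", rest)
        # Anything else (attributes, the final "C" line, ...) is skipped
        # in this simplified sketch.


def collect_fields(stream: str) -> Dict[str, str]:
    """Group character data under the name of its innermost open element."""
    stack: List[str] = []
    fields: Dict[str, List[str]] = {}
    for kind, value in esis_events(stream):
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            stack.pop()
        elif kind == "data" and stack:
            fields.setdefault(stack[-1], []).append(value)
    return {name: " ".join(parts) for name, parts in fields.items()}


if __name__ == "__main__":
    for name, text in collect_fields(SAMPLE_ESIS).items():
        print(f"{name}: {text}")
```

The point of the sketch is that the field-like data (location, duration) and the narrative (paragraph, bold) arrive through the same simple event stream, so downstream code needs only a stack to know where it is in the structure.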

Back then C++ was our weapon of choice. Over time our code turned into a toolkit of SGML processing components called IDM (Intelligent Document Manager) which we applied to numerous projects in what became known as the “electronic publishing era”. Things changed very rapidly in those days. The floppy disks gave way to the CD-ROMs. We transitioned from Windows Help files to another Microsoft product called Microsoft Multimedia Viewer. Soon the number of “viewers” for electronic books exploded and we were working on Windows Help, Multimedia Viewer, Folio Views, Lotus Notes to name but four.

As the number of distinct outputs we needed to generate grew, so too did the value of our investment getting up to speed with SGML. We could maintain a single source of content but generate multiple output formats from it, each leveraging the capabilities of the target viewer in a way that made them look and feel like they had been authored directly in each tool as opposed to programmatically generated for them.
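The single-source idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a reconstruction of IDM: the record layout and the function names are hypothetical, and plain text and simple HTML stand in for the real viewer formats of the day.

```python
from html import escape

# Hypothetical single-source record: structured fields plus a narrative
# made of (kind, text) runs, where kind is "text" or "bold".
COURSE = {
    "title": "Computer Science",
    "duration": "4 years",
    "narrative": [
        ("text", "Includes a "),
        ("bold", "mandatory"),
        ("text", " work placement."),
    ],
}


def render_plain(course: dict) -> str:
    """Plain-text view: bold runs are shown in upper case."""
    body = "".join(
        text.upper() if kind == "bold" else text
        for kind, text in course["narrative"]
    )
    return f"{course['title']} ({course['duration']})\n{body}"


def render_html(course: dict) -> str:
    """HTML view: bold runs become <b> elements, all text is escaped."""
    body = "".join(
        f"<b>{escape(text)}</b>" if kind == "bold" else escape(text)
        for kind, text in course["narrative"]
    )
    return (
        f"<h1>{escape(course['title'])}</h1>"
        f"<p>Duration: {escape(course['duration'])}</p>"
        f"<p>{body}</p>"
    )


if __name__ == "__main__":
    print(render_plain(COURSE))
    print(render_html(COURSE))
```

Each renderer decides for itself what "bold" means in its target, which is what lets the output look native to each viewer while the source stays unchanged.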

My concept of a “document” changed completely over this period. I began to see how formatting – and content – could be separated from each other. I began to see how in so doing, a single data model could be used to manage content that is tabular (like a classic tabular database) as well as content that is irregular, hierarchical, even recursive. Moreover, I could see how keeping the formatting out of the core content made it possible to generate a variety of different formatting “views” of the same content.

It would be many years later that the limitations of this approach became apparent to me. Back then, I thought it was a completely free lunch. I was a fully paid-up convert to the concept of generalized markup and machine readable, machine validatable documents. As luck would have it, this coincided with the emergence of a significant market for SGML and SGML technologies. Soon I was knee deep in SGML parsers, SGML programming languages, authoring systems, storage systems and was developing more and more of our own tools, first in C++, then Perl, then Python.

The next big transition in my thinking about documents came when I needed to factor non-technical authors into my thinking. This is where I will turn next.

Next: What is a document? - Part 6.

1 comment:

Unknown said (12:06 AM): Learning by doing! I can see the 'transition' you're talking about. Excellent piece Sean.
