Thursday, December 14, 2017

What is a document? - Part 2

Back in 1985, when I needed to create a “document” on a computer, I had only two choices. (Yes, I am indeed avoiding trying to define “document” just yet. We will come back to it when we have more groundwork laid for a useful definition.) The first choice involved typing into what is known generically as a “text editor”. Back in those days, US ASCII was the main encoding for text and it allowed for just the basic letters, numbers and a few punctuation symbols. The so-called “text files” created by these “text editors” could be viewed on screens which typically had 80 columns and 25 rows. They could also be printed onto paper, using either “dot matrix” printers or higher-resolution computerized typewriters such as the so-called “golf ball” typewriters/printers, which mimicked a human typist using a ribbon-based impact printer.

The second choice was to wedge the text into little boxes called “fields” to be stored in a “database”. Yes, my conceptual model of text in computers in those early days was a very binary one. (Some nerd humour in the last sentence.)

On one hand, I could type stuff into small “boxes” on a screen, which typically resulted in the creation of some form of “structured” data file, e.g. a CODASYL database [1]. On the other hand, I could type stuff into an expandable digital sheet of paper without imposing any structure on the text beyond a sequence of characters, often chunked into lines with what we used to call CRLF separators (Carriage Return, Line Feed).

(Aside: You can see the typewriter influence in the terminology here. Return the carriage (holding the print head) to the left of the page. Feed the page upwards by one line. So Carriage Return + Line Feed = CR/LF.)

(Aside: I find that the origins of some of this terminology are often news to younger developers, who wonder why moving to a new line is two characters instead of one on some machines. Surely “newline” is one thing? Well, it was two originally because one command moved the carriage back (the “CR”) and another command moved the paper up a line (the “LF”), hence the common pairing: CR/LF. When I explain this I double up by explaining “uppercase/lowercase”. The origins of the latter in particular are not well known to digital natives, in my experience.)
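For anyone who wants to see the two characters for themselves, here is a minimal Python sketch. The helper name `split_lines` and the sample text are made up purely for illustration; the only facts relied on are the ASCII codes for CR (13) and LF (10).

```python
# CR (Carriage Return) is ASCII 13; LF (Line Feed) is ASCII 10.
CR = b"\r"
LF = b"\n"


def split_lines(data: bytes) -> list[bytes]:
    """Split text into lines, accepting CR/LF pairs, a lone LF, or a lone CR.

    Hypothetical helper for illustration only: it normalises every style of
    line ending to a single LF and then splits on that.
    """
    return data.replace(CR + LF, LF).replace(CR, LF).split(LF)


# The same two lines of text, stored with the two common conventions.
crlf_text = b"first line\r\nsecond line"
lf_text = b"first line\nsecond line"

print(list(CR + LF))                      # [13, 10]
print(len(crlf_text) - len(lf_text))      # 1 (one extra byte per line break)
print(split_lines(crlf_text) == split_lines(lf_text))  # True
```

The content is identical either way; only the bytes used to mark the end of a line differ.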

From my first encounters with computers, this difference in how the machines handled storing data intrigued me. On one hand, there were “databases”. These were stately, structured, orderly digital objects. Mathematicians could say all sorts of useful things about them and create all sorts of useful algorithms to process them. Databases were designed for automation.

On the other hand, there was the rebellious, free-wheeling world of text files. Unstructured. Disorderly. A pain in the neck for automation. Difficult to reason about and create algorithms for, but fantastically useful precisely because they were unstructured and disorderly.

I loved text files back then. I still love them today. But as I dug deeper into computer science, I began to see that the binary world view (database versus text, structured versus unstructured) was simple, elegant and wrong. Documents can indeed be “structured”. Document processing can indeed be automated. It is possible to reason about documents and create algorithms for them, but it took me quite a while to get to grips with how this can be done.

My journey of discovery started with an ADM 3A+ terminal to a VAX 11/780 mini-computer (by day) [2] and an Apple IIe personal computer running CP/M (by night) [3].

For the former, there was a program called RUNOFF. For the latter, there was a program called WordStar and one of my favorite pieces of hardware of all time: an Epson FX80 dot matrix printer.