Saturday, September 05, 2009

SGML and justifiable complexity

Rick Jelliffe is writing some interesting stuff these days on parsing: SGML, XML, HTML etc. Rick talks about GROVES and that triggered a flashback. Boy, those were the days! DSSSL, HyTime... I remember GROVES being extruded into Graphical Representation of Property values. Rick says GROVEs are Groupings Of Valid Elements. Tomayto. Tomato. It's very instructive to watch the recent RDF goings-on in the light of the GROVE stuff of old.

Anyway, Rick makes the important point that you cannot linearize SGML parsing because it has feedback. Amen to that. SGML has more feedback loops than a room full of amps and microphones - as anyone who has tried to write a true SGML parser will tell you.
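To make the feedback idea concrete, here is a toy sketch of just one such loop. It is not real SGML and not anything from Rick's posts: the tag syntax is cut down to bare lowercase names, and `DECLARED_CONTENT`, `TAG` and `parse` are names I made up for illustration. The one real SGML notion it borrows is declared content: inside an element declared CDATA, only the end tag counts as markup, so the scanner has to ask the parser which element is open before it can decide what a "<" means.

```python
import re

# Hypothetical declarations, standing in for what a DTD would tell the parser.
DECLARED_CONTENT = {"doc": "mixed", "code": "cdata"}

# Toy tag syntax: <name> or </name>, lowercase letters only, no attributes.
TAG = re.compile(r"<(/?)([a-z]+)>")


def parse(text, declared=DECLARED_CONTENT):
    """Return a flat list of events: ('start', n), ('end', n), ('text', s)."""
    events = []
    stack = []  # currently open elements, innermost last
    pos = 0
    while pos < len(text):
        # Feedback: the parser's current element decides how we scan next.
        current = stack[-1] if stack else None
        if current is not None and declared.get(current) == "cdata":
            # In CDATA mode only the matching end tag is treated as markup.
            close = "</" + current + ">"
            end = text.find(close, pos)
            if end == -1:
                end = len(text)
            if end > pos:
                events.append(("text", text[pos:end]))
            pos = end
            if text.startswith(close, pos):
                events.append(("end", current))
                stack.pop()
                pos += len(close)
            continue
        m = TAG.match(text, pos)
        if m:
            closing, name = m.group(1), m.group(2)
            if closing:
                events.append(("end", name))
                if name in stack:
                    # Close the element and any children still open.
                    while stack.pop() != name:
                        pass
            else:
                events.append(("start", name))
                stack.append(name)
            pos = m.end()
        else:
            # Plain text up to the next '<' (a lone '<' is kept as text).
            nxt = text.find("<", pos + 1)
            if nxt == -1:
                nxt = len(text)
            events.append(("text", text[pos:nxt]))
            pos = nxt
    return events


sample = "<doc>a<code><b>x</code>z</doc>"
for event in parse(sample):
    print(event)
```

Run it on the same input with `code` declared as "mixed" instead and the `<b>` turns into a start tag. Same characters, different tokens, and the only way to know which is to have already parsed what came before. Real SGML stacks many more loops like this on top of each other.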

For me, the big question is this: is that complexity justified? Given that SGML is, after all, an invented language, the computational complexity of parsing it is in human hands. With invented languages, we make parsing problems that we then have to solve. Does the benefit outweigh the cost?

In the case of SGML, I believe the answer is no. Charles Goldfarb has a brilliant mind but, in my opinion, it is the mind of a lawyer more so than a computer scientist.

Now I work with legal texts a lot in my day job. I have read Peter Suber's The Paradox of Self-Amendment and I'm pretty familiar with the difficulties of creating homomorphisms between concepts from logic and concepts from jurisprudence.

    "That legal rules may be bad logic and good jurisprudence at the same time is yet to be established, of course, but I will at least allow myself to proceed as if that conclusion were not foreclosed a priori." -- The Paradox of Self-Amendment


I am as fond of hermeneutic circles as the next language nerd but I think we need to create Strange Loops with caution in computer science. Alongside the intellectual delights involved in their creation, we need to bring an awareness of the issues they create downstream.

Now insofar as markup languages are attempting to be expressive in a human language sense, we get pulled towards parsing complexity. Insofar as we are designing them for machine readability, we get pulled towards simple models in the Chomsky-esque taxonomies of language types.

It is the age-old debate in disguise. Are markup languages a branch of linguistics or a branch of mathematics?

The answer, of course, is "yes", and therein lies the heart of the problem.