Wednesday, June 09, 2010

XML in legislature/parliament environments: the subtle inter-play between legal content and legal content renderings

Last time in this series on KLISS, I talked about the non-trivial nature of document validation in legislative/parliamentary environments. Today I want to turn to the subtle inter-play between legal content and legal content renderings.

Earlier in this series, I talked about how rendering algorithms are notoriously application specific and how, in a worrying way, part of the meaning of a legal document – the part dependent on the rendering – can be locked up inside unknown/unknowable algorithms inside possibly proprietary software.

This is already a pretty subtle point. Some folks I have encountered in legislatures/parliaments have been incredulous when I point out that no amount of beautifully open XML gives them ownership over their own content unless they also own any semantics that may be lurking in the interplay between the data and its rendering. Folks are often surprised to find that I am a big believer in vellum copies and non-fugitive inks and 2400 DPI TIFF images. The subtleties surrounding renderings and semantics are largely the reason why.

Here I want to turn to an even more subtle inter-play between legal content and legal content presentation. One that is, sadly, getting worse as technology advances...

Consider a web page. I publish it. You pull it down to your computer. Now, are you looking at what I put up there? How confident can we be that we are looking at the same thing? I send you a word-processor file via e-mail. You open it in your word-processor. How confident can we be that we are looking at the same thing? (To keep matters simple, let's forget about man-in-the-middle attacks for now. That is a whole other topic that I want to return to later in the context of authentication of legal materials...)

Some facts worthy of consideration...

  • You cannot control what my browser does with your HTML document. It may silently chop off your content, lay the text out differently left-to-right, change the fonts used, and so on.
  • You cannot control what my word processor does with your word-processor document. Even if I am using the exact same application as you, I most likely have different screen resolution, different fonts, different printer...all of which can impact how your document appears to me – either on screen or on paper when I print it.
  • Some web browsers interpret the characters they see in web pages differently based on the browsing history of the user. That means I may see something different when I view your document depending on what document I looked at *last*.
  • Most author/edit systems compete on levels of content auto-generation. For example, they may automatically ornament lists for you (1, 2, 3 or i, ii, iii etc.) or add hyperlinks for you or generate TOCs for you... The algorithms for doing so are notoriously different from one application to another and, in fact, between versions of the same application. Who knows what list numbering I will get when I load your document into my word processor?
  • Content and document management systems are becoming increasingly sophisticated in how they serve out content. Long gone are the days when static pages were served out over Apache. Nowadays, the stream of bytes sent back to an HTTP GET request involves a lot of contextual dependencies: who the user is, what part of the world they are connecting from, what HTTP Accept headers were provided, what cookie information is already present, what content is being aggregated on the server side... In short, the bytes I get sent may be very different from the bytes you get sent. In fact, the bytes I get might change depending on when I ask for them and where I am when I ask... (The sketch after this list makes the point concrete.)
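
To make that last point concrete, here is a minimal sketch, in Python and using only the standard library, that fetches the same URL twice with different request headers and compares digests of the bytes received. The URL is hypothetical; substitute any server that does content negotiation.

    import hashlib
    import urllib.request

    # Hypothetical resource; any content-negotiating server will do.
    URL = "https://example.org/statutes/2010/chapter-1"

    def fetch_digest(headers):
        """GET the resource with the given headers; return a SHA-256 digest of the body."""
        req = urllib.request.Request(URL, headers=headers)
        with urllib.request.urlopen(req) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    # Same URL, two different request contexts.
    digest_a = fetch_digest({"Accept-Language": "en-US", "Accept": "text/html"})
    digest_b = fetch_digest({"Accept-Language": "fr-FR", "Accept": "application/xhtml+xml"})

    # Nothing obliges these to match: the server is free to vary the response
    # on language, format, cookies, geography and more.
    print("Same bytes for 'the same' document?", digest_a == digest_b)

Comparing digests rather than eyeballing rendered output is the honest test here: two byte streams that differ at all may carry different renderings and, potentially, different meanings.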

The conclusion I have come to is that the digital revolution has resulted in many of the world's documents becoming somewhat quantum mechanical in nature. By that I mean it is not possible to know for sure what we will see until we actually *look* - using software to do the looking. Once we look, we collapse the quantum uncertainty, but if we have looked some other way – with some other tool or from some other machine – we may have got a different result. (This was the topic of an XTech conference keynote given by a more hirsute version of my current self in 2008.)

Quantum mechanical uncertainties are fine and dandy in physics, and most of the time they really don't matter in content delivery - but they really, really matter for law! The notion that the law might change depending on who or what or when or how it is being looked at is not a good thing.

Now here is the unfortunate fact of life: the closer a piece of digital content is to being semantically rich, the more pronounced the quantum uncertainties of its rendering semantics are. Or, to put it bluntly, the "purer" your XML from a standard XML model perspective, the more likely it is that you do not know for sure what the text will look like when rendered. In legislatures/parliaments – as I have outlined in a number of previous posts – rendering often really, really matters and impacts what the text actually means.
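
A toy illustration of the point, in Python (standard library only, element names invented for the purpose): the XML below is "semantically pure" in that it records that a section contains paragraphs but says nothing about how they are numbered. Two equally plausible renderers disagree about what the reader sees.

    import xml.etree.ElementTree as ET

    # "Semantically pure" data: paragraph order is recorded, numbering is not.
    SECTION = (
        '<section id="sec-1">'
        "<para>Definitions.</para>"
        "<para>Application of this Act.</para>"
        "<para>Penalties.</para>"
        "</section>"
    )

    def render_decimal(section):
        """One hypothetical renderer: arabic numerals."""
        return ["(%d) %s" % (i, p.text)
                for i, p in enumerate(section.findall("para"), start=1)]

    def render_roman(section):
        """Another hypothetical renderer: lower-case roman numerals."""
        numerals = ["i", "ii", "iii", "iv", "v", "vi"]
        return ["(%s) %s" % (numerals[i], p.text)
                for i, p in enumerate(section.findall("para"))]

    section = ET.fromstring(SECTION)
    print(render_decimal(section))  # ['(1) Definitions.', '(2) Application of this Act.', ...]
    print(render_roman(section))    # ['(i) Definitions.', '(ii) Application of this Act.', ...]

If some other provision cites "paragraph (2)" of that section, the data alone cannot tell you which renderer's numbering the citation was written against – that fragment of meaning lives in the rendering algorithm.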

Sometimes people ask me "what is the best file format to store our laws in?". I generally answer "mu" and quickly explain that the premise of the question is incorrect in my opinion. There is no one-size-meets-all-needs file format for law. There cannot be because of the mutually incompatible requirements of semantic richness and rendering fidelity and ease of author/edit and tamper evidence and...

The best that can be done, in most cases, is to establish a normative triple consisting of (data, rendering, rendering context) and clearly assert the secondary, non-normative nature of all triples derived from it. The most common electronic master rendering today is PDF/Postscript but, I'm afraid, even that is insufficiently locked down for the critical task of being the normative rendering of law in my opinion. Most PDF readers silently perform font substitutions, for example, again creating fidelity issues with respect to multiple renderings of the same byte stream. Also, some very common – and very important – symbols, like the Euro symbol (€) and the section symbol (§), step outside US-ASCII into the highly uncertain world of Unicode. I have lost count of the number of times Euro or section symbols have silently disappeared in my law publishing work-flows on the way to paper production. In today's XML/Web world, other commonly messed up characters (which I will not include here in case they get messed up!) include less-than signs, ampersands and so on. (See ampersand attrition.)
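
The character-loss failure mode is easy to reproduce. A minimal sketch in Python: pushing the section and Euro symbols through an ASCII-only stage with a lenient error handler mangles them silently, with no exception and no warning.

    # U+00A7 is the section symbol, U+20AC the Euro symbol.
    text = "Fines under \u00a7 12 shall not exceed \u20ac5,000."

    # A lenient ASCII stage quietly substitutes question marks...
    mangled = text.encode("ascii", errors="replace").decode("ascii")
    print(mangled)  # Fines under ? 12 shall not exceed ?5,000.

    # ...whereas the strict default fails loudly, which for legal text
    # is usually the better behaviour.
    try:
        text.encode("ascii")
    except UnicodeEncodeError as e:
        print("Refused to mangle silently:", e.reason)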
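
And to make the normative-triple idea itself concrete, here is one possible shape for it – a sketch, with every name hypothetical, of how master data, its blessed rendering and the rendering context might be pinned together so that everything else can be flagged as derived and non-normative.

    import hashlib
    from dataclasses import dataclass

    def digest(blob):
        return hashlib.sha256(blob).hexdigest()

    @dataclass(frozen=True)
    class NormativeTriple:
        data_sha256: str    # digest of the master data (e.g. the XML bytes)
        render_sha256: str  # digest of the blessed rendering (e.g. the PDF bytes)
        context: str        # rendering context: tool, version, fonts, page setup

    # The enrolled, normative artefacts (contents are placeholders).
    xml_bytes = b"<bill>...</bill>"
    pdf_bytes = b"%PDF-1.4 ..."
    master = NormativeTriple(
        data_sha256=digest(xml_bytes),
        render_sha256=digest(pdf_bytes),
        context="renderer=acme-fo 2.1; fonts=embedded; page=US-Letter",
    )

    def is_normative(data, rendering, context):
        """Anything failing this check is, by definition, a derived, non-normative copy."""
        return NormativeTriple(digest(data), digest(rendering), context) == master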

The picture is further complicated by the common practice of pre-press processing of Postscript and PDF files prior to printing. Anybody who thinks that creating PDF or Postscript locks down their content should visit a print-shop and watch what really goes on in modern pre-press environments.

In summary: firstly, not only is separating content and rendering not always as simple as it might sound but, in the case of legislative informatics, it can actually be a very bad idea. Having law that changes depending on the who/what/when/where/how of its observation is not a good thing. Secondly, it is an unfortunate but unavoidable law of the universe that the more semantic the data is, the more non-determinism is present in its renderings – and in legal informatics (don't shoot the messenger!) renderings matter.

Next up, the interesting issue of legislative/parliamentary content aggregation and derived document types with some probable detours into the world of Xanadu, Strange loops and hermeneutic circles.

1 comment:

Gannon J. Dick said...

I'm not sure if an Autonomous Intelligent Cyber Entity (AiCE) is a good witch or a bad witch.

CodeX

... but I think it's a witch. IANAL, but my friends in the legal community seem to think litigation (scale=years) is the way to keep the witch "good". I think the assumption of goodness is suspect as a way around enumerating badness.

What do you think, Sean?