Sean McGrath: 06/06/2010

Thursday, June 10, 2010

And the answer is?...

Today I have received a couple of separate pieces of feedback from folks wondering when - and if - I am going to talk about the solutions to the problem areas I have been outlining vis-a-vis XML in legislatures/parliaments.

Yes, I will be talking about the solutions and I will start doing it soon. I am nearly finished laying out the most important problem areas... Your patience is appreciated. It will be much easier for me to explain all the "whys" that the architecture discussion will provoke once I have laid out the problem areas that need to be tackled.

Wednesday, June 09, 2010

XML in legislature/parliament environments: the subtle inter-play between legal content and legal content renderings

Last time in this series on KLISS, I talked about the non-trivial nature of document validation in legislative/parliamentary environments. Today I want to turn to the subtle inter-play between legal content and legal content renderings.

Earlier in this series, I talked about how rendering algorithms are notoriously application specific and how, in a worrying way, part of the meaning of a legal document – the part dependent on the rendering – can be locked up inside unknown/unknowable algorithms inside possibly proprietary software.

This is already a pretty subtle point. Some folks I have encountered in legislatures/parliaments have been incredulous when I point out that no amount of beautifully open XML gives them ownership over their own content unless they own any semantics that may be lurking in the interplay between the data and its rendering. Folks are often surprised to find that I am a big believer in vellum copies and non-fugitive inks and 2400 DPI tiff images. The subtleties surrounding renderings and semantics are largely the reason why.

Here I want to turn to an even more subtle inter-play between legal content and legal content presentation. One that is, sadly, getting worse as technology advances...

Consider a web page. I publish it. You pull it down to your computer. Now, are you looking at what I put up there? How confident can we be that we are looking at the same thing? I send you a word-processor file via e-mail. You open it in your word-processor. How confident can we be that we are looking at the same thing? (To keep matters simple, let's forget about man-in-the-middle attacks for now. That is a whole other topic that I want to return to later in the context of authentication of legal materials...)

Some facts worthy of consideration...

You cannot control what my browser does with your HTML document. It may silently chop off your content, render the content differently left-to-right, change the fonts used etc. etc.
You cannot control what my word processor does with your word-processor document. Even if I am using the exact same application as you, I most likely have different screen resolution, different fonts, different printer...all of which can impact how your document appears to me – either on screen or on paper when I print it.
Some web browsers interpret the characters they see in web pages differently based on the browsing history of the user. That means I may see something different when I view your document depending on what document I looked at *last*.
Most author/edit systems compete on levels of content auto-generation. E.g. they may automatically ornament lists for you (1,2,3 or i,ii,iii etc.) or add hyperlinks for you or generate TOC's for you...The algorithms for doing so are notoriously different from one application to another and in fact, between versions of the same application. Who knows what list numbering I will get when I load your document into my word processor?
Content and document management systems are becoming increasingly sophisticated in how they serve out content. Long gone are the days when static pages are served out over Apache. Nowadays, the stream of bytes sent back in response to an HTTP GET request involves a lot of contextual dependencies. For example, who the user is, what area of the world they are connecting from, what HTTP Accept headers were provided, what cookie information is already present, what content is being aggregated on the server side... In short, the bytes I get sent may be very different from the bytes you get sent. In fact, the bytes I get might change depending on when I ask for them, where I am when I ask...

The conclusion I have come to is that the digital revolution has resulted in many of the world's documents becoming somewhat quantum mechanical in nature. By that I mean, it is not possible to know for sure what we will see until we actually *look* - using software to do the looking. Once we look, we collapse the quantum uncertainty but if we had looked some other way – with some other tool or from some other machine – we may have got a different result. (This was the topic of an XTech conference keynote given by a more hirsute version of my current self in 2008.)

Quantum mechanical uncertainties are fine and dandy in physics and most of the time it really doesn't matter in content delivery - but it really, really matters for law! The notion that the law might change depending on who or what or when or how it is being looked at, is not a good thing.

Now here is the unfortunate fact of life: the closer a piece of digital content is to being semantically rich, the more pronounced the quantum uncertainties of its rendering semantics are. Or, to put it bluntly, the “purer” your XML from a standard XML model perspective, the more likely it is that you do not know for sure what the text will look like when rendered. In legislatures/parliaments – as I have outlined in a number of previous posts – rendering often really, really matters and impacts what the text actually means.

Sometimes people ask me "what is the best file format to store our laws in?". I generally answer "mu" and quickly explain that the premise of the question is incorrect in my opinion. There is no one-size-meets-all-needs file format for law. There cannot be because of the mutually incompatible requirements of semantic richness and rendering fidelity and ease of author/edit and tamper evidence and...

The best that can be done, in most cases, is to establish a normative triple consisting of (data, rendering, rendering context) and clearly assert the secondary non-normative nature of all triples derived from that. The most common electronic master rendering today is PDF/PostScript but, I'm afraid, even that is insufficiently locked down for the critical task of being the normative rendering of law in my opinion. Most PDF readers silently perform font substitutions, for example, again creating fidelity issues with respect to multiple renderings of the same byte stream. Also, some very common – and very important – symbols like the Euro symbol and the double S section symbol step outside US ASCII into the highly uncertain world of Unicode. I have lost count of the number of times Euro or section symbols have silently disappeared in my law publishing work-flows on the way to paper production. In today's XML/Web world, other commonly messed up characters (which I will not include here in case they get messed up!) include less than signs, ampersands and so on. (See ampersand attrition).
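To make the triple idea a little more concrete, here is a minimal sketch in Python using only the standard library. The class name `RenderingRecord`, its fields, the sample bytes and the use of SHA-256 fingerprints are all illustrative assumptions on my part, not a description of any existing system. Note that the section and Euro symbols are written as escapes so they cannot get lost on the way to you.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class RenderingRecord:
    """Hypothetical record of one (data, rendering, rendering context) triple."""
    data: bytes              # source bytes, e.g. the XML
    rendering: bytes         # rendered bytes, e.g. a page image
    context: dict            # how it was rendered: fonts, page size, tool version
    normative: bool = False  # True for the single master triple only

    def fingerprint(self) -> str:
        """SHA-256 over all three parts; changing any part changes the result."""
        context_bytes = json.dumps(self.context, sort_keys=True).encode("utf-8")
        outer = hashlib.sha256()
        for part in (self.data, self.rendering, context_bytes):
            # Hash each part first so part boundaries cannot blur together.
            outer.update(hashlib.sha256(part).digest())
        return outer.hexdigest()


# Made-up sample content: U+00A7 is the section sign, U+20AC the Euro sign.
source = "<section>\u00a7 1. Fee: \u20ac5</section>".encode("utf-8")

master = RenderingRecord(source, b"...master page image bytes...",
                         {"font": "Times", "page": "letter"}, normative=True)
derived = RenderingRecord(source, b"...other page image bytes...",
                          {"font": "Arial", "page": "A4"})

# Same data, different rendering and context: the triples must not be confused.
print(master.data == derived.data)                    # True
print(master.fingerprint() == derived.fingerprint())  # False
```

The point of fingerprinting all three parts together is that identical data is not enough: two triples only count as the same thing if the rendering and the rendering context match too.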

The picture is further complicated by the common practice of pre-press processing of postscript and PDF files prior to printing. Anybody who thinks that creating PDF or Postscript locks down their content should really visit a print-shop and watch what really goes on in modern prepress environments.

In summary, firstly, not only is separating content and rendering not always as simple as it might sound but in the case of legislative informatics, it can actually be a very bad idea. Having law that changes depending on the who/what/when/where/how of its observation is not a good thing. Secondly, it is an unfortunate but unavoidable law of the universe that the more semantic the data is, the more non-determinism is present in its renderings and in legal informatics (don't shoot the messenger!) renderings matter.

Next up, the interesting issue of legislative/parliamentary content aggregation and derived document types with some probable detours into the world of Xanadu, Strange loops and hermeneutic circles.

Tuesday, June 08, 2010

XML in legislature/parliament environments: the complexity of amendment cycle business rules

Today in this KLISS series, I want to turn to number 5 in my list of concerns about the XML standard model in legislatures/parliaments: the complexity of amendment cycle business rules that often pre-date computers and cannot be changed to make life easier for modern software.

The XML standard model has it that many business-rules can be encoded in structure rules so that structure-aware XML editors can make sure that documents are valid per the rules every step of the way. This is problematic in legislatures/parliaments for many reasons. I want to talk a little about two main areas where it's a lot more complicated than simply pressing a schema-oriented “validate” button.

The contextual complexity of business rules
The syntactic complexity of business rules

The contextual complexity of business rules

The validity of a document such as a bill draft cannot, in general, be computed simply by looking at the document itself. There is a large contextual space surrounding the document that strongly influences validity. Some examples (with a somewhat US-centric slant...)

The sponsor(s) of the bill must be elected members at the time the draft is made
The date of drafting must be after the pre-filing date
If the date is after the pre-filing date but before start of session, it is not valid unless there is an accompanying pre-filing certificate
Any statute pulled in must come first from current year bills if that statute has been modified already this session
etc...

How many of those rules are expressible in your XML schemas? Not many. Validity here is all contextual and requires lookups and probing for auxiliary information to compute validity.
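To give a flavor of what "contextual" means in practice, here is a rough Python sketch of the first three rules above. Everything in it is an assumption made for illustration: the `Draft` and `Context` classes, their field names, the sample dates, and the idea that the set of elected members has already been looked up as of the drafting date. The fourth rule is left out because it needs a lookup into the current session's bills.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Draft:
    """Hypothetical view of a bill draft: only the fields the rules need."""
    sponsors: list
    drafted_on: date
    has_prefiling_certificate: bool = False


@dataclass
class Context:
    """Hypothetical facts that live outside the document itself."""
    elected_members: set   # assumed to be the membership as of the draft date
    prefiling_date: date
    session_start: date


def contextual_errors(draft, ctx):
    """Return a list of rule violations; an empty list means no problems found."""
    errors = []
    for sponsor in draft.sponsors:
        if sponsor not in ctx.elected_members:
            errors.append(f"sponsor {sponsor!r} is not an elected member")
    if draft.drafted_on <= ctx.prefiling_date:
        errors.append("draft date is not after the pre-filing date")
    elif draft.drafted_on < ctx.session_start and not draft.has_prefiling_certificate:
        errors.append("pre-session draft lacks a pre-filing certificate")
    return errors


# Made-up sample data.
ctx = Context(elected_members={"Smith"},
              prefiling_date=date(2009, 12, 1),
              session_start=date(2010, 1, 11))
draft = Draft(sponsors=["Smith", "Jones"], drafted_on=date(2009, 12, 15))

for problem in contextual_errors(draft, ctx):
    print(problem)
```

None of the inputs in `Context` can be found by inspecting the draft's markup, which is exactly why a schema alone cannot decide these questions.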

The syntactic complexity of business rules

Examples in this category include:

If it is a resolution, the first word must be “Whereas”
All citations must be bluebook compliant
New sections must use the phrase “New sec.” followed by an n-space and followed by a number except for the first section which must use “New section.”
In journals, the messages must come after the committee referrals, even if they happen before it from a time perspective.
All pulled in sections can have an effective date and if specified, it must be a date later than the current date. As well as real dates, the “date” might just say “register” in which case the real date of enactment is that specified in the register.
...

How many of those rules are expressible in your XML schemas? Not many. Validity here is mostly complex lexical/syntactic form checking. In XML vernacular, most of the validity here features #PCDATA constraints not constraints on tags or attributes.
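Here is a rough Python sketch of three of the checks above, to show that they are checks on the text itself rather than on tags. The function names are hypothetical. Reading "n-space" as the Unicode EN SPACE (U+2002), accepting a trailing comma after “Whereas”, and expecting dates in ISO form (YYYY-MM-DD) are all assumptions made for illustration.

```python
import re
from datetime import date

# "New sec." + EN SPACE (U+2002) + a number. Treating "n-space" as U+2002
# is an assumption.
NEW_SEC = re.compile(r"New sec\.\u2002\d+")


def resolution_error(text):
    """A resolution must start with the word "Whereas"."""
    words = text.split()
    if not words or words[0].rstrip(",:;") != "Whereas":
        return 'a resolution must begin with "Whereas"'
    return None


def new_section_errors(headings):
    """The first heading uses "New section."; later ones use "New sec." + number."""
    errors = []
    for index, heading in enumerate(headings):
        if index == 0:
            ok = heading.startswith("New section.")
        else:
            ok = NEW_SEC.match(heading) is not None
        if not ok:
            errors.append(f"heading {index + 1} is malformed: {heading!r}")
    return errors


def effective_date_error(value, today):
    """An effective date is "register" or a date later than today."""
    if value == "register":
        return None
    try:
        when = date.fromisoformat(value)  # ISO input format is an assumption
    except ValueError:
        return f"not a date: {value!r}"
    if when <= today:
        return "effective date must be later than the current date"
    return None


# The third heading uses an ordinary space, so it is flagged.
headings = ["New section. 1", "New sec.\u20022", "New sec. 3"]
print(new_section_errors(headings))
```

Even this toy version needs a regular expression, a date parser and knowledge of today's date, none of which a grammar-based schema gives you.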

Now granted, once you step outside purely grammar-oriented notions of validity, tools like Schematron can help, but there is no escaping the fact that most real-world validity constraints on legislative/parliamentary documents require the services of a fully Turing-complete programming environment with rich access to contextual data sets in order to compute validity.

The XML standard model talks a lot about how great it is to be able to ensure that things nest inside other things as they should per the grammar rules. I'm not saying that this is not useful (at times) but it only gives you coverage of a very small fraction of the validity business rules that you need in legislative informatics. Add on top of that the fact that notions of validity change at pretty much every step of the workflow from bill draft to enrolled bill and you can hopefully see the limitations of the XML standard model.
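One way to picture validity that changes from step to step is a table of rules keyed by workflow stage. The Python sketch below is only that, a picture: the two stage names come from the sentence above, but the individual rules, the dictionary-shaped document and all of the field names are made-up placeholders, not real drafting rules.

```python
def needs_title(doc):
    return None if doc.get("title") else "missing title"


def needs_sponsor(doc):
    return None if doc.get("sponsors") else "missing sponsor"


def needs_enrolled_number(doc):
    return None if doc.get("enrolled_number") else "missing enrolled number"


# Hypothetical rule sets: what counts as "valid" depends on the stage.
RULES_BY_STAGE = {
    "bill draft": [needs_title],
    "enrolled bill": [needs_title, needs_sponsor, needs_enrolled_number],
}


def validate(doc, stage):
    """Run only the rules that apply at the given workflow stage."""
    errors = []
    for rule in RULES_BY_STAGE[stage]:
        message = rule(doc)
        if message is not None:
            errors.append(message)
    return errors


doc = {"title": "An act concerning fees", "sponsors": ["Smith"]}
print(validate(doc, "bill draft"))     # []
print(validate(doc, "enrolled bill"))  # ['missing enrolled number']
```

The same document is valid at one stage and invalid at another, which a single schema attached to the document type has no natural way to express.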

Now it can be argued that some of the business rules could be simplified and that is doubtless true but in most legislatures, the drafting rules and the procedure rules are the result of decades – sometimes centuries worth – of evolution. Making life easy for back-office computers tends not to feature highly on the wish-lists of legislators. Moreover, politics being politics, activity in the front-office often serves to further complicate the workflows in the back-office.

Some of you may be doubting that the workflow and the validation can be all that complicated really. To get a flavor of the issues involved, I will end with some links to interesting documents that speak to the complexity of validity. Oh, and one final complexity I should mention...all of these rules are, of course, subject to change at any time. In fact, most legislatures/parliaments have a procedure in their rules known as “suspend the rules”. (Having fun yet?) How will your XML schema handle that? :-)

Finally, finally, those interested in the implications of a rule-based system in which changing the rules is one of the rules may be interested in Nomic.

Next up in this KLISS series, the subtle inter-play between legal content and legal content renderings.

Sean McGrath
