Featured Post

LinkedIn

These days, I mostly post my tech musings on LinkedIn: https://www.linkedin.com/in/seanmcgrath/

Sunday, May 30, 2010

XML in legislature/parliament environments : The centrality of line/page number citation in amendment cycles

Last time I talked about KLISS, I listed 7 reasons why the standard analysis of how XML can/should be used in legislatures/parliaments is simple, elegant and, I would argue, wrong.

The first reason I listed in support of that strong assertion is the centrality of line/page number citation in amendment cycles. That is the topic I want to address in this post.

The standard XML model goes something like this:

1 - find out what the publication/outputs are (be they paper, cd-roms, websites, e books, whatever)

2 - classify the outputs from a content perspective i.e. bills are different from journals are different from...

3 - create hierarchical schemas that capture the logical structure of the outputs. Separate out the logical structure from the "accidents" of the particular rendering you are looking at, i.e. separate the content from the presentation of the content: things like font, line breaks, page breaks, list ornamentation, footnote locations and so on.

4 - figure out how to programmatically layer on the presentation information using stylesheet technologies, bespoke rendering algorithms etc.

5 - Create author/edit workflows that concentrate on the logical view of the data. Leave as many aspects of presentation to be applied automatically in the final publication production systems as possible.
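The separation that steps 3 to 5 describe can be sketched in a few lines of Python: one logical model containing no presentation information, with two different renderings layered on programmatically at output time. The element names and renderings here are invented purely for illustration; they are not any real legislative schema.

```python
import xml.etree.ElementTree as ET

# A hypothetical logical model of a bill fragment: structure only, with
# no fonts, page breaks, or line numbers anywhere in it.
LOGICAL = """
<bill id="hb0000">
  <title>An Act concerning examples</title>
  <section num="1">
    <text>This section is purely illustrative.</text>
  </section>
</bill>
"""

def render_text(bill):
    # One "stylesheet": a plain-text rendering layered on at output time.
    lines = [bill.findtext("title").upper()]
    for sec in bill.findall("section"):
        lines.append(f"Section {sec.get('num')}. {sec.findtext('text')}")
    return "\n".join(lines)

def render_html(bill):
    # A second rendering, produced from the *same* logical content.
    body = "".join(
        f"<p><b>Sec. {s.get('num')}.</b> {s.findtext('text')}</p>"
        for s in bill.findall("section")
    )
    return f"<h1>{bill.findtext('title')}</h1>{body}"

bill = ET.fromstring(LOGICAL)
print(render_text(bill))
print(render_html(bill))
```

When it works, this is exactly the payoff the standard model promises: new renderings (new products) are just new functions over the same content.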

If all goes according to plan with this standard model, you end up in a very happy place. Your content is independent of any one output format. You can create new renderings (which creates new products) from existing content cheaply. You can search through your own content at a rich semantic level. You can re-combine your content in a myriad of different ways to create even more new products from your content. If you have done a good job of making your logical models semantically rich, you can even automate the extraction of content so well that you basically get new products from your content "for free" in terms of ongoing effort...

Now that semantically rich, machine readable data is clearly the next big thing on the internet in terms of content dissemination, you end up being able to pump out RDF triples or RDFa or microformats or good old CSV easily. You get to put a RESTian interface in front of your repository to allow consumers to self-serve out of it if that is what you want to do.

When this works, it is truly a thing of beauty. The value proposition is compelling. However, the history of the world is littered with examples - all the way back to the SGML days - of where it wasn't quite so simple in practice.

In order to leverage the benefits you really need to think matters through to another level of detail in all but the most trivial of publishing enterprises. Sadly, many historical XML initiatives (and SGML initiatives before them) in legislatures/parliaments have not gone to that extra level of analysis before breaking out XML editors and XSLT and cutting loose building systems.

Line/page numbers are a classic example of the extra level of detail I am talking about. Here are some examples:

  • An example of a floor amendment in the Illinois General Assembly
  • An example of an amendatory bill from the Irish Parliament.
  • An example of an amendatory bill from the British Parliament.
  • An example of a committee report from the US Senate

Note how important - I would argue central - line/page numbers are to what is going on here at a business level. In the standard XML model discussed above, line/page numbers are throwaway artifacts of the publishing back-end. They are not important and certainly should not be part of the logical data model.

But look at the problem from the perspective of the elected representatives, the drafting attorneys, the chamber clerks, the lobbyists... The line/page numbers are absolutely *central* to the business process of managing how the law at time T becomes the law at time T+1 (as discussed here). It is that process - that change process - that is central to what a legislature/parliament is all about.

I could write a book about this (and someday I probably will) but for now, I want to end with some points - each of which is deserving of its own detailed explication:

  • Bismarck once said that law and sausages are similar in that you really don't want to know how they are made. If value is to be derived from computer systems in legislatures/parliaments, it is necessary to get into the sausage machine and understand what is really going on inside.

    What you will find there is a structured and rigorous process (albeit a very complex one) but the bit that is truly structured is the *change regimen for documents* - not the logical structure of the documents themselves. Granted, when the law making process is finished and out pops a new statute or a new act, the document generally has an identifiable logical structure and generally no longer has line/page numbers. However, it spent most of its life up to that point "un-structured" from a classical XML perspective. If you are going to derive value from XML inside the legislature/parliament as opposed to downstream from it, you need to embrace that fact in my opinion.

  • On the face of it, it should be a simple matter to throw line/page numbers into the data model, right? It's just tags, right? Sadly, no.

    • Firstly, amendment structures have a nasty habit of overlapping logical structures, i.e. the amendment starts halfway through one paragraph and extends to midway through the second bullet point... This is hard to model inside XML (or SGML) as they are both rooted (pun intended) in the notion of one dominant, directed acyclic graph inside which all the content lives. Also, most XML validation techniques are ultimately based on Chomsky-esque production grammars that, again, have the concept of a dominant hierarchical decomposition at their core.
    • Secondly - this one is a doozy - line/page numbers only come into existence once a document is rendered. This creates a deep complexity because now part of what you need to manage in your data model, only comes into existence when the data model is processed through a rendering algorithm.
    • Thirdly - this one is the real kicker - line/page numbers are the result of applying hyphenation and justification (H+J) algorithms. Every word processor, every XML editor, every web browser, every DTP package, every typesetting system, every XSL:FO implementation, every DSSSL implementation on the planet that I know of has its own way of doing it. You are pretty much guaranteed to get different results when you switch H+J algorithms.

      Moreover, the result of the H+J is influenced by the fonts you have installed, the printer your computer has as its default printer, the version of PostScript being used...

  • I have heard of drafting attorneys saying "I did not spend years of my life and X dollars in law school just to type in line and page numbers". The implications of not addressing line/page number issues can be enormous for the value proposition.
  • Sadly, not only are H+J algorithms very different in different systems, they are often proprietary and essentially beyond human comprehension. Take Microsoft Word, for example: how many people do you think completely understand the complex business logic it applies in deciding how to render its pages and thus where the line/page numbers fall? The variables are numerous. The state space, huge.
  • Searching for hyphenation in patent databases produces a large number of hits.
  • I have seen folks pat themselves on the back because they have XML in their database and XML in their authoring system and therefore they *own* their own content and destiny. I would argue that unless you also own the algorithms that produced the rendering of that content, then you don't actually own your own data - at least not in legislative/parliamentary environments. If I cannot be sure that when I re-render this document in 10 years' time, I will get the same result all the way down to the line/page numbers critical to the business process, do I really own my own destiny?
  • A cherished tenet of the standard XML model is that content and presentation can and should be separated. In legislatures/parliaments it is vital that they are not separated. Many classically trained XML technologists find that deeply disturbing.
  • An excellent way of locking down line and page numbers of course is to render to a paginated form such as PDF. This works fine on the output side of the legislature/parliament but fails inside it because legislative documents are iterated through amendment cycles. The line/page numbers output on iteration one are the input to the second amendment cycle which produces a new set of line/page numbers...
  • I find myself frustrated at times when I hear folks talk about standardizing data formats for legislative materials as they often fly by the centrality of line/page numbers and go straight to purely logical models. It is particularly frustrating from an eDemocracy perspective because to do that right, in my opinion, you want to provide access to the work-in-progress of a legislature/parliament. Bills as they are being worked in Bismarck's sausage machine. Having deep electronic access to the laws after they are laws is great and good but wouldn't it be better for eDemocracy if we had access to them before they were baked?
  • You might be thinking that the solution to this whole problem is to do away with line/page number-based amendment cycles completely. I do not disagree. However, in many jurisdictions we are dealing with incredibly long standing, incredibly historic, tried-and-trusted business processes inside what are the most risk averse and painstakingly detail-oriented institutions I have ever come across, staffed with some of the sharpest and most conscientious minds you will find anywhere. The move away from line/page numbers will not be swift.
  • Finally, although I hope I have convinced you that line/page numbers are important and create problems complex and subtle in legislative/parliamentary IT, I don't want to appear negative in my analysis. It is entirely possible to get significant business benefit out of IT - and XML in particular - in legislatures/parliaments. That is what we have done in KLISS and more generally, what is going on in LWB. It is just that you need to bring a lot of knowledge about how legislatures/parliaments actually work into the analysis and design. Implementing straight out of the standard XML playbook isn't going to cut the mustard.
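The "secondly" and "thirdly" points above, that line numbers only exist once a document is rendered and shift when the H+J engine or its environment changes, can be illustrated with a toy Python sketch. Here textwrap's greedy line filling stands in, very crudely, for a real H+J algorithm, and the width parameter is a proxy for everything that actually moves line breaks around (fonts, justification, hyphenation dictionaries, printer metrics):

```python
import textwrap

TEXT = ("Be it enacted that this purely illustrative sentence exists only "
        "to demonstrate that line numbers are an artifact of rendering and "
        "not a property of the underlying logical content of the document.")

def line_of(text, phrase, width):
    # "Cite" a phrase by the line it lands on when the text is rendered
    # at a given measure. Greedy filling is a crude stand-in for a real
    # hyphenation and justification (H+J) engine.
    for n, line in enumerate(textwrap.wrap(text, width=width), start=1):
        if phrase in line:
            return n
    return None

# The same phrase lands on different line numbers at different measures,
# so a citation like "page 3, line 17" is only meaningful relative to one
# specific rendering engine and one specific configuration of it.
print(line_of(TEXT, "artifact", 30))  # a narrow measure
print(line_of(TEXT, "artifact", 60))  # a wider measure: a different line number
```

The data model did not change between the two calls; only the rendering did. That is exactly why line/page number citations cannot live purely in the logical layer.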

Next time: the complex nature of amendatory actions.

Thursday, May 27, 2010

XML in legislature/parliament environments

Last time I talked about KLISS I said I would talk about some of the reasons why a legislature/parliament is not a happy hunting ground for the blind application of standard IT architecture patterns from the document management/content management/publishing space.

The first, often quite glaring architectural non-sequitur goes like this:

1) legislatures/parliaments are full of very structured documents : bills, resolutions, journals, calendars, statutes, annotations...all have readily apparent structure.

2) XML is all about handling very structured documents.

3) Therefore, classic XML approaches fit legislatures/parliaments.

There are a variety of reasons why this analysis is, in my opinion, wrong, but it will take me a number of posts to explain why.

Before I start, let me point out that XML *has* an enormous role to play in legislatures/parliaments but it cannot be simply applied blindly per the standard XML value model without causing significant problems. The 7 main reasons are:

a) the centrality of line/page number citation in amendment cycles

b) the complex nature of amendatory actions

c) the critical nature of fidelity with historically produced renderings

d) the fluid nature of work-in-progress legal assets

e) the complexity of amendment cycle business rules that often pre-date computers and cannot be changed to make life easier for modern software

f) the subtle inter-play between legal content and legal content renderings

g) content aggregation and derived document types

Next up : "the centrality of line/page number citation in amendment cycles"

Tuesday, May 25, 2010

law.gov goes to Washington

The law.gov initiative has arrived in Washington. Stream of today's meeting at the Committee on House Administration is here.

I strongly recommend the law.gov video set now up on YouTube covering the meetings held at a variety of law schools over the last few months. Much fascinating stuff.

Monday, May 24, 2010

KLISS. First things first. What is a legislature/parliament?

I thought I'd start talking about the technical aspects of KLISS at the highest possible level...

Before progress can be made modelling any domain inside a computer system, it is necessary to put a box around the highest possible view of the domain, thereby establishing what is inside the model boundary and what is outside.

At the highest possible level of abstraction, a legislature/parliament is a black box, into which a corpus of law (at some time, T) is injected. The legislature/parliament then acts on that corpus to produce an updated corpus of law at some time T+1.

That's the highest level model that I have found useful to start discussions of enterprise legislative architectures. Some interesting points about this:

0) The "Corpus of Law" is easier to state than it is to enumerate. It includes the primary law for the legislature/parliament obviously but also, depending on the jurisdiction, secondary forms of law and "softer" forms of regulation such as chamber rules.

1) The "Corpus of Law" subject to modification by the legislature/parliament is smaller than the corpus in force. For example, federal-level laws bind in various ways at state level and yet are not modifiable at state level. Some organizations adopt rules of order, such as Robert's or Mason's, that are not directly modifiable.

2) Note the feedback loop. The corpus of law at time T is the input to the function (the black box) that produces the law at time T+1, and that then becomes the input for the next legislative act. This feedback loop is the primary source of complexity in legislative informatics.

3) Note the self reference. The corpus often includes laws that specify how laws are made. Those laws are themselves often subject to change inside the black box.

4) In order to qualify as a democratic entity, the transition from Corpus(T) to Corpus(T+1) must be rigorously audit trailed and that audit trail is *itself* an output of the legislature/parliament. I.e. journals, committee reports etc. that expose how Corpus(T) became Corpus(T+1).

5) Note the word "act". Legislatures/parliaments act on a corpus, i.e. changes are proposed, debated and potentially implemented. The black box can be thought of as an event driven machine. A machine in which events happen (new bills introduced, amendments proposed, votes taken etc.) and these events cause downstream events/actions.

6) In order for the audit trail to be comprehensive and accurate it must be able to cite - not just documents like bills, committee meeting minutes etc. - but cite those documents as they looked at arbitrary *points in time*.
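The points above can be pulled together in a minimal Python sketch of the black box: each legislative event maps Corpus(T) to Corpus(T+1), and every transition is recorded in an audit trail that lets any point-in-time state be cited unambiguously. All the names and the content-addressing scheme here are illustrative inventions of mine for this post, not the KLISS design.

```python
import hashlib
import json

def fingerprint(corpus):
    # Content-address a corpus snapshot so a citation to a state at an
    # arbitrary point in time is unambiguous.
    blob = json.dumps(corpus, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

class Legislature:
    """The black box: a function from Corpus(T) to Corpus(T+1)."""

    def __init__(self, corpus):
        self.corpus = corpus                          # Corpus(T)
        self.audit = [("genesis", fingerprint(corpus))]

    def act(self, event, change):
        # An event (bill passage, amendment adoption, ...) produces
        # Corpus(T+1); the audit trail records both the event and the
        # exact resulting state - the feedback loop in action.
        self.corpus = {**self.corpus, **change}
        self.audit.append((event, fingerprint(self.corpus)))

leg = Legislature({"sec-1": "Original text."})
leg.act("Bill A enacted", {"sec-2": "New section."})
leg.act("Bill B amends sec-1", {"sec-1": "Amended text."})

# The audit trail exposes how Corpus(T) became Corpus(T+1), state by state.
for event, fp in leg.audit:
    print(event, fp)
```

Note that the output of each `act` call is the input to the next one, which is the feedback loop identified in point 2, and the audit trail itself is an output, per point 4.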

Some formalisms I find useful in thinking through this model:
Speech acts provide a nice model for reasoning about actions and reactions, causes and effects inside a legislature.

Rigid designators provide a nice model for reasoning about citations and dealing with the incredibly rich, temporally laced, network graph that legislatures both consume and produce.

Eventual consistency provides a nice model for reasoning about the synchronization of all the disparate views of legislative outputs that must somehow be kept consistent: bill status, agendas, minutes etc., each produced in paper, PDF and HTML versions PLUS Twitter feeds, RSS feeds etc.

Peter Suber's paradox of self amendment is an excellent analysis of the problems inherent in a system that operates under a set of modifiable rules in which modifications to the rules must occur according to the rules set out in the set of modifiable rules :-)

Next time: some thoughts on the main reasons why legislatures/parliaments are very different animals from most process-centric/document-centric/publishing-centric organizations and why most "classic" design patterns need to be modified if they are to be successful in Legislative Enterprise Architectures.

Saturday, May 22, 2010

Phew!

Well, it has been almost *two years* since I came to live in Lawrence, Kansas to work on the KLISS project. KLISS is by far the most all-inclusive vision for eDemocracy I have come across and it has been very rewarding and very intense work, extending our core LWB product to be able to implement the vision.

So intense, that I have had little brain-space to devote to writing/blogging...

I hope to be able to devote some time to blogging/writing in the months ahead. The way LWB meets the challenges of the KLISS vision is, I think, interesting and I'm going to try to find ways to lay out the key technical concepts here.

When asked to outline KLISS I generally start by saying that KLISS is an application in the same way that Chinatown is a restaurant. It is absolutely vast in its scope. It touches everything from creating new bills to realtime display of legislative activity to remote testimony at committees to authentication of born-digital legal materials. And everything in between.

So anyway, if document management, or XML, or jurisprudence, or temporal logic, or political science, or event driven architectures, or democracy or digital preservation, or authentication, or legislative intent, or law.gov or data.gov or semantics is your thing, you might be interested in what will appear here over the next while!

If you work in this field or are involved in creating applications in similar areas, please get in touch. I'd love to compare notes.

Tuesday, May 04, 2010

Cohen plays Sligo

Typical. I move to the other side of the planet and then Cohen decides to play just down the road from my house.

Saturday, April 24, 2010

Executable regulations. And so it begins..in Python

A couple of years ago, I wrote an article on ITWorld about how software may provide a way to improve financial regulation.

It appears that the SEC are heading in this direction for capturing the semantics of complex financial instruments...and doing it with Python to boot.