
Sunday, May 30, 2010

XML in legislature/parliament environments : The centrality of line/page number citation in amendment cycles

Last time I talked about KLISS, I listed 7 reasons why the standard analysis of how XML can/should be used in legislatures/parliaments is simple, elegant and, I would argue, wrong.

The first reason I listed in support of that strong assertion is the centrality of line/page number citation in amendment cycles. That is the topic I want to address in this post.

The standard XML model goes something like this:

1 - find out what the publications/outputs are (be they paper, CD-ROMs, websites, e-books, whatever)

2 - classify the outputs from a content perspective i.e. bills are different from journals are different from...

3 - create hierarchical schemas that capture the logical structure of the outputs. Separate out the logical structure from the "accidents" of the rendering you are looking at. I.e. separate the content from the presentation of the content. Things like font, line breaks, page breaks, list ornamentations, footnote locations etc.

4 - figure out how to programmatically layer on the presentation information using stylesheet technologies, bespoke rendering algorithms etc.

5 - Create author/edit workflows that concentrate on the logical view of the data. Leave as many aspects of presentation to be applied automatically in the final publication production systems as possible.

If all goes according to plan, applying this standard model leaves you in a very happy place. Your content is independent of any one output format. You can create new renderings (which creates new products) from existing content cheaply. You can search through your own content at a rich semantic level. You can re-combine your content in a myriad of different ways to create even more new products from your content. If you have done a good job of making your logical models semantically rich, you can even automate the extraction of content so well that you basically get new products from your content "for free" in terms of ongoing effort...
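As a sketch of what steps 3 and 4 of the model look like in practice, here is a minimal, hypothetical example. The element names and the rendering rules are invented for illustration and not taken from any real legislative schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical logical model: structure only - no fonts, pages or line breaks.
LOGICAL = """
<bill id="hb-2136">
  <title>An act concerning roads and bridges</title>
  <section num="1">
    <text>K.S.A. 68-404 is hereby amended to read as follows ...</text>
  </section>
</bill>
"""

def render_plain_text(doc: ET.Element) -> str:
    """One of many possible 'stylesheets': layers presentation
    (casing, numbering, indentation) onto the logical structure."""
    lines = [doc.findtext("title").upper()]
    for section in doc.iter("section"):
        lines.append(f"  Sec. {section.get('num')}. {section.findtext('text')}")
    return "\n".join(lines)

bill = ET.fromstring(LOGICAL)
print(render_plain_text(bill))
```

The point is that the casing, the "Sec." ornamentation and the indentation live in the rendering function, not in the data - swap in a different function and you get a different product from the same content.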

Now that semantically rich, machine-readable data is clearly the next big thing on the internet in terms of content dissemination, you end up being able to pump out RDF triples or RDFa or microformats or good old CSV easily. You get to put a RESTian interface in front of your repository to allow consumers to self-serve out of it if that is what you want to do.
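For instance, once the logical model exists, spinning a new product such as a CSV index out of it takes only a few lines of code. The sketch below again uses invented element names:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Same hypothetical logical model as before, now with two sections.
LOGICAL = """
<bill id="hb-2136">
  <section num="1"><text>K.S.A. 68-404 is hereby amended ...</text></section>
  <section num="2"><text>K.S.A. 68-405 is hereby repealed.</text></section>
</bill>
"""

# Self-serve extraction: each section becomes one CSV row.
bill = ET.fromstring(LOGICAL)
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["bill", "section", "text"])
for s in bill.iter("section"):
    writer.writerow([bill.get("id"), s.get("num"), s.findtext("text")])
print(buf.getvalue())
```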

When this works, it is truly a thing of beauty. The value proposition is compelling. However, the history of the world is littered with examples - all the way back to the SGML days - of where it wasn't quite so simple in practice.

In order to leverage the benefits, you really need to think matters through to another level of detail in all but the most trivial of publishing enterprises. Sadly, many historical XML initiatives (and SGML initiatives before that) in legislatures/parliaments have not gone to that extra level of analysis before breaking out XML editors and XSLT and cutting loose building systems.

Line/page numbers are a classic example of the extra level of detail I am talking about. Here are some examples:

  • An example of a floor amendment in the Illinois General Assembly
  • An example of an amendatory bill from the Irish Parliament.
  • An example of an amendatory bill from the British Parliament.
  • An example of a committee report from the US Senate

Note how important - I would argue central - line/page numbers are to what is going on here at a business level. In the standard XML model discussed above, line/page numbers are throwaway artifacts of the publishing back-end. They are not important and certainly should not be part of the logical data model.

But look at the problem from the perspective of the elected representatives, the drafting attorneys, the chamber clerks, the lobbyists... The line/page numbers are absolutely *central* to the business process of managing how the law at time T becomes the law at time T+1 (as discussed here). It is that process - that change process - that is central to what a legislature/parliament is all about.

I could write a book about this (and someday I probably will) but for now, I want to end with some points - each of which is deserving of its own detailed explication:

  • Bismarck once said that law and sausages are similar in that you really don't want to know how they are made. If value is to be derived from computer systems in legislatures/parliaments, it is necessary to get into the sausage machine and understand what is really going on inside.

    What you will find there is a structured and rigorous process (albeit a very complex one) but the bit that is truly structured is the *change regimen for documents* - not the logical structure of the documents themselves. Granted, when the law making process is finished and out pops a new statute or a new act, the document generally has an identifiable logical structure and generally no longer has line/page numbers. However, it spent most of its life up to that point "un-structured" from a classical XML perspective. If you are going to derive value from XML inside the legislature/parliament as opposed to downstream from it, you need to embrace that fact in my opinion.

  • On the face of it, it should be a simple matter to throw line/page numbers into the data model, right? It's just tags, right? Sadly, no.

    • Firstly, amendment structures have a nasty habit of overlapping logical structures, i.e. the amendment starts halfway through one paragraph and extends to midway through the second bullet point... This is hard to model inside XML (or SGML) as they are both rooted (pun intended) in the notion of one dominant, directed acyclic graph inside which all the content lives. Also, most XML validation techniques are ultimately based on Chomsky-esque production grammars that, again, have the concept of a dominant hierarchical decomposition at their core.
    • Secondly - this one is a doozy - line/page numbers only come into existence once a document is rendered. This creates a deep complexity because now part of what you need to manage in your data model, only comes into existence when the data model is processed through a rendering algorithm.
    • Thirdly - this one is the real kicker - line/page numbers are the result of applying hyphenation and justification (H+J) algorithms. Every word processor, every XML editor, every web browser, every DTP package, every typesetting system, every XSL:FO implementation, every DSSSL implementation on the planet that I know of has its own way of doing it. You are pretty much guaranteed to get different results when you switch H+J algorithms.

      Moreover, the result of the H+J is influenced by the fonts you have installed, your default printer, the version of PostScript being used...

  • I have heard of drafting attorneys saying "I did not spend years of my life and X dollars in law school just to type in line and page numbers". The implications of not addressing line/page number issues can be enormous to the value proposition.
  • Sadly, not only are H+J algorithms very different in different systems, they are often proprietary and essentially beyond human comprehension. Take Microsoft Word, for example: how many people do you think completely understand the complex business logic it applies in deciding how to render its pages and thus where the line/page numbers fall? The variables are numerous. The state space, huge.
  • Searching for hyphenation in patent databases produces a large number of hits.
  • I have seen folks pat themselves on the back because they have XML in their database and XML in their authoring system and therefore they *own* their own content and destiny. I would argue that unless you also own the algorithms that produced the rendering of that content, then you don't actually own your own data - at least not in legislative/parliamentary environments. If I cannot be sure that when I re-render this document in 10 years' time I will get the same result all the way down to the line/page numbers critical to the business process, do I really own my own destiny?
  • A cherished tenet of the standard XML model is that content and presentation can and should be separated. In legislatures/parliaments it is vital that they are not separated. Many classically trained XML technologists find that deeply disturbing.
  • An excellent way of locking down line and page numbers of course is to render to a paginated form such as PDF. This works fine on the output side of the legislature/parliament but fails inside it because legislative documents are iterated through amendment cycles. The line/page numbers output on iteration one are the input to the second amendment cycle which produces a new set of line/page numbers...
  • I find myself frustrated at times when I hear folks talk about standardizing data formats for legislative materials as they often fly by the centrality of line/page numbers and go straight to purely logical models. It is particularly frustrating from an eDemocracy perspective because to do that right, in my opinion, you want to provide access to the work-in-progress of a legislature/parliament. Bills as they are being worked in Bismarck's sausage machine. Having deep electronic access to the laws after they are laws is great and good, but wouldn't it be better for eDemocracy if we had access to them before they were baked?
  • You might be thinking that the solution to this whole problem is to do away with line/page number-based amendment cycles completely. I do not disagree. However, in many jurisdictions we are dealing with incredibly long-standing, incredibly historic, tried-and-trusted business processes inside what are the most risk-averse and painstakingly detail-oriented institutions I have ever come across, staffed with some of the sharpest and most conscientious minds you will find anywhere. The move away from line/page numbers will not be swift.
  • Finally, although I hope I have convinced you that line/page numbers are important and create problems complex and subtle in legislative/parliamentary IT, I don't want to appear negative in my analysis. It is entirely possible to get significant business benefit out of IT - and XML in particular - in legislatures/parliaments. That is what we have done in KLISS and more generally, what is going on in LWB. It is just that you need to bring a lot of knowledge about how legislatures/parliaments actually work into the analysis and design. Implementing straight out of the standard XML playbook isn't going to cut the mustard.
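The fragility of H+J that several of the points above turn on is easy to demonstrate. Below is a toy greedy line-breaker - no real engine works exactly like this, and real H+J adds hyphenation, kerning and justification on top. Changing the measure by just two characters, a stand-in for swapping fonts or rendering engines, moves a cited word onto a different line:

```python
def break_lines(words, measure):
    """Toy greedy line-breaker: pack words into lines no wider than
    `measure` characters. A crude stand-in for a real H+J engine."""
    lines, current = [], ""
    for w in words:
        candidate = f"{current} {w}".strip()
        if len(candidate) <= measure:
            current = candidate
        else:
            lines.append(current)
            current = w
    lines.append(current)
    return lines

# Invented bill-style sentence, purely for illustration.
text = ("On and after July 1 the secretary of transportation may adopt "
        "rules and regulations for the maintenance of bridges").split()

for measure in (34, 36):  # two "engines" whose measures differ by 2 chars
    lines = break_lines(text, measure)
    # Which line would a citation of the word 'maintenance' point at?
    n = next(i for i, ln in enumerate(lines, 1) if "maintenance" in ln)
    print(f"measure={measure}: 'maintenance' falls on line {n} of {len(lines)}")
```

Two renderings of the same words, two different line numbers for the same word - and therefore two incompatible sets of citations.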

Next time: the complex nature of amendatory actions.


maacl said...

Very interesting read. Your characterization of the legislative machine is surely correct and it is probably naive to think that it will change overnight, but I hope this does not stop you from trying to drag it into this century. Word-, line- and paragraph-based amendments to any document, be it an act, law or contract, *should* be a thing of the past. Any change should result in a complete restatement of the relevant document, which can be diffed against the other version(s). From the scenarios I have looked at (mostly complex contracts with a lifespan of 5+ years and many draft amendments from several sources) this provides the greatest transparency and best quality (smallest number of errors and best language quality - drafters tend to draft differently when using amendments instead of restatements). What proponents of the old system might not readily disclose to you is how many errors it actually results in. Asking that question has several times given me the leverage to move further away from the old model than would otherwise have been possible.

Anonymous said...

Have you determined a reasonable markup scenario? Semantic + page/line somehow overlaid/intermixed?

Love to see an example.

Also the process flow across amendments.


Sean McGrath said...


Dave asks, "Have you determined a reasonable Markup scenario: Semantic + page/line somehow overlaid/intermixed?".

Yes. It is a compromise solution because this whole area is festooned with wicked problems.

After I get through the current series of posts on problem analysis, I intend to post a series on LEA (Legislative Enterprise Architecture) which will include an explanation of the markup solution we use.


David Collier-Brown said...

I'd suggest that readers want both the "diffs" and the amended document, as well as the option of viewing the document at any point in time.

The latter is quite important: if I take an action on January 12 which is no longer illegal on January 13th, it's rather important for me to know which law applies to me, and what the changes were.

And then I'll want to go to the legislative history to see if I have an argument that the law never should have been applied to me.

The use case is independent of the means of providing the information, but both a line-based and a clause-based amendment mechanism must honor it.

Benjamin said...

It seems to me that some sort of mapping between page/line numbers of a particular rendering of the document and some sort of logical marker in the document would have the ability to provide a 'bridge' of sorts between the two 'interfaces' of the document. Particularly, if the logical marker would be of the form Article, Section, Subsection, Paragraph, and even Sentence, which would still be relatively intuitive.

bboissin said...

Wow, I didn't know other countries had those issues. In the French legislation drafting system, there are no dependencies on the form used to represent the law.

Everything (amendments and law) links to each other using numbers (alinea, sentence, section, etc.). I don't know what the reason is; maybe the fact that everything is codified, so the page number has no meaning (the consolidated version is not tied to any format and only "virtual").

Very interesting series of posts though!

bboissin said...

Last question: any reference for the H+J algorithm? Google doesn't find anything.

Sean McGrath said...


Re H+J. See http://desktoppub.about.com/od/glossary/g/HyphenJustify.htm.


bboissin said...

Thanks Sean! (Indeed I see how problematic it can be for you).

Aigars Mahinovs said...

There are some legislatures where pages/lines are not used, but instead the paragraphs and their parts are numbered and all changes reference those. Also the numbers mostly don't change - if a paragraph is deleted, it stays as an empty numbered paragraph (with a note saying - erased in year X by bill Y) and new paragraphs are inserted either as new parts or using 139' notation (read as 139 prim).

Sean McGrath said...


Yes indeed. Some legislatures/parliaments do not use line/page numbers and instead use a "medium neutral" citation format. This is a very good thing but I'm afraid it will take quite some time for line/page-centric legislatures/parliaments to move to medium neutral citation approaches.

As for the numbers not changing when numbered paragraphs are deleted: this is an excellent example of where modern conveniences such as auto-numbering can actively get in the way of good legal drafting. Numbers, once allocated, cannot "move", even if that means leaving gaps in the numbering. The numbers are not merely rendering artifacts to be changed by software on a whim. The numbers are truly part of the content and need to be treated as such.

The content and the presentation cannot be prised apart in law. Medium neutral formats remove the problem of page/line fidelity, but the numbering of micro-document fragments remains a problem, unless all "ornamentation" of paragraphs is considered first class content and not merely computed filigree :-)



Dan McCreary said...

Nice post. Very good information on how XML is used to draft legislation and the workflow around bill authoring. We just released a large library of open source tools for doing "structured" search and retrieval on legislative documents based around the eXist native XML system as part of the Library of Congress NDIIPP project. You can see the reports here: http://www.mnhs.org/preserve/records/legislativerecords/pilot.htm#final

Our analysis of four alternative architectures shows that using native XML is the best way to perform these functions.

Sean McGrath said...

Dr. Data Dictionary,

I agree that classically formulated "structured", "semantic" XML is indeed perfect for search/retrieval functions, but the mistake that many make is to conclude from that that it should also be the "master copy" for author/edit and for persistence.

Those who make that mistake, in my opinion, get themselves into trouble because of the mismatch between the ideal structures for search/retrieval and the ideal structures for author/edit. The other (related) problem they can run into is trying to provide a "one size fits all" data model, i.e. a model that is used for all author/edit, all audit trail, all search/retrieval, all research...

The way KLISS works is that it purposely eschews any particular search/retrieval specialism in the normative data model. Then, using the RESTian interface to the KLISS "time machine", arbitrary "views" over the data are constructed to meet specific search/retrieval needs.

For example, an XML data model for legislative intent research can be crafted. Then, using the event notification features of KLISS, that model can be maintained as an idempotent, read-only view over the true repository, allowing users to interact with *that*, rather than the normative data.

In KLISS, you can have as many of these "views" as you like. They all work the same way from an architecture perspective. Some might be XML/XQuery, some might be RDF triple stores, some might be NLP-oriented etc.

(Some of the posts after this one in the KLISS series, expand more on the time machine, the "views" etc. and may be of interest.)