Units of information and micro-documents
To my mind, the most important part of modeling a document-centric world such as legislatures for author/edit is decided what the boundaries are between information objects. After all, at some level, some information object needs to be edited. We will need all the classic CRUD functions for these objects so we need to pick them carefully.When I look at a corpus of legal information I see a fractal in which the concept of "document" exhibits classic self-similarity. Is a journal a document? How about a title of statute? Or a bill? How about the Uniform Commercial Code. Is that a document?
Pretty much any document in a legislature can reasonably be thought of as an aggregation of smaller documents. A journal is an iteration of smaller chamber event documents. A title of statute is an iteration of statute sections. A volume of session laws is an iteration of acts...and so on.
This creates an interesting example of a so called banana problem. How do you know when to stop de-composing a document into smaller pieces?
My rule of thumb is to stop decomposing when the information objects created by the decomposition cease to be very useful in stand-alone form. Sections of statute are useful stand-alone. The second half of a vote record less so. Bills are useful standalone. The enacting clause less so.
The good news is that when you do this information de-composition, the size of the information objects that require direct author/edit support get smaller and they get less numerous. They get smaller because you do not need an editor for titles of statute. A title is what you get after you aggregate lots of smaller documents together. Don't edit the aggregate. Edit the atoms. They get less numerous because the decomposition exposes many shared information objects. For example, the bill amendment may be a document used in chamber but it is also in the journal. Referring a bill to a committee will result in a para in the journal but will also result in an entry in the bill status application...and so on.
In KLISS we generally edit units of information – not aggregates. We have a component that knows how to join together any number of atoms to create aggregates. Moreover, aggregates can be posted into the KLISS time machine where they become atoms, subject to further aggregation. A good example would be a chamber event document that gets aggregated into a journal but the resultant journals are themselves aggregated into a session publication known as the permanent journal.
Semantics and Micro-formats
KLISS makes extensive use of ODF for units of information in the asset repository. We encode metadata as property-value pairs inside the ODF container. We also leverage paragraph and character style names for encoding "block" and "inline" semantics. As discussed previously line and page numbers are often critically important to the workflows and we embed these inside ODF markup too.
The thin client author/edit approach
Some of our units of information are sufficiently small and sufficiently metadata-oriented that we can "author" them using Web-based forms. In other words, the asset will be stored in the time machine as an ODT document but the users author/edit experience of it will be a web form. We make extensive use of django for these.
This is particularly true on the committee/chamber activity side of a legislature where there are a discrete number of events that make up 80% of the asset traffic and the user interface can be made point-and-click with little typing.
The thick client author/edit approach
Some of our units of information are classic word-processor candidates. i.e. a 1500 page bill, a 25 page statute section consisting of a single landscape table with running headers, a 6 level deep TOC with tab leaders and negative first line indents...For these we use a thick client application created using Netbeans RCP which embeds OpenOffice. We make extensive use of the UNO API to automate OpenOffice activities. The RCP container also handles identity management, role based access control and acts as a launchpad for mini-applications – created in Java and/or Jython – that further extend our automation and customization capabilities on the client side.
RESTian time-machine interface
Although we tend to speak of two clients for author/edit in KLISS – the thick client and the thin client – in truth, the set of clients is open-ended as all interaction with the KLISS time machine is via the same RESTian interface. In fact, the KLISS server side does not know what was used to author/edit any document. This gives us an important degree of separation between the author/edit subsystem and the rest of the system. History has shown, that the most volatile part of any software application is the part facing the user. We need to know that we can evolve and create brand new author/edit environments without impacting the rest of the KLISS ecosystem.
Why ODT?
ODT was chosen as it presented the best trade off when all the competing requirements of an author/edit subsystem for legislatures were analyzed. A discussion of the issues on the table when this selection was made are listed here. To that list I would also add that ODT is, to my knowledge, without IP encumberments. Also, the interplay between content and presentation is so important in this domain that it is vital to have free and unfettered access to the rendering algorithms in order to feel fully in possession of the semantics of the documents. I'm not saying that a large corpus of C++ is readily understandable at a deep level but I take great comfort in knowing that I can, in principle, know everything there is to know about how my system has rendered the law by inspecting the rendering algorithms in OpenOffice.
Next up, long term preservation and authentication of legal materials in KLISS.