
Monday, October 11, 2010

KLISS: Author/edit sub-systems in legislative environments

Last time in this KLISS series, I talked some more about the KLISS workflow model. The time has come (finally!) to talk about how that workflow model incorporates author/edit on the client side, i.e. the creation or update of legislative artifacts such as bills, statute sections, chamber events, meeting minutes etc. Earlier in this series I explained in reasonable detail why the author/edit subsystems can be neither simple little text editors nor classic heavyweight XML editors, so I won't go over that ground again here; I'll just cut to the chase about how KLISS actually does it.

Units of information and micro-documents

To my mind, the most important part of modeling a document-centric world such as a legislature for author/edit is deciding where the boundaries between information objects lie. After all, at some level, some information object needs to be edited. We will need all the classic CRUD functions for these objects, so we need to pick them carefully.

When I look at a corpus of legal information I see a fractal in which the concept of "document" exhibits classic self-similarity. Is a journal a document? How about a title of statute? Or a bill? How about the Uniform Commercial Code? Is that a document?

Pretty much any document in a legislature can reasonably be thought of as an aggregation of smaller documents. A journal is an iteration of smaller chamber event documents. A title of statute is an iteration of statute sections. A volume of session laws is an iteration of acts...and so on.

This creates an interesting example of a so-called banana problem ("I know how to spell 'banana', I just don't know when to stop"). How do you know when to stop decomposing a document into smaller pieces?

My rule of thumb is to stop decomposing when the information objects created by the decomposition cease to be very useful in stand-alone form. Sections of statute are useful stand-alone. The second half of a vote record less so. Bills are useful standalone. The enacting clause less so.

The good news is that when you do this information decomposition, the information objects that require direct author/edit support get both smaller and less numerous. They get smaller because you do not need an editor for titles of statute: a title is what you get after you aggregate lots of smaller documents together. Don't edit the aggregate. Edit the atoms. They get less numerous because the decomposition exposes many shared information objects. For example, a bill amendment may be a document used in chamber, but it also appears in the journal. Referring a bill to a committee will result in a para in the journal but will also result in an entry in the bill status application...and so on.

In KLISS we generally edit units of information – not aggregates. We have a component that knows how to join together any number of atoms to create aggregates. Moreover, aggregates can be posted into the KLISS time machine where they become atoms, subject to further aggregation. A good example would be a chamber event document that gets aggregated into a journal but the resultant journals are themselves aggregated into a session publication known as the permanent journal.
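The atoms-into-aggregates idea above can be sketched in a few lines. This is a hypothetical illustration, not the real KLISS component: the actual system works on ODT files in the asset repository, whereas here "atoms" are plain dicts so the shape of the idea is visible. The key property is that an aggregate is itself shaped like an atom, so it can be posted back and aggregated again.

```python
def aggregate(atoms, title):
    """Join atom documents into a single aggregate document.

    The aggregate has the same shape as an atom, so it can be posted
    back into the repository and aggregated again later (e.g. daily
    journals rolled up into a permanent journal).
    """
    return {
        "title": title,
        "body": "\n\n".join(atom["body"] for atom in atoms),
        "sources": [atom["title"] for atom in atoms],
    }

# Chamber event atoms -> a daily journal aggregate
events = [
    {"title": "HB 2001 introduced", "body": "HB 2001 was introduced..."},
    {"title": "HB 2001 referred", "body": "HB 2001 was referred to committee..."},
]
journal = aggregate(events, "Journal, Day 1")

# The journal is itself an atom: roll daily journals into a permanent journal
permanent = aggregate([journal], "Permanent Journal")
```

The same `aggregate` function handles both levels, which is the point: there is no separate "journal editor" or "permanent journal editor", only atoms and one aggregation component.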

Semantics and Micro-formats


KLISS makes extensive use of ODF for units of information in the asset repository. We encode metadata as property-value pairs inside the ODF container. We also leverage paragraph and character style names for encoding "block" and "inline" semantics. As discussed previously, line and page numbers are often critically important to the workflows, and we embed these inside the ODF markup too.
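Because an ODT file is just a zip archive with well-known XML parts inside, the property-value metadata can be read with nothing but the standard library. The sketch below pulls the `meta:user-defined` name/value pairs out of `meta.xml`; the property name `bill-number` in the comment is invented for illustration, and the real KLISS property vocabulary is not shown here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# ODF metadata namespace, per the OASIS OpenDocument specification
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"


def user_defined_metadata(odt_bytes):
    """Read meta:user-defined name/value pairs from an ODT container.

    An ODT file is a zip archive; document-level metadata lives in the
    meta.xml entry. A property might be e.g. name="bill-number",
    value="HB 2001" (hypothetical example).
    """
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as zf:
        root = ET.fromstring(zf.read("meta.xml"))
    props = {}
    for el in root.iter(f"{{{META_NS}}}user-defined"):
        props[el.get(f"{{{META_NS}}}name")] = el.text or ""
    return props
```

The same zip-plus-XML approach extends to `content.xml`, where the paragraph and character style names carrying the "block" and "inline" semantics live.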

The thin client author/edit approach


Some of our units of information are sufficiently small and sufficiently metadata-oriented that we can "author" them using Web-based forms. In other words, the asset will be stored in the time machine as an ODT document but the user's author/edit experience of it will be a web form. We make extensive use of Django for these.

This is particularly true on the committee/chamber activity side of a legislature, where a small number of discrete event types make up 80% of the asset traffic and the user interface can be made point-and-click with little typing.

The thick client author/edit approach


Some of our units of information are classic word-processor candidates: a 1,500-page bill, a 25-page statute section consisting of a single landscape table with running headers, a 6-level-deep TOC with tab leaders and negative first-line indents... For these we use a thick client application created using the NetBeans RCP which embeds OpenOffice. We make extensive use of the UNO API to automate OpenOffice activities. The RCP container also handles identity management and role-based access control, and acts as a launchpad for mini-applications – created in Java and/or Jython – that further extend our automation and customization capabilities on the client side.

RESTian time-machine interface


Although we tend to speak of two clients for author/edit in KLISS – the thick client and the thin client – in truth the set of clients is open-ended, as all interaction with the KLISS time machine is via the same RESTian interface. In fact, the KLISS server side does not know what was used to author/edit any document. This gives us an important degree of separation between the author/edit subsystem and the rest of the system. History has shown that the most volatile part of any software application is the part facing the user. We need to know that we can evolve and create brand new author/edit environments without impacting the rest of the KLISS ecosystem.
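To make the idea of a single uniform interface concrete, here is a sketch of what a client-side helper might look like. The host name, URL paths, and the `at` query parameter below are all invented for illustration – the real KLISS URL scheme is not public – and the functions only *describe* requests rather than send them, so the shape of the interface is the focus. Any client, thick or thin or yet-to-be-written, would speak this same vocabulary.

```python
BASE = "https://timemachine.example.org"  # placeholder host, not the real one


def put_asset(path, odt_bytes):
    """Describe a RESTian PUT of an ODT asset into the time machine."""
    return {
        "method": "PUT",
        "url": f"{BASE}/assets/{path}",
        "headers": {"Content-Type": "application/vnd.oasis.opendocument.text"},
        "body": odt_bytes,
    }


def get_asset(path, at=None):
    """Describe a GET of an asset, optionally as-of a point in time.

    The hypothetical 'at' parameter is what would make this a *time
    machine* rather than a plain document store.
    """
    url = f"{BASE}/assets/{path}"
    if at is not None:
        url += f"?at={at}"
    return {"method": "GET", "url": url, "headers": {}, "body": None}
```

Because the server only ever sees these HTTP interactions, it genuinely cannot know whether the ODT body came from OpenOffice, a Django form, or a batch script – which is exactly the decoupling described above.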

Why ODT?


ODT was chosen because it presented the best trade-off when all the competing requirements of an author/edit subsystem for legislatures were analyzed. A discussion of the issues on the table when this selection was made is listed here. To that list I would also add that ODT is, to my knowledge, free of IP encumbrances. Also, the interplay between content and presentation is so important in this domain that it is vital to have free and unfettered access to the rendering algorithms in order to feel fully in possession of the semantics of the documents. I'm not saying that a large corpus of C++ is readily understandable at a deep level, but I take great comfort in knowing that I can, in principle, know everything there is to know about how my system has rendered the law by inspecting the rendering algorithms in OpenOffice.

Next up, long term preservation and authentication of legal materials in KLISS.

Thursday, October 07, 2010

ActiveMQ guru required

Looking for an ActiveMQ guru to work 6 month contract out of Lawrence, KS. Possibility for full time post after that for the right person.

Thursday, September 30, 2010

Cutter Consortium Summit, Boston

I will be attending the Cutter Consortium Summit in Boston next month. If you are attending, or in the area, and would like to meet up, let me know.

Thursday, September 16, 2010

Pssst...there is no such thing as an authentic/original/official/master electronic legal text

I know of no aspect of legal informatics that is plagued with more terminology problems than the question of authentic/original/official/master versions of legal material such as bills, statute and caselaw. In this attempt at addressing some of the confusion, I run the risk of adding more confusion, but here goes...

1 - The sending fallacy
What would it mean for the Library of Congress to send their Gutenberg Bible to me? Well, they would put it in a box and ship it to me. Afterwards, I would have that instance of the Gutenberg Bible and they would not have it. The total number of instances of the Gutenberg Bible in the world would remain the same. The instance count chez McGrath would increment and the instance count chez LOC would decrement.

If they were to electronically send it to me, there would be no "sending" going on at all. Instead, a large series of replications would happen - from storage medium to RAM to network buffers to routers... culminating in the persistent storage of a brand new thing in the world, namely, my local replica of the bit-stream that the Library of Congress sent (replicated) to me. The instance count chez McGrath would increment and the instance count chez LOC would remain unchanged. I would have mine but they would still have theirs.

Sadly, the word "send" is used when we really mean "replicate", and this is the source of untold confusion, as it leads us to map physical-world concepts onto electronic-world concepts where there is an imperfect fit... Have you ever sent an e-mail? I mean really "sent" an e-mail? Nope.

2 - The signing fallacy
An example of that imperfect fit is the concept of "signing". What would it mean for the Library of Congress to sign their physical copy of the Gutenberg Bible? They could put ink on a page or maybe imprint a page with an official embossing seal or some such. The nature of physical media makes it relatively easy to make the signing tamper-evident and hard to counterfeit.

What would it mean for the Library of Congress to sign their electronic replica of the Gutenberg Bible with PKI and replicate it (see point 1 above) to me? Well, it's really very, very different from a physical signing.

It is just more bits. Every replica contains a completely perfect copy of the original "signature". There is no "original" to compare it to. The best you can do is check for "sameness" and check the origin of the replica, but doing these checks rapidly becomes a complex web of hashes and certificates and revocations and trusted third parties and... lots of stuff that is not required for physical-world signatures.
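A minimal sketch of the "sameness" check mentioned above: comparing cryptographic digests of two replicas. Note what it does and does not establish – the bytes match, and nothing more. It cannot tell you which replica is "the original", because in the electronic world there is no original.

```python
import hashlib


def same_bits(replica_a, replica_b):
    """Check bit-stream "sameness" of two replicas by comparing digests.

    A passing check means the bytes are identical. It says nothing
    about which replica is "the original" -- there is no original --
    and nothing about who produced the bytes.
    """
    return (hashlib.sha256(replica_a).hexdigest()
            == hashlib.sha256(replica_b).hexdigest())


original = b"In the beginning..."
replica = bytes(original)  # a perfect copy is indistinguishable from its source
```

Everything beyond this – origin, trust, revocation – is where the web of certificates and third parties comes in, and none of it has a physical-world signing counterpart.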

3 - The semantics fallacy
What does it mean for me to render a page of my replica of the Gutenberg Bible on my computer screen? Am I guaranteed to be seeing the "same" thing you see when you do something similar? Does it matter whether the file is a TIFF or a Microsoft Word file? Does it matter what operating system I am using, or what my current printer is, or my screen resolution? Do any of these differences amount to anything when it comes to the true meaning of the page?

The unfortunate fact - as discussed earlier as part of the KLISS series - is that the semantics of information is sometimes a complex mix of the bits themselves and the rendering created from those bits by software.

Sometimes - for sure - the different renderings have no impact on meaning, but it is fiendishly difficult to find consensus on where the dividing line is. Moreover, the signing fallacy (see above) adds to the problem by insisting that a document that passes the signing checks is "the same" as the replica it was replicated from. No account is taken of the fact that a perfect replica at the bit-stream level may look completely different to me, depending on what software I use to render it and the operating context of the rendering operation.

Semantics in digital information is a complex function of the data bits, the algorithms used to process the bits, and the operating context in which the algorithms act on the bits. Consequently, the question "are these replicas 'the same'?" is not simple to answer...
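The bits-plus-algorithm point can be demonstrated in two lines. Below, one fixed byte sequence is processed by two different (and individually perfectly correct) decoding algorithms, and the reader sees two different texts. Character encodings are a deliberately tiny stand-in for the much bigger rendering differences between word processors, printers, and operating systems.

```python
# One fixed bit-stream: the UTF-8 encoding of a pound-sterling amount.
payload = "£500 fine".encode("utf-8")

# Two correct algorithms, two different "meanings" from the same bits:
as_utf8 = payload.decode("utf-8")     # the text the author intended
as_latin1 = payload.decode("latin-1") # same bytes read as Latin-1: leading mojibake
```

Both decoders pass every bit-level "sameness" check against the same payload, yet a reader of `as_latin1` sees a different document than a reader of `as_utf8`. That gap between "same bits" and "same meaning" is exactly the semantics fallacy.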

4 - The either/or fallacy
...When someone asks me, as they sometimes do - and I quote - "How do I know that you sent me the original, authentic document?". I answer that it all depends on what you mean by the words "sent", "original", "authentic" and "document" :-)

Part of the problem is that fake/real and same/different are very binary terms. In the physical world, this is not a huge problem. What are the chances that the Gutenberg Bible in the Library of Congress is a fake? I would argue that it is non-zero but extremely small. The same goes for every dollar bill, every passport, every driver's licence on the planet.

In the physical world, we can reduce the residual risk of fakes very effectively. In the electronic world, it is much, much harder. How do I know that the replica of the Gutenberg Bible on my computer is not a fake? When you consider points 1, 2 and 3 above, I think you will see that it is not an easy question to answer...

What to do?

...It all looks quite complicated! Is there a sane way through this? Well, there had better be because, at least in the legal world, we seem to be heading rapidly into a situation where electronic texts of various forms are considered authentic/original/official/masters etc.

I personally believe that there are effective, pragmatic and inexpensive approaches that will work well, but we need to get out from under the terrible weight of unsuitable and downright misleading terminology we have foisted upon ourselves by stretching real world analogies way past their breaking points.

If I had my way, "hashing" and "signing" would be utterly distinct. The term "non-repudiation" would be banned from all discourse. I would love to see all the technology around PKI re-factored to completely separate encryption concerns from counterfeit-detection concerns. The two right now share some of the same tools/techniques, and the amount of confusion this causes is striking. I have lost count of the number of times I have encountered encryption as a proposed solution for counterfeit detection.
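The separation argued for above can be sketched with standard-library primitives. Integrity ("are these the bytes I think they are?") needs only a hash; origin ("did they come from who I think?") needs a keyed construction. An HMAC with a shared key is used here as the simplest stand-in for a full PKI signature, purely to keep the two questions visibly distinct – and note that encryption appears nowhere, because counterfeit detection does not require it.

```python
import hashlib
import hmac


def fingerprint(data):
    """Integrity question: are these the bytes I think they are?

    Anyone can compute this; it carries no information about origin.
    """
    return hashlib.sha256(data).hexdigest()


def origin_tag(data, key):
    """Origin question: did these bytes come from a holder of the key?

    A PKI signature answers the same question with asymmetric keys;
    HMAC with a shared key is the minimal stand-in used for this sketch.
    """
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def origin_ok(data, key, tag):
    """Verify an origin tag in constant time."""
    return hmac.compare_digest(origin_tag(data, key), tag)
```

Keeping `fingerprint` and `origin_tag` as separate operations, with separate names, is the whole point: conflating them under one word like "signing" is where the confusion starts.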

As time permits over the next while, I will be blogging more about this area and putting forward some proposed approaches for use in electronic legal publishing. I will also be talking about approaches that are applicable to machine-readable data such as XML as well as to frozen renderings such as PDF – a distinction that is very important in the context of the data.gov/law.gov movements.

I expect pushback because I will be suggesting that we need to re-think the role of PKI and digital signatures and get past the dubious assertion that this stuff is necessarily complicated and expensive.

I truly believe that neither of these is true, but it will take more time than I currently have to explain what I have in mind. Soon, hopefully...

Friday, September 10, 2010

Sustainable data.gov initiatives in 4 easy steps

(Note: this post is largely directed at government agencies, and at businesses working with government agencies, on data.gov projects.)

At the recent Gov 2.0 Summit, Ellen Miller expressed the concern that the data transparency initiative of the Obama administration has stalled. Wandering the halls at the conference, I heard some assenting voices: concerns that there is more style than substance; concerns about the number of data sets, the accuracy of the data, the freshness of the data and so on.

Having said that, I heard significantly more positives than negatives about the entire data.gov project. The enthusiasm was often palpable over the two day event. The vibe I got from most folks there was that this is a journey, not a destination. Openness and transparency are the result of an on-going process, not a one-off act.

These folks know that you cannot simply put up a data dump in some machine-readable format and call your openness project "done". At least at the level of the CIOs and CTOs, my belief is that there is a widespread appreciation that there is more to it than that. It is just not that simple. It will take time. It will take work, but a good start is half the work, and that is what we have right now, in my opinion: a good start.

I have been involved in a number of open data initiatives over the years in a variety of countries. I have seen everything from runaway successes to abject failures and everything in between. In this post, I would like to focus on 4 areas that my experiences lead me to believe are critical to convert a good start into a great success story in open data projects.

1 - Put some of your own business processes downstream of your own data feeds


The father of lateral thinking, Edward de Bono, was once asked to advise on how best to ensure a factory did not pollute a river. De Bono's solution was brilliantly simple: ensure that the factory takes its clean water *downstream* of its own discharge point. This simple device put the factory owners on the receiving end of whatever they were outputting into the river. The application of this concept to data.gov projects is very simple: to ensure that your organization remains focused on the quality of the data it is pushing out, make sure that its own internal systems consume it.

That simple feedback loop will likely have a very positive impact on data quality and on data timeliness.

2 - Break out of the paper-oriented publishing mindset


For much of the lifetime of most government agencies, paper has been the primary means of data dissemination. Producing a paper publication is expensive. Fixing a mistake after 100,000 copies have been printed is very expensive. Distribution is time consuming and expensive...

This has resulted – quite understandably – in a deeply ingrained "get it right first time" publishing mentality. The unavoidable by-product of that mindset is latency. You check, you double check, then you check again...all the while the information itself is sliding further and further from freshness. Data that is absolutely perfect - but 6 months too late to be useful - just doesn't cut it in the Internet age of instantaneous publishing.

I am not for a minute suggesting that the solution is to push out bad data. I am, however, suggesting that the perfect is the enemy of the good here. Publish your data as soon as it is in reasonable shape. Timestamp your data into "builds" so that your customers know what date/time they are looking at with respect to data quality. Leave the previous builds online so that your customers can find out for themselves what has changed from release to release. Boldly announce on your website that the data is subject to ongoing improvement and correction. Create a quality statement. When errors are found – by you or by your consumers – they can be fixed very quickly and at very little cost. This is what makes the electronic medium utterly different from the paper medium. Actively welcome data fixes. Perhaps provide bug bounties in the same way that Don Knuth does for his books. Harness Linus Torvalds's maxim that "given enough eyeballs, all bugs are shallow" to shake bugs out of your data. If you have implemented point 1 above and you are downstream of your own data feeds, you will benefit too!
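The "builds" idea above is cheap to implement. Here is one hypothetical shape for it: each release carries a timestamp and a content hash, and keeping old builds around lets anyone compute what changed between releases. The record structure (a flat key-to-value dict) is invented for illustration; real datasets would obviously be richer.

```python
import hashlib
import json


def make_build(records, timestamp):
    """Stamp a data release as a "build": content hash plus timestamp.

    Consumers can cite exactly which build they used, and the hash lets
    anyone confirm they hold the same release.
    """
    body = json.dumps(records, sort_keys=True)
    return {
        "timestamp": timestamp,
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "records": records,
    }


def changed_keys(old_build, new_build):
    """Report which record keys were added, removed, or modified
    between two builds -- the release-to-release diff consumers can
    compute for themselves when old builds stay online."""
    old, new = old_build["records"], new_build["records"]
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }
```

Publishing imperfect-but-timestamped builds, plus a diff anyone can run, converts "get it right first time" latency into an open, continuous correction loop.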

3 - Make sure you understand the value exchange


Whenever A is doing something that will benefit B but A is taking on the costs, there must be a value exchange for the arrangement to be sustainable. Value exchange comes in many forms:
- An entity A may provide data because it has been mandated to do so. The value exchange here is that the powers-that-be will smile upon entity A.
- An entity A may provide data out of a sense of civic duty. The value exchange here is that A actively wants to do it and receives gratification – internally or from peers - from the activity.
- An entity A may provide data because entity B will return the favor.
- And so on.

One of the great challenges of the public sector all over the world is that inter-agency data exchanges tend to put costs and benefits into different silos of money. If agency A has data that agency B wants, why would agency A spend resources/money doing something that will benefit B? The private sector often has similar value exchange problems that get addressed through internal cross-billing. i.e. entity A sees value in providing data to B because B will "pay" for it, in the internal economy.

If that sort of cross-billing is not practical – and in many public sector environments, it is not – there are a number of alternatives. One is reciprocal point-to-point value exchange. i.e. A does work to provide data to B, but in return B does work to provide data that A wants. Another – and more powerful model in my opinion – is a data pool model. Instead of creating bi-lateral data exchange agreements, all agencies contribute to a "pool" of data in a sort of "give a penny, take a penny" basis. i.e. feel free to take data but be prepared to be asked by the other members of the pool, to provide data too.

In scenarios where citizens or the private sector are the consumers of data, the value exchange is more complex to compute as it involves less tangible concepts like customer satisfaction. Having said that, the Web is a wonderful medium for forming feedback loops. Unlike in the paper world, agencies can cheaply and easily get good intelligence about their data from, for example, an electronic "thumbs up/down" voting system.

The bottom line is that value exchanges come in all shapes and sizes, but I believe a data.gov project must know what its value exchange is if it is going to be sustainable.

4 - Understand the Government to Citizen dividend that comes from good agency-to-agency data exchange


In the last point, I have purposely emphasized the agency-to-agency side of data.gov projects. Some may find that odd. Surely the emphasis of data.gov projects should be openness and transparency and service to citizens?

I could not agree more, but I believe that the best way to service citizens and businesses alike, is to make sure that agency-to-agency data exchange functions effectively too.

Think of it this way: how many forms have you filled in with information you previously provided to some other agency? We all know the "we have a form for that" phenomenon needs an answer, but I believe the right answer is oftentimes not "we have an app for that" but rather "there is no app, and no form, for that data because it is no longer necessary for you to send it to us at all".

Remember: The best Government form is the form that disappears in a puff of logic caused by good agency-to-agency data integration.

In summary


1 - Put yourself downstream of your own data.gov initiative
2 - Break out of the paper-oriented "it must be perfect" mindset
3 - Make sure you understand the value exchanges. If you cannot identify one that makes sense, the initiative will most likely founder at some point, and probably sooner than you imagine
4 – When government agencies get their agency-to-agency data exchange house in order, better government-to-citizen and government-to-business data exchange is the result.

Thursday, September 09, 2010

The Web of Data and the Strata conference

I'm a grizzled document-oriented guy, but I'm not blind to the amazing potential of numerical data on the Web. I do not think it is an exaggeration to say that in years to come, data volumes on the web-of-data will rank, in order of size: multimedia data, then numerical data, then text. Text will bring up the rear, a distant third behind numerical data, which in turn will be some distance behind multimedia data.

That is what I think the volume graph will look like but in terms of business value, I suspect a very different graph will emerge: numerical data - text data - multi-media data. In that order.

In blunt, simple terms, there is serious money in numbers and number crunching. As more and more numerical data becomes available on the web and is joined by telemetry systems (e.g. smart-grid) generating vast new stores of numerical data we are going to see an explosion of new applications. I had the good fortune to be involved in the early days of Timetric and they have now been joined by a slew of companies working on innovative new applications in this space.

At the Gov 2.0 conference that has just ended, I had the opportunity to talk to my fellow survivor of the gestation of XML, Edd Dumbill of O'Reilly who is involved in the Strata conference. Edd really gets it and I look forward to seeing what he pulls together for the Strata conference. Exciting times.