Dan Jellinek makes a very important point in Open Data 'Must Add Context'. Context is absolutely king when it comes to interpreting public records, which is why KLISS works the way it does.
Lawrence Lessig has also written about the problem of context in his essay Against Transparency.
We will soon, I hope, get past the problem of data access. OGD, data.gov, law.gov, legislation.gov.uk etc. will see to that.
Then we can move on to addressing the context problem. To do that, we will need to address what is, to my mind, the key missing piece of the Web today: the time dimension of information.
It does not have to be complicated. I would suggest we start with some simple "social contracts" for URIs that contain temporal information. Tim Berners-Lee's Cool URIs don't change has been around for many years now and contains what is, to my mind, the key idea: encoding dates in URIs. For example, this URI signals the time dimension in its elements: http://www.w3.org/1998/12/01/chairs.
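To make the idea concrete, here is a minimal sketch - purely illustrative, not a proposed standard - of a client extracting the time dimension from URIs that follow a simple /YYYY/MM/DD/ convention:

```python
import re
from datetime import date

# Illustrative convention only: a date encoded as /YYYY/MM/DD/ path segments.
DATED_URI = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

def uri_date(uri):
    """Return the date signalled by a URI, or None if it carries none."""
    match = DATED_URI.search(uri)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)

print(uri_date("http://www.w3.org/1998/12/01/chairs"))  # 1998-12-01
```

A social contract as small as that - "if your URI contains /YYYY/MM/DD/, the resource is anchored to that date" - is enough for machines to start reasoning about the time dimension.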
The notion of URIs having structure has been a wee bit controversial (see Axioms) but I think it's a fine idea :-) Jon Udell is worth reading on this point too.
So, where could a few simple agreements about temporal URI patterns get us?
In two words: *timeline mashups*. Today, the majority of mashups are essentially data joins using location as the join point. Imagine a Web in which we can create similar dynamic data expositions based on timelines. That is the prize we win if we can get agreement on encoding the temporal dimension of information.
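As a toy illustration of what I mean - with made-up data standing in for real feeds - a timeline mashup is little more than a join keyed on dates rather than coordinates:

```python
from collections import defaultdict
from datetime import date

# Hypothetical, independently published feeds of (date, event) pairs.
bill_actions = [
    (date(2010, 3, 1), "HB 2001 referred to committee"),
    (date(2010, 3, 4), "HB 2001 reported favorably"),
]
floor_votes = [
    (date(2010, 3, 4), "HB 2001 passed 31-9"),
]

def timeline_mashup(*sources):
    """Join any number of (date, event) feeds on their time dimension."""
    merged = defaultdict(list)
    for source in sources:
        for when, event in source:
            merged[when].append(event)
    return sorted(merged.items())

for when, events in timeline_mashup(bill_actions, floor_votes):
    print(when, "|", "; ".join(events))
```

The hard part is not the join; it is getting independent publishers to expose the dates in an agreed-upon way in the first place.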
Imagine a world in which we can automatically generate beautiful information displays like this, or this, that mash up data from many disparate, independent sources.
Would it be worth the effort? In my opinion, absolutely! It would be a great place to start, yielding huge value for a relatively small effort.
Higher up the effort scale, but still very worthwhile, would be mechanisms for querying with respect to time, e.g. Memento and Temporal RDF.
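For a flavour of the Memento approach, a time-aware client asks a TimeGate for a resource as it existed at a given moment via the Accept-Datetime header. A rough sketch (the TimeGate URL here is just a placeholder):

```python
import urllib.request

# Memento-style datetime negotiation: ask a TimeGate (placeholder URL) for
# the resource as it existed at a particular moment.
req = urllib.request.Request(
    "http://timegate.example.org/http://www.w3.org/",
    headers={"Accept-Datetime": "Tue, 01 Dec 1998 00:00:00 GMT"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Memento-Datetime"))  # when this snapshot was captured
    print(resp.headers.get("Link"))              # rel="original", "timemap", ...
```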
How wonderful would it be if we could then create temporal mashups of temporal mashups? How wonderful would it be if we could create a temporal dimension *on top* of a geo-spatial dimension to create spatial-and-temporal mashups?
As Don Heiman, CITO for the Kansas State Legislature and the visionary behind KLISS likes to say: "be still my heart"...
Tuesday, December 07, 2010
Monday, November 01, 2010
KLISS Slides
We presented KLISS at last week's Cutter Consortium event in Boston. The slides are online here. The slides cover the eDemocracy vision, business strategy, governance, etc., as well as the tech parts.
More to come in the weeks and months ahead. Exciting times!
Monday, October 11, 2010
KLISS: Author/edit sub-systems in legislative environments
Last time in this KLISS series, I talked some more about the KLISS workflow model. The time has come (finally!) to talk about how that workflow model incorporates author/edit on the client side, i.e. the creation or update of legislative artifacts such as bills, statute sections, chamber events, meeting minutes, etc. Earlier in this series, I explained in reasonable detail why the author/edit subsystems cannot be simple little text editors and also why they cannot be classic heavyweight XML editors, so I won't go over that ground again here; I'll just cut to the chase about how KLISS actually does it.
Units of information and micro-documents
To my mind, the most important part of modeling a document-centric world such as a legislature for author/edit is deciding where the boundaries between information objects lie. After all, at some level, some information object needs to be edited. We will need all the classic CRUD functions for these objects, so we need to pick them carefully. When I look at a corpus of legal information I see a fractal in which the concept of "document" exhibits classic self-similarity. Is a journal a document? How about a title of statute? Or a bill? How about the Uniform Commercial Code? Is that a document?
Pretty much any document in a legislature can reasonably be thought of as an aggregation of smaller documents. A journal is an iteration of smaller chamber event documents. A title of statute is an iteration of statute sections. A volume of session laws is an iteration of acts...and so on.
This creates an interesting example of a so-called banana problem: how do you know when to stop decomposing a document into smaller pieces?
My rule of thumb is to stop decomposing when the information objects created by the decomposition cease to be very useful in stand-alone form. Sections of statute are useful stand-alone. The second half of a vote record less so. Bills are useful standalone. The enacting clause less so.
The good news is that when you do this information decomposition, the information objects that require direct author/edit support get smaller and less numerous. They get smaller because you do not need an editor for titles of statute: a title is what you get after you aggregate lots of smaller documents together. Don't edit the aggregate. Edit the atoms. They get less numerous because the decomposition exposes many shared information objects. For example, a bill amendment may be a document used in the chamber, but it also appears in the journal. Referring a bill to a committee will result in a paragraph in the journal but will also result in an entry in the bill status application...and so on.
In KLISS we generally edit units of information – not aggregates. We have a component that knows how to join together any number of atoms to create aggregates. Moreover, aggregates can be posted into the KLISS time machine, where they become atoms subject to further aggregation. A good example is a chamber event document that gets aggregated into a journal; the resultant journals are themselves aggregated into a session publication known as the permanent journal.
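As a toy sketch of that aggregation idea - with plain (name, text) tuples standing in for the ODT assets that actually live in the repository:

```python
# Toy sketch only: (name, text) tuples stand in for the ODT assets that
# actually live in the KLISS repository.
def aggregate(name, atoms):
    """Join any number of atoms into a new aggregate document."""
    return (name, "\n".join(text for _, text in atoms))

chamber_events = [
    ("event-001", "HB 2001 introduced and referred to Judiciary."),
    ("event-002", "SB 14 passed on final action, 31-9."),
]
daily_journal = aggregate("journal-2010-03-04", chamber_events)

# Once stored, an aggregate is itself an atom available for further
# aggregation, e.g. daily journals rolled up into the permanent journal.
permanent_journal = aggregate("permanent-journal-2010", [daily_journal])
print(permanent_journal[1])
```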
Semantics and Micro-formats
KLISS makes extensive use of ODF for units of information in the asset repository. We encode metadata as property-value pairs inside the ODF container. We also leverage paragraph and character style names for encoding "block" and "inline" semantics. As discussed previously, line and page numbers are often critically important to the workflows and we embed these inside the ODF markup too.
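Because ODF containers are just ZIP files with well-known XML parts, that metadata is trivially machine-readable. Here is a small sketch of pulling the property-value pairs out of an ODT file (the file name is hypothetical):

```python
import zipfile
import xml.etree.ElementTree as ET

META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"

def user_defined_metadata(odt_path):
    """Read the property-value pairs stored in an ODF container's meta.xml."""
    with zipfile.ZipFile(odt_path) as odt:
        meta = ET.fromstring(odt.read("meta.xml"))
    return {
        prop.get("{%s}name" % META_NS): prop.text
        for prop in meta.iter("{%s}user-defined" % META_NS)
    }

print(user_defined_metadata("bill_hb2001.odt"))  # hypothetical asset name
```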
The thin client author/edit approach
Some of our units of information are sufficiently small and sufficiently metadata-oriented that we can "author" them using Web-based forms. In other words, the asset will be stored in the time machine as an ODT document, but the user's author/edit experience of it will be a web form. We make extensive use of Django for these.
This is particularly true on the committee/chamber activity side of a legislature, where a discrete set of event types makes up 80% of the asset traffic and the user interface can be made point-and-click with little typing.
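A sketch of what one of those thin-client forms might look like in Django - the event types and field names here are invented for illustration, not the actual KLISS forms:

```python
from django import forms

# Invented event types and fields, for illustration only.
CHAMBER_ACTIONS = [
    ("introduced", "Introduced"),
    ("referred", "Referred to committee"),
    ("final_action", "Passed on final action"),
]

class ChamberEventForm(forms.Form):
    bill_number = forms.CharField(max_length=20)
    action = forms.ChoiceField(choices=CHAMBER_ACTIONS)
    event_date = forms.DateField()
    notes = forms.CharField(widget=forms.Textarea, required=False)
```

On submission, the cleaned form data gets serialized into an ODT asset and stored in the time machine like any other unit of information.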
The thick client author/edit approach
Some of our units of information are classic word-processor candidates, e.g. a 1,500-page bill, a 25-page statute section consisting of a single landscape table with running headers, a six-level-deep TOC with tab leaders and negative first-line indents... For these we use a thick client application created using NetBeans RCP which embeds OpenOffice. We make extensive use of the UNO API to automate OpenOffice activities. The RCP container also handles identity management and role-based access control, and acts as a launchpad for mini-applications – created in Java and/or Jython – that further extend our automation and customization capabilities on the client side.
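For a rough flavour of UNO automation: the real KLISS client drives an embedded OpenOffice from Java/Jython inside the RCP container, whereas this standalone PyUNO sketch connects to an office listening on a socket, and the document path is made up:

```python
# Requires the Python that ships with OpenOffice (which provides the uno
# module), and an office instance started with something like:
#   soffice -accept="socket,host=localhost,port=2002;urp;"
import uno

local_ctx = uno.getComponentContext()
resolver = local_ctx.ServiceManager.createInstanceWithContext(
    "com.sun.star.bridge.UnoUrlResolver", local_ctx)
ctx = resolver.resolve(
    "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
desktop = ctx.ServiceManager.createInstanceWithContext(
    "com.sun.star.frame.Desktop", ctx)

# Load a (hypothetical) bill document, report what was opened, then close it.
doc = desktop.loadComponentFromURL(
    "file:///assets/bill_hb2001.odt", "_blank", 0, ())
print(doc.getURL())
doc.close(False)
```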
RESTian time-machine interface
Although we tend to speak of two clients for author/edit in KLISS – the thick client and the thin client – in truth the set of clients is open-ended because all interaction with the KLISS time machine is via the same RESTian interface. In fact, the KLISS server side does not know what was used to author/edit any document. This gives us an important degree of separation between the author/edit subsystem and the rest of the system. History has shown that the most volatile part of any software application is the part facing the user. We need to know that we can evolve and create brand-new author/edit environments without impacting the rest of the KLISS ecosystem.
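To illustrate the point - with an entirely hypothetical host, asset path and request shape - any client that can speak plain HTTP can participate, regardless of what produced the ODT bytes:

```python
import urllib.request

# Hypothetical repository host and asset path, for illustration only.
BASE = "https://kliss.example.org/repository"

def put_asset(asset_path, odt_bytes):
    """Store a new version of an asset; the server neither knows nor cares
    which editor produced the bytes."""
    req = urllib.request.Request(
        BASE + asset_path,
        data=odt_bytes,
        method="PUT",
        headers={"Content-Type": "application/vnd.oasis.opendocument.text"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

with open("bill_hb2001.odt", "rb") as f:
    print(put_asset("/bills/2010/hb2001.odt", f.read()))
```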
Why ODT?
ODT was chosen because it presented the best trade-off when all the competing requirements of an author/edit subsystem for legislatures were analyzed. The issues that were on the table when this selection was made are listed here. To that list I would also add that ODT is, to my knowledge, free of IP encumbrances. Also, the interplay between content and presentation is so important in this domain that it is vital to have free and unfettered access to the rendering algorithms in order to feel fully in possession of the semantics of the documents. I'm not saying that a large corpus of C++ is readily understandable at a deep level, but I take great comfort in knowing that I can, in principle, know everything there is to know about how my system has rendered the law by inspecting the rendering algorithms in OpenOffice.
Next up, long term preservation and authentication of legal materials in KLISS.
Thursday, October 07, 2010
ActiveMQ guru required
Looking for an ActiveMQ guru to work a 6-month contract out of Lawrence, KS. Possibility of a full-time post after that for the right person.
Thursday, September 30, 2010
Cutter Consortium Summit, Boston
I will be attending the Cutter Consortium Summit in Boston next month. If you are attending, or in the area, and would like to meet up, let me know.
Friday, September 24, 2010
Thursday, September 16, 2010
Pssst...there is no such thing as an authentic/original/official/master electronic legal text
I know of no aspect of legal informatics that is plagued with more terminology problems than the question of authentic/original/official/master versions of legal material such as bills, statute and caselaw. In this attempt at addressing some of the confusion, I run the risk of adding more confusion, but here goes...
1 - The sending fallacy
What would it mean for the Library of Congress to send their Gutenberg Bible to me? Well, they would put it in a box and ship it to me. Afterwards, I would have that instance of the Gutenberg Bible and they would not have it. The total number of instances of the Gutenberg Bible in the world would remain the same. The instance count chez McGrath would increment and the instance count chez LOC would decrement.
If they were to electronically send it to me, there would be no "sending" going on at all. Instead, a long series of replications would happen - from storage medium to RAM to network buffers to routers - culminating in the persistent storage of a brand new thing in the world, namely, my local replica of the bit-stream that the Library of Congress sent (replicated) to me. The instance count chez McGrath would increment and the instance count chez LOC would remain unchanged. I would have mine but they would still have theirs.
Sadly, the word "send" is used when we really mean "replicate", and this is the source of untold confusion because it leads us to map physical-world concepts onto electronic-world concepts where there is an imperfect fit. Have you ever sent an e-mail? I mean really "sent" an e-mail? Nope.
2 - The signing fallacy
An example of that imperfect fit is the concept of "signing". What would it mean for the Library of Congress to sign their physical copy of the Gutenberg Bible? They could put ink on a page or maybe imprint a page with an official embossing seal or some such. The nature of physical media makes it relatively easy to make the signing tamper-evident and hard to counterfeit.
What would it mean for the Library of Congress to sign their electronic replica of the Gutenberg Bible with PKI and replicate it (see point 1 above) to me? Well, it's really very, very different from a physical signing.
It is just more bits. Every replica carries a completely perfect copy of the original "signature". There is no "original" to compare it to. The best you can do is check for "sameness" and check the origin of the replica, but doing these checks rapidly becomes a complex web of hashes and certificates and revocations and trusted third parties and...lots of stuff that is not required for physical-world signatures.
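A tiny sketch of the point about "sameness" - the file names are hypothetical - all we can ever check is that two bit-streams match; nothing in the bits distinguishes an "original" from a replica:

```python
import hashlib

# Hypothetical file names: two local replicas of the same signed document.
loc_replica = open("gutenberg_page.pdf", "rb").read()
my_replica = open("gutenberg_page_copy.pdf", "rb").read()

print(hashlib.sha256(loc_replica).hexdigest())
print(hashlib.sha256(my_replica).hexdigest())

# True means "the same bits", not "the original" - there is no original.
print(loc_replica == my_replica)
```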
3 - The semantics fallacy
What does it mean for me to render a page of my replica of the Gutenberg Bible on my computer screen? Am I guaranteed to be seeing the "same" thing you see when you do something similar? Does it matter if the file is a TIFF or a Microsoft Word file? Does it matter what operating system I am using, or what my current printer is, or my screen resolution? Do any of these differences amount to anything when it comes to the true meaning of the page?
The unfortunate fact - as discussed earlier as part of the KLISS series - is that the semantics of information is sometimes a complex mix of the bits themselves and the rendering created from those bits by software.
Sometimes - for sure - the different renderings have no impact on meaning, but it is fiendishly difficult to find consensus on where the dividing line is. Moreover, the signing fallacy (see above) adds to the problem by insisting that a document that passes the signing checks is "the same" as the replica it was replicated from. No account is taken of the fact that a perfect replica at the bit-stream level may look completely different to me, depending on what software I use to render it and the operating context of the rendering operation.
Semantics in digital information is a complex function of the data bits, the algorithms used to process the bits, and the operating context in which the algorithms act on the bits. Consequently, the question "are these replicas 'the same'?" is not simple to answer...
4 - The either/or fallacy
...When someone asks me, as they sometimes do - and I quote - "How do I know that you sent me the original, authentic document?", I answer that it all depends on what you mean by the words "sent", "original", "authentic" and "document" :-)
Part of the problem is that fake/real and same/different are very binary terms. In the physical world, this is not a huge problem. What are the chances that the Gutenberg Bible in the Library of Congress is a fake? I would argue that it is non-zero but extremely small. The same goes for every dollar note, every passport, every driver's licence on the planet.
In the physical world, we can reduce the residual risk of fakes very effectively. In the electronic world, it is much, much harder. How do I know that the replica of the Gutenberg Bible on my computer is not a fake? When you consider points 1, 2 and 3 above, I think you will see that it is not an easy question to answer...
What to do?
...It all looks quite complicated! Is there a sane way through this? Well, there had better be because, at least in the legal world, we seem to be heading rapidly into a situation where electronic texts of various forms are considered authentic/original/official/masters etc.
I personally believe that there are effective, pragmatic and inexpensive approaches that will work well, but we need to get out from under the terrible weight of unsuitable and downright misleading terminology we have foisted upon ourselves by stretching real world analogies way past their breaking points.
If I had my way, "hashing" and "signing" would be utterly distinct. The term "non-repudiation" would be banned from all discourse. I would love to see all the technology around PKI re-factored to completely separate encryption concerns from counterfeit-detection concerns. The two currently share some of the same tools/techniques, and the amount of confusion this causes is striking. I have lost count of the number of times I have encountered encryption as a proposed solution for counterfeit detection.
As time permits over the next while, I will be blogging more about this area and putting forward some proposed approaches for use in electronic legal publishing. I will also be talking about approaches that are applicable to machine-readable data such as XML as well as frozen renderings such as PDF - something that is very important in the context of the data.gov/law.gov movements.
I expect pushback because I will be suggesting that we need to re-think the role of PKI and digital signatures and get past the dubious assertion that this stuff is necessarily complicated and expensive.
I truly believe that neither of these is true, but it will take more time than I currently have to explain what I have in mind. Soon, hopefully...