Thursday, June 24, 2010

KLISS: The Eventing Model and the Consistency Model

Last time in this KLISS series (and the time before that), I concentrated on the concept of names for information assets. This, seemingly peripheral concern is, in my view, critical in legislative informatics. I talked about how a well thought-out set of names, sitting on top of a "time machine" oriented persistence substrate, helps dramatically to meet many needs in legislatures/parliaments including rigorous citation and rigorous transparency audit-trail. Happily, a name-oriented focus sits very nicely on top of the world wide web architecture and, in particular, sits nicely with RESTian system architectures. (If REST is new to you, you might be interested in starting with this article I wrote some years ago and the resources referenced at the end.)

In this installment on KLISS, I want to turn to the closely related concept of events and how it fits in with the time machine model in the KLISS architecture. When I say KLISS is a time machine I do not mean that KLISS sits there, recording what is happening in real time 24x7. The reason being, that for long tracts of time, nothing actually happens because nothing is going on inside the "black box" that is the legislature/parliament. When I say nothing is going on, I mean that nobody is doing anything. There are no actors acting inside the black box. Therefore, there is nothing to record into the time machine. We call this a quiescent system state. We are like Beckett, waiting for Godot...

Now, as soon as somebody *acts*, the time machine persistence layer captures the act itself as a transaction against the time machine. The act could be the introduction of a bill, the explanation of a vote, a point of personal priviledge to be recorded in the journal, an update to the statute, a referral of a bill to a conference committee etc. Such acts rarely – if ever – stand alone. Picture an atom smasher. One "event" comes in and bang, many secondary events are triggered. These secondary events may trigger tertiary events and so on. Eventually, if there are no new primary events, the system quiesceses again.

To ground what follows in a practical scenario, consider what happens when a new bill is introduced. Here is a representative series of events that might occur...

  • A member in a chamber (the sponsor) gets permission to speak through the chair and announces the bill.
  • The relevant chamber clerk "calls" for the bill from legislative council/revisors office/bills office.
  • The event is recorded in the journal.
  • The new bill is allocated a new identifier and added to bill status.
  • Prints of the bill are called for, for each Member.
  • PDF (and possibly HTML) version are created and posted.

I am greatly generalizing and simplifying here, but I hope you can see how one event leads to a set of secondary events and how each of those secondary events may themselves produce more events.

At an IT architecture level, two main questions arise. Firstly, how do we arrange that all the interested entities get informed of the existence of a new event? Secondly, how do we arrange that the "views" of the state of the legislature are kept consistent across the various information assets that record the events? i.e. the the bill status system, the pdf of the bill, the HTML of the bill, the pdf of the journal. KLISS achieves both using an asynchronous XML-based messaging backbone. Every time the time machine is changed – an event notification is sent out and all interested sub-systems have the chance to act as they wish. Any acts taken can themselves trigger *further events* perhaps involving further transactions against the time machine.

This takes place asynchronously. That is very important and I'd like to explain it as it is critical to the model. Classical "database-think" operates on ACID principles. This is problematic in legislatures/parliaments (in fact, I believe it is problematic in most document-centric domains) because to achieve overall information "consistency" I need to update bill status pages, generate PDFs of journals, convert Bills to HTML and post them on websites, update the search indexes, push out the twitter updates etc. There is simply too much to do for me to be able to lock the entire repository, update everything and then free the lock. Even if I could, it would create significant temporal coupling between sub-systems. Temporal coupling is, in general, bad news. What if one of my sub-systems (say PDF generation) is running slow because of load or maybe offline because of a fault? I cannot afford to wait around for it to become available. I cannot fail the transaction simply because some sub-system is not in a consistent state with respect to the rest of the sub-systems. What to do?

Remember when I talked about the time-machine repository and the fact that each change – each transaction – has a unique revision number. Remember how I mentioned that the URIs to retrieve content from the repository include the revision numbers? Well, every event sent out by KLISS includes the revision number of the transaction. That way, sub-systems that receive the event can look at the repository as it looked at that revision number. i.e. at the timestamp when the revision occurred. Think of an Automatic Teller Machine. You put in your card and ask for a balance on an account. Does the machine tell you the balance as it is *right now*. No. It tells you the balance as it stood the moment the query hit the ledgers. One millisecond later, a million dollars might have hit your account. Does that make the printout you got from the ATM incorrect? No because the printout stipulates the timestamp that the ledger query happened. Maybe it was two milliseconds ago.Maybe it was two years ago. It does not matter. The printout is correct because it is locked to a point in time by the timestamp printed on it.

KLISS works the same way, all bills, all journals, all bill statute pages, all aggregated publications, all hyperlinks...encode point-in-time information. All views of the time-machine basically say "I was run when the time-machine repository was at revision 1234. Everything you see on this page, is correct as of revision 1234..."

This is critical because it removes a whole slew of otherwise very thorny problems. For example, what happens if I generate a page that tells me what bills are in committee X by looking into the time machine folder where the bills are stored. What if, 1 millisecond later, somebody moves the bill out of that committee? It doesn't matter because the first thing we do when generating any view of the repository is to find out what the current revision number is. Lets say it is revision 1234. All subsequent queries against the repository pass that revision number in. The view itself then displays a footer saying "Correct as of revision 1234 15:43, 2010010".

This model has a variety of names. Some call it idempotency and that is certainly part of it. i.e. given a URI with a revision number, KLISS will always, always, always return the same stream of bytes. It will never change. It is a classic candidate for a GET operation that has no side-effects on the information corpus. I prefer to use Werner Vogel's term "Eventually Consistent" to describe the model. KLISS allows individual sub-systems to update their views of the repository at their own speed. If all the events quiesce and all sub-systems are operational, then the complete vista of "views" over the repository contained in all the sub-systems will, eventually also quiesce and be consistent with each other. During normal operation, it is to be expected that some sub-systems will be updated later than others but their views are never wrong – they are simply reflective of an older time-point. As well as Amazon's Werner Vogels, the writings of Pat Helland of Microsoft on this subject are worth reading. Bottom line. Time is relative. You cannot really lock it down. Certainly at web-scale, distributed, federated systems there is no alternative but to embrace the relativity of time and work with it rather than fight against it. That is what KLISS does.

One final point on temporal decoupling before I wrap up...KLISS uses both fire-and-forget and guaranteed-delivery messaging semantics. In English what that means is that a sub-system that may or may not be online, or may need to run slower than other sub-systems never looses track of where the time machine is at. Messages generated for its attention are queued up and can be drawn-down at as leisurely a pace as required. Sub-systems can be taken down for maintenance and spun back up. When they spin back up any messages that they missed, are sitting there queued up to be consumed whenever. This makes high availability of a system as large as KLISS significantly easier as there are very few reasons why the system would ever need to be off-line. Individual services may go off line but the core of KLISS itself, just keeps on trucking... I think of it as Reed's End-to-End Argument applied at the application level. KLISS puts as little "smart" stuff in the center of the architecture as possible, leaving most of the customer-facing "smart" stuff out at the edges.

By now, I hope you are beginning to see that we do not do content management in KLISS in the classical "static" model of simply storing stuff in a repository-of-the-now. In KLISS

  • All content conforms to an enterprise information model. It is not just documents in folders. KLISS represents system actors and workflows and roles and committees etc. as *documents*.
  • All content is part of the time-machine model. In KLISS moving a bill into a committee or changing the phone number of the speaker Pro Tem is precisely the same repository operation as updating a piece of statute.
  • Any changes to the information model timestamped and communicated via persistent, asynchronous messaging to all sub-systems which can then use that timestamp to lock down time for their own interactions with the repository.

One final point, the event-oriented model in KLISS can be usefully conceptualized in terms of a formalism known as Speech Acts. During analysis phases, I find it very useful to separate my illocutions from my perlocutions as it helps me see where secondary and indeed N-ary event cascades are likely to happen. If the concept of speech acts flips your switch or (aspirates your fricatives), you might be interested in this article on the subject.

Next up: Organizing legislative material in KLISS.

No comments: