Sean McGrath: 06/20/2010

Saturday, June 26, 2010

KLISS: Organizing legislative material in legislatures/parliaments

Last time in this KLISS series, I talked about the event model in KLISS. I also talked about how it works in concert with the "time machine" model to achieve information consistency in all the "views" of legislative information required for a functioning legislature/parliament. For example, a bill status view, a journal view, a calendar view, an amendment list view, a committee view etc...

I am using the word "view" here in a somewhat unusual way, so I would like today to explain what I mean by it. Doing that will help set the scene for an explanation of how legislative/parliamentary assets are organized in the KLISS repository and how metadata-based search/retrieval over the repository works.

It goes without saying (but I need to say it in order to communicate that it need not be said (ain't language wonderful?)), that legislatures/parliaments produce and consume vast amounts of information, mostly in document form. What is the purpose of the documents? What are they for really? In my view, they serve as snapshot containers for the fundamental business process of legislatures/parliaments, which is the making of law. In other words, a document in a legislature is a business process, snapshotted, frozen at a point in time.

By now, if you have been reading along in this KLISS series, you will know that it is very much a document-centric architecture. The documents themselves, in all their presentation-entangled, semi-structured glory, are treated as the primary content. We create folders, and folders inside folders. We create documents with headings and headings inside headings and we put these into folders. We then blur the distinction between folder navigation (inter-document) and heading "outline" navigation (intra-document) so that the whole corpus can be conceptualized as a single hierarchical information store. The entire state of a legislature/parliament is, in KLISS, *itself* a document – albeit a very large one! Simply put, KLISS does not care about the distinction between a folder and a heading. They are both simply hierarchical container constructs.

In KLISS a "view" is simply a time-based snapshot generated from the enormous document that is the repository, seen at a point in time, in some required format. So, a PDF of a bill is such a snapshot view. So too is the HTML page of a committee report, a journal, a corpus of promulgated law etc. HTML, PDF, CSV, they are all the same in the KLISS information model. They are just views, taken at a point in time, out of the corpus as a whole.

Earlier in this series I talked about how the web blurs the distinction between naming something to pick it out and performing a query to pick it out. KLISS takes advantage of that blurring in the creation of views. So much so that a consumer of a KLISS URI cannot tell if the resource being picked out is "really there" or the result of running a query against the repository.

The hierarchical information model in KLISS has been strongly influenced by Herbert Simon and his essay The Architecture of Complexity. The view/query model is a sort of mashup of ideas from Bertrand Russell (proper nouns as query expressions) and Saul Kripke (rigid designators) combined with the Web Architecture of Sir Tim Berners-Lee.

The most trivial views over the KLISS repository are those that correspond to real bytes-on-the-disk documents. Bills are generally like that. So too are votes. So too are sections of statute. Another level of views are those generated dynamically by assembling documents into larger documents. Volumes of statute are like that. Journals are like that. Once assembled, these documents often go back into the repository as real bytes-on-the-disk documents. This creates a permanent record of the result of the assembly process but it also allows the assemblies to be, themselves part of further assemblies. Permanent journals are like that. Final calendars are like that. Chronologies of statutes are like that.
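The levels of views described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Repository` class, its method names and the URIs are all hypothetical, and a real repository would be versioned rather than a pair of dictionaries. The point it shows is that a caller of `get` cannot tell a stored document from an assembled one, and that an assembly can be frozen back into the store as a document in its own right.

```python
class Repository:
    """Toy store in which a URI may name stored text or a query that builds it."""

    def __init__(self):
        self._stored = {}  # uri -> text held as a real document
        self._views = {}   # uri -> function(repo) that assembles text on demand

    def put(self, uri, text):
        self._stored[uri] = text

    def define_view(self, uri, assemble):
        self._views[uri] = assemble

    def get(self, uri):
        # The caller cannot tell which branch produced the result.
        if uri in self._stored:
            return self._stored[uri]
        if uri in self._views:
            return self._views[uri](self)
        raise KeyError(uri)

    def children(self, folder):
        return sorted(u for u in self._stored if u.startswith(folder + "/"))

    def freeze(self, view_uri, stored_uri):
        """Store the current result of a view as a permanent document."""
        self.put(stored_uri, self.get(view_uri))


repo = Repository()
repo.put("/statute/ch1/s1", "Section 1 text")
repo.put("/statute/ch1/s2", "Section 2 text")

# A volume is assembled from the sections that exist when it is requested.
repo.define_view(
    "/volumes/ch1",
    lambda r: "\n".join(r.get(u) for u in r.children("/statute/ch1")),
)
print(repo.get("/volumes/ch1"))

# Freezing creates a permanent record that later changes will not disturb.
repo.freeze("/volumes/ch1", "/published/volume-ch1")
```

Once frozen, `/published/volume-ch1` behaves like any other stored document, so it can itself be pulled into a larger assembly.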

Yet another level of views are those generated from the KLISS meta-data model...In KLISS, any document in the system can have any number of property/value pairs associated with it. When transactions are stored in the repository, these property/value pairs are loaded into a relational database behind the scenes. This relational database is used by the query subsystem to provide fast, ordered views over the repository. The sort of queries enabled are things like:

Give me all the bill amendments tabled between dates X and Y
Give me all the sponsors for all bills referred to the Agriculture committee last session
Give me all bills with the word "consolidation" in their long titles
How many enrolled bills have we so far this session?
etc.

At this point I need to point out that although we use a relational database as the meta-data indexer/query engine in KLISS, we do not use it relationally. This is by design. At this core level of the persistence model, we are not modeling relationships *between* documents. Other levels provide that function (we will get to them later on). Effectively what we do is utilize a star schema in which (URI+Revision Number) is the key used to join together all the metadata key/value pairs. The tabular structure of the meta-data fields is achieved via a meta-modeling trick in which the syntax of the field name indicates what table, what field and what field type should be used for the associated value. In the future, we expect that we will gravitate away from relational back-ends toward the non-relational stores that are, thankfully, finally beginning to become commonplace.
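Here is a minimal sketch of that arrangement using SQLite from the Python standard library. The naming convention (`"<field>:<type>"`), the table names and the sample URIs are assumptions made for illustration; the post does not show the real KLISS field syntax, and this sketch simplifies matters by using one table per value type.

```python
import sqlite3

# Hypothetical naming convention: "<field>:<type>", e.g. "tabled_on:date".
# The type suffix selects the table. ISO dates are stored as text so they sort.
SQL_TYPES = {"text": "TEXT", "int": "INTEGER", "date": "TEXT"}


def open_index():
    db = sqlite3.connect(":memory:")
    for suffix, sql_type in SQL_TYPES.items():
        db.execute(
            f"CREATE TABLE props_{suffix} ("
            f"uri TEXT, revision INTEGER, name TEXT, value {sql_type}, "
            "PRIMARY KEY (uri, revision, name))"
        )
    return db


def index_document(db, uri, revision, props):
    """Route each property to a table chosen by its name's type suffix."""
    for field, value in props.items():
        name, _, suffix = field.rpartition(":")
        if suffix not in SQL_TYPES or not name:
            raise ValueError(f"unrecognised field name: {field!r}")
        db.execute(
            f"INSERT INTO props_{suffix} VALUES (?, ?, ?, ?)",
            (uri, revision, name, value),
        )


def tabled_between(db, start, end):
    rows = db.execute(
        "SELECT uri, revision FROM props_date "
        "WHERE name = 'tabled_on' AND value BETWEEN ? AND ? ORDER BY value",
        (start, end),
    )
    return rows.fetchall()


def titles_containing(db, word):
    # (uri, revision) is the only join key: titles and dates meet through it.
    rows = db.execute(
        "SELECT t.uri FROM props_text AS t "
        "JOIN props_date AS d ON d.uri = t.uri AND d.revision = t.revision "
        "WHERE t.name = 'long_title' AND d.name = 'tabled_on' "
        "AND t.value LIKE ? ORDER BY d.value",
        (f"%{word}%",),
    )
    return [uri for (uri,) in rows]


db = open_index()
index_document(db, "/amendments/a1", 101,
               {"tabled_on:date": "2010-02-03",
                "long_title:text": "Consolidation of water districts"})
index_document(db, "/amendments/a2", 102,
               {"tabled_on:date": "2010-05-20",
                "long_title:text": "Highway funding"})
print(tabled_between(db, "2010-01-01", "2010-03-31"))
print(titles_containing(db, "consolidation"))
```

Nothing in the tables relates one document to another; every row hangs off the (URI, revision) key, which is what keeps the database in the role of an index rather than a model.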

It is important to note that in KLISS, the meta-data database is not a normative source of information. The master copy of all data is, at all times, in the documents themselves. The metadata is stored in the documents themselves (the topic of an upcoming post). The database is constructed from the documents in order to serve search and retrieval needs. That is all. In fact, the database can be blown away and simply re-created by replaying the transactions from the KLISS time machine. I sometimes explain it by saying we use a database in the same way that a music collection application might use a database. Its purpose is to facilitate rapid slicing/dicing/viewing via meta-data.
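The "blow it away and replay" idea can be sketched in a few lines. How KLISS embeds metadata inside documents is left to a later post, so the header layout assumed here (`Name: value` lines ending at the first blank line) is purely hypothetical, as are the function names and sample transactions.

```python
def extract_metadata(document):
    """Read 'Name: value' header lines up to the first blank line (assumed layout)."""
    props = {}
    for line in document.splitlines():
        if not line.strip():
            break  # the blank line ends the metadata header
        name, sep, value = line.partition(":")
        if sep:
            props[name.strip()] = value.strip()
    return props


def rebuild_index(transactions):
    """Replay (revision, uri, document) transactions into a fresh index."""
    index = {}
    for revision, uri, document in transactions:
        for name, value in extract_metadata(document).items():
            index[(uri, revision, name)] = value
    return index


transactions = [
    (1, "/bills/hb1001", "Status: introduced\nSponsor: Smith\n\nBill text..."),
    (2, "/bills/hb1001", "Status: in committee\nSponsor: Smith\n\nBill text..."),
]

index = rebuild_index(transactions)
index.clear()                        # "blow away" the database
index = rebuild_index(transactions)  # and re-create it from the documents
print(index[("/bills/hb1001", 2, "Status")])
```

Because the index is derived entirely from the documents, losing it costs rebuild time but no information.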

This brings me to the most important point about how information is organized in KLISS. Let's step all the way back for a moment. Why do we humans organize stuff at all? We organize in order to find it again. In other words, organization is not the point of organization. Retrieval is the point of organization. Organization is something we do now, in anticipation of facilitating retrieval in the future. For most of human history, this has meant creating an organizational structure and packing stuff physically into that structure. Shoe closets, cities, pockets, airplanes, filing cabinets, Filofaxes, bookshelves, Dewey Decimal classification...

As David Weinberger explains in his book "Everything is Miscellaneous", there is no need for a single organizational structure for electronic information. A digital book does not need exactly one shelf on one wall, classified under one dominant heading. It can be on many shelves, on many walls, under many headings, in many ontologies, all at the same time. In fact, it can be exploded into pieces, mashed up with other books and represented in any order, in any format, anywhere and at any time. Not only is this possible thanks to IT, it cannot be stopped. All known attempts – and there have been numerous – since the dawn of IT have failed to put the organization genie back in the bottle...

Having said that, the tyranny of the dominant decomposition appears, per Herbert Simon, to be woven into the fabric of the universe. In order to store information – even electronically – we must *pick* at least some organizational structure to get us started. At the very least, things need to have names, right? OK. What form will those names take...Ten minutes into that train of thought and you have a decomposition on your hands. So what decomposition will we pick for our legislative/parliamentary materials? Do committees contain bills or do bills contain committees? Is a joint committee part of the house data model or part of the senate data model or both? Are bill drafts stored with the sponsor or with the drafter? Are committee reports part of the committee that created them or part of the bills they modify? etc. etc...One hour later, you are in a mereotopology-induced coma. You keep searching for the perfect decomposition. If you are in luck, you conclude that there is no such thing as the perfect decomposition and you get on with your life. If you are unlucky, you get drafted into a committee that has to decide on the correct decomposition.

Fact of life: If there are N people in a group tasked with deciding an information model, there are exactly N mutually incompatible models vying for dominance and each of the N participants is convinced that the other N-1 models are less correct than their own. Legislatures/parliaments provide an excellent example of this phenomenon. Fill a room with drafting attorneys, bill status clerks, journal clerks, committee secretaries and fiscal analysts, ask each of them to white-board their model of, for example, bills, and you will get as many models as there are people in the room.

That is why, in KLISS, by design, the information model – how it carves up into documents versus folders, paragraphs versus meta-data fields, queries versus bytes-on-the-disk – does not really matter. Just pick one! There are many, many models that can work. Given a set of models that will work, there is generally no compelling reason to pick any particular one. In legislatures/parliaments – as in many other content-centric applications – the word "correct" needs a pragmatic definition. In KLISS, we consider an information model to be "correct" if it supports the efficient, secure production of the required outputs with the required speed of production. That is essentially it. Everything else is secondary and much of it is just mereotopology.

Two more quick things before I wrap up for today. You may be thinking, "how can a single folder structure hope to meet the divergent needs of all the different stakeholders who likely have different models in their head for how the information should be structured?" The way KLISS does it is that we create synthetic folder structures – known as "virtual views" – over the physical folder structure. That allows us to create the illusion – on a role by role basis – that each group's preferred structure is the one the system uses :-)

As well as helping to create familiar folder structures on a role-by-role basis, virtual views also allow us to implement role-based access control. Every role in the system uses a virtual view. Moreover, all event notifications use the virtual views and all attempted access to assets in the repository is filtered through the user's virtual view – that includes all search results.
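One way to picture a virtual view is as a per-role mapping from virtual folder prefixes to physical ones. The sketch below is an assumption-laden illustration, not the KLISS mechanism: the `VirtualView` class, its methods and every path are hypothetical. It shows the two jobs described above: presenting a role's preferred structure, and refusing or filtering out anything that falls outside that role's view.

```python
class VirtualView:
    """Maps a role's virtual folder names onto physical repository folders."""

    def __init__(self, mounts):
        self.mounts = mounts  # virtual prefix -> physical prefix

    @staticmethod
    def _under(path, prefix):
        return path == prefix or path.startswith(prefix + "/")

    def to_physical(self, virtual_path):
        for vprefix, pprefix in self.mounts.items():
            if self._under(virtual_path, vprefix):
                return pprefix + virtual_path[len(vprefix):]
        # Anything outside the view simply does not exist for this role.
        raise PermissionError(virtual_path)

    def to_virtual(self, physical_path):
        for vprefix, pprefix in self.mounts.items():
            if self._under(physical_path, pprefix):
                return vprefix + physical_path[len(pprefix):]
        return None

    def filter_results(self, physical_paths):
        """Drop search hits the role cannot see; rename the rest."""
        visible = []
        for path in physical_paths:
            virtual = self.to_virtual(path)
            if virtual is not None:
                visible.append(virtual)
        return visible


secretary = VirtualView({"/my-committee": "/senate/committees/agriculture"})
print(secretary.to_physical("/my-committee/bills/sb12"))
print(secretary.filter_results([
    "/senate/committees/agriculture/bills/sb12",
    "/house/drafts/hb9",
]))
```

Because search results and event notifications would pass through the same translation, the role never sees a physical name at all.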

To sum up...KLISS uses a virtualized hierarchical information model combined with property/value pairs arranged in a star-schema fashion. Properties are indexed for fast retrieval and based on scalar data types that we leverage for query operators e.g. date expression evaluation, comparisons of money amounts etc. The metadata model is revision based and the repository transaction semantics guarantee that the metadata view is up to date with respect to the time machine view at all times. All event notifications use the virtual view names for assets.

You may be wondering, "is it possible to have a document with no content other than metadata?". The answer is "yes". That is exactly how we reify non-document concepts like committees, members, roles etc. into document form for storage in the time machine. Yes, in KLISS, *everything* is a document:-)

Next up: Data models, data organization and why the search for the "correct" model is doomed.

Thursday, June 24, 2010

KLISS: The Eventing Model and the Consistency Model

Last time in this KLISS series (and the time before that), I concentrated on the concept of names for information assets. This seemingly peripheral concern is, in my view, critical in legislative informatics. I talked about how a well thought-out set of names, sitting on top of a "time machine" oriented persistence substrate, helps dramatically to meet many needs in legislatures/parliaments including rigorous citation and a rigorous transparency audit trail. Happily, a name-oriented focus sits very nicely on top of the world wide web architecture and, in particular, sits nicely with RESTian system architectures. (If REST is new to you, you might be interested in starting with this article I wrote some years ago and the resources referenced at the end.)

In this installment on KLISS, I want to turn to the closely related concept of events and how it fits in with the time machine model in the KLISS architecture. When I say KLISS is a time machine I do not mean that KLISS sits there, recording what is happening in real time 24x7. The reason is that for long stretches of time, nothing actually happens because nothing is going on inside the "black box" that is the legislature/parliament. When I say nothing is going on, I mean that nobody is doing anything. There are no actors acting inside the black box. Therefore, there is nothing to record into the time machine. We call this a quiescent system state. We are like Beckett, waiting for Godot...

Now, as soon as somebody *acts*, the time machine persistence layer captures the act itself as a transaction against the time machine. The act could be the introduction of a bill, the explanation of a vote, a point of personal privilege to be recorded in the journal, an update to the statute, a referral of a bill to a conference committee etc. Such acts rarely – if ever – stand alone. Picture an atom smasher. One "event" comes in and bang, many secondary events are triggered. These secondary events may trigger tertiary events and so on. Eventually, if there are no new primary events, the system quiesces again.

To ground what follows in a practical scenario, consider what happens when a new bill is introduced. Here is a representative series of events that might occur...

A member in a chamber (the sponsor) gets permission to speak through the chair and announces the bill.
The relevant chamber clerk "calls" for the bill from legislative council/revisors office/bills office.
The event is recorded in the journal.
The new bill is allocated a new identifier and added to bill status.
Prints of the bill are called for, for each Member.
PDF (and possibly HTML) versions are created and posted.

I am greatly generalizing and simplifying here, but I hope you can see how one event leads to a set of secondary events and how each of those secondary events may themselves produce more events.
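The cascade above can be sketched as a work queue that is processed until it is empty, which is the quiescent state. The event names and the `CASCADE` table are hypothetical, chosen only to mirror the bill-introduction example; a real system would dispatch to sub-systems rather than look up a static table.

```python
from collections import deque

# Hypothetical table: which secondary events each event triggers.
CASCADE = {
    "bill.introduced": ["journal.updated", "bill_status.updated", "prints.requested"],
    "prints.requested": ["pdf.published", "html.published"],
}


def run_until_quiescent(primary_event):
    """Process one primary event and everything it triggers, in arrival order."""
    pending = deque([primary_event])
    processed = []
    while pending:  # an empty queue means the system has quiesced
        event = pending.popleft()
        processed.append(event)
        pending.extend(CASCADE.get(event, []))
    return processed


for event in run_until_quiescent("bill.introduced"):
    print(event)
```

One primary event produces six processed events here; with no new primary events arriving, the queue drains and nothing further happens.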

At an IT architecture level, two main questions arise. Firstly, how do we arrange that all the interested entities get informed of the existence of a new event? Secondly, how do we arrange that the "views" of the state of the legislature are kept consistent across the various information assets that record the events? i.e. the bill status system, the PDF of the bill, the HTML of the bill, the PDF of the journal. KLISS achieves both using an asynchronous XML-based messaging backbone. Every time the time machine is changed, an event notification is sent out and all interested sub-systems have the chance to act as they wish. Any acts taken can themselves trigger *further events* perhaps involving further transactions against the time machine.

This takes place asynchronously. That is very important and I'd like to explain it as it is critical to the model. Classical "database-think" operates on ACID principles. This is problematic in legislatures/parliaments (in fact, I believe it is problematic in most document-centric domains) because to achieve overall information "consistency" I need to update bill status pages, generate PDFs of journals, convert Bills to HTML and post them on websites, update the search indexes, push out the twitter updates etc. There is simply too much to do for me to be able to lock the entire repository, update everything and then free the lock. Even if I could, it would create significant temporal coupling between sub-systems. Temporal coupling is, in general, bad news. What if one of my sub-systems (say PDF generation) is running slow because of load or maybe offline because of a fault? I cannot afford to wait around for it to become available. I cannot fail the transaction simply because some sub-system is not in a consistent state with respect to the rest of the sub-systems. What to do?

Remember when I talked about the time-machine repository and the fact that each change – each transaction – has a unique revision number? Remember how I mentioned that the URIs to retrieve content from the repository include the revision numbers? Well, every event sent out by KLISS includes the revision number of the transaction. That way, sub-systems that receive the event can look at the repository as it looked at that revision number, i.e. at the timestamp when the revision occurred. Think of an Automatic Teller Machine. You put in your card and ask for a balance on an account. Does the machine tell you the balance as it is *right now*? No. It tells you the balance as it stood the moment the query hit the ledgers. One millisecond later, a million dollars might have hit your account. Does that make the printout you got from the ATM incorrect? No, because the printout stipulates the timestamp at which the ledger query happened. Maybe it was two milliseconds ago. Maybe it was two years ago. It does not matter. The printout is correct because it is locked to a point in time by the timestamp printed on it.

KLISS works the same way. All bills, all journals, all bill status pages, all aggregated publications, all hyperlinks...encode point-in-time information. All views of the time-machine basically say "I was run when the time-machine repository was at revision 1234. Everything you see on this page is correct as of revision 1234..."

This is critical because it removes a whole slew of otherwise very thorny problems. For example, what happens if I generate a page that tells me what bills are in committee X by looking into the time machine folder where the bills are stored? What if, 1 millisecond later, somebody moves the bill out of that committee? It doesn't matter because the first thing we do when generating any view of the repository is to find out what the current revision number is. Let's say it is revision 1234. All subsequent queries against the repository pass that revision number in. The view itself then displays a footer saying "Correct as of revision 1234 15:43, 2010010".
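The "find the revision first, then pass it to every query" discipline is easy to sketch. Everything named here is hypothetical (the `TimeMachine` class, the folder paths, the footer wording), and the sketch keeps a full copy of the repository per revision purely for brevity; it makes no claim about how KLISS stores revisions.

```python
class TimeMachine:
    """Toy repository that remembers its full contents at every revision."""

    def __init__(self):
        self._snapshots = [{}]  # list index = revision number; revision 0 is empty

    @property
    def head(self):
        return len(self._snapshots) - 1

    def commit(self, changes):
        """Apply {path: content} as one transaction; a None content deletes."""
        snapshot = dict(self._snapshots[-1])
        for path, content in changes.items():
            if content is None:
                snapshot.pop(path, None)
            else:
                snapshot[path] = content
        self._snapshots.append(snapshot)
        return self.head

    def list_at(self, revision, folder):
        return sorted(p for p in self._snapshots[revision]
                      if p.startswith(folder + "/"))


def render_committee_page(repo, folder, revision):
    # Every read below uses the pinned revision, never "whatever is current".
    names = [path.rsplit("/", 1)[-1] for path in repo.list_at(revision, folder)]
    return "\n".join(names + [f"Correct as of revision {revision}"])


repo = TimeMachine()
repo.commit({"/committees/agriculture/hb1001": "text",
             "/committees/agriculture/hb1002": "text"})

pinned = repo.head  # first step of generating any view

# A moment later somebody moves hb1002 to another committee.
repo.commit({"/committees/agriculture/hb1002": None,
             "/committees/judiciary/hb1002": "text"})

print(render_committee_page(repo, "/committees/agriculture", pinned))
```

The page still lists both bills and says so truthfully, because its footer names the revision it describes.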

This model has a variety of names. Some call it idempotency and that is certainly part of it, i.e. given a URI with a revision number, KLISS will always, always, always return the same stream of bytes. It will never change. It is a classic candidate for a GET operation that has no side-effects on the information corpus. I prefer to use Werner Vogels' term "Eventually Consistent" to describe the model. KLISS allows individual sub-systems to update their views of the repository at their own speed. If all the events quiesce and all sub-systems are operational, then the complete vista of "views" over the repository contained in all the sub-systems will eventually also quiesce and be consistent with each other. During normal operation, it is to be expected that some sub-systems will be updated later than others but their views are never wrong – they are simply reflective of an older time-point. As well as Amazon's Werner Vogels, the writings of Pat Helland of Microsoft on this subject are worth reading. Bottom line: time is relative. You cannot really lock it down. Certainly with web-scale, distributed, federated systems there is no alternative but to embrace the relativity of time and work with it rather than fight against it. That is what KLISS does.

One final point on temporal decoupling before I wrap up...KLISS uses both fire-and-forget and guaranteed-delivery messaging semantics. In English, what that means is that a sub-system that may or may not be online, or may need to run slower than other sub-systems, never loses track of where the time machine is at. Messages generated for its attention are queued up and can be drawn down at as leisurely a pace as required. Sub-systems can be taken down for maintenance and spun back up. When they spin back up, any messages that they missed are sitting there, queued up to be consumed whenever. This makes high availability of a system as large as KLISS significantly easier as there are very few reasons why the system would ever need to be off-line. Individual services may go off-line but the core of KLISS itself just keeps on trucking... I think of it as Reed's End-to-End Argument applied at the application level. KLISS puts as little "smart" stuff in the center of the architecture as possible, leaving most of the customer-facing "smart" stuff out at the edges.
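The guaranteed-delivery half of that can be sketched as one queue per subscriber. This is an in-memory stand-in with hypothetical names (`MessageBackbone`, the subscriber names, the event fields); the real backbone is XML-based and persistent, neither of which is modelled here.

```python
from collections import deque


class MessageBackbone:
    """Toy backbone that holds undelivered events separately for each subscriber."""

    def __init__(self):
        self._queues = {}  # subscriber name -> deque of undelivered events

    def subscribe(self, name):
        self._queues.setdefault(name, deque())

    def publish(self, revision, uri):
        # Every event carries the revision number of the transaction behind it.
        for queue in self._queues.values():
            queue.append({"revision": revision, "uri": uri})

    def drain(self, name, limit=None):
        """Hand over queued events in order, at whatever pace the caller wants."""
        queue = self._queues[name]
        delivered = []
        while queue and (limit is None or len(delivered) < limit):
            delivered.append(queue.popleft())
        return delivered


backbone = MessageBackbone()
backbone.subscribe("search-index")
backbone.subscribe("pdf-generator")  # imagine this one is down for maintenance

backbone.publish(1234, "/bills/hb1001")
caught_up = backbone.drain("search-index")  # keeps pace with the repository
backbone.publish(1235, "/journal/day-42")
backbone.publish(1236, "/bills/hb1001")

# The PDF generator comes back and draws down its backlog one event at a time.
first = backbone.drain("pdf-generator", limit=1)
rest = backbone.drain("pdf-generator")
print([event["revision"] for event in first + rest])
```

Publishing never waits on a subscriber, so a slow or absent sub-system delays only itself.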

By now, I hope you are beginning to see that we do not do content management in KLISS in the classical "static" model of simply storing stuff in a repository-of-the-now. In KLISS:

All content conforms to an enterprise information model. It is not just documents in folders. KLISS represents system actors and workflows and roles and committees etc. as *documents*.
All content is part of the time-machine model. In KLISS moving a bill into a committee or changing the phone number of the speaker Pro Tem is precisely the same repository operation as updating a piece of statute.
Any changes to the information model are timestamped and communicated via persistent, asynchronous messaging to all sub-systems, which can then use that timestamp to lock down time for their own interactions with the repository.

One final point: the event-oriented model in KLISS can be usefully conceptualized in terms of a formalism known as Speech Acts. During analysis phases, I find it very useful to separate my illocutions from my perlocutions as it helps me see where secondary and indeed N-ary event cascades are likely to happen. If the concept of speech acts flips your switch (or aspirates your fricatives), you might be interested in this article on the subject.

Next up: Organizing legislative material in KLISS.

Tuesday, June 22, 2010

Honi soit qui mal y pense

France 1 South Africa 2

Monday, June 21, 2010

KLISS: rigid designation URIs in KLISS

Last time, I talked about rigid designators and their role in the KLISS architecture.

First the bad news. Before continuing with this post, I have a confession to make. I do not believe there is such a thing as a truly rigid designator – certainly not in a form that can be implemented in technology. There are simply too many uncontrollable contextual variables that would need to be locked down and the world – specifically the Web – simply does not work that way. I won't go into the details here. (That is a topic for a bar conversation sometime. If you happen to be interested: take the causal chain that underlies network protocols, mix in sunyata with Saussure's structuralism and Wittgenstein's concept of language games, and you end up in the ballpark of my objections to the existence of truly rigid designators.)

Now the good news. The web makes it possible to get closer than ever before because names on the Web are actionable – you can click on them...That sounds trivial but it is not. What is the Web really? (I mean really?). It is an unspeakably large set of names combined with the machinery necessary to pick out units of texts given those names.

The names are called URIs.
The units of text are called resources (or, strictly speaking, representations of resources.)
The machinery for picking out a unit of text (resource) given a name (URI) is the HTTP protocol.

In order to add as much rigidity to our designators as possible for law, we need to find a way to de-contextualize the referring process as much as possible i.e. we do not want the unit of text we get back when we de-reference to be dependent on the context in which the de-referencing takes place. i.e. we retrieve it today, we retrieve it tomorrow, user A retrieves it or user B retrieves it, it is retrieved from continent A or from continent B, from browser A or browser B, in format A or format B...we would like all those referring variants to return the same unit of text.

This is tough. A truly wicked problem. However, we can make tremendous strides towards rigidity on the Web by just attacking one dimension of the problem. Namely, time.

Now you might be thinking "I see where this is going...we add date-stamps into our URIs so that they don't change and we have removed the time dimension from the context". That would certainly be a step in the right direction for law on the web where link rot is a major problem. However, it does not address the full problem. In a corpus of law, I am interested in being able to reference units of text locked down in time but I also want to be able to see the entire corpus as it was at that same moment in time. In other words, if I grab HB2154 as it looked at noon, 1 Jan, 2010, to fully understand what I'm looking at, I would like to be able to see what everything else in the corpus looked like at noon, 1 Jan, 2010.

Why is this "point in time" snapshot important in legislatures/parliaments? A number of reasons:

Democratic transparency
Accuracy of aggregate publications
Accuracy of real-time displays
Legislative intent
History in the context of its creation

Let us take each of these in turn.

Democratic transparency

The ability to see moment-by-moment snapshots of the entire recorded democratic process and move seamlessly forward and backwards in time watching how things changed...following hyper-links from bills to sponsors to committees to journals to statute to committee testimony...not only across documents but also through time...I can do that if I have the ability to re-create the corpus as it looked at any point in time and if I have designed my systems so that all inter-document linkages are expressed in terms of time-oriented URIs.

It is possible to have a legislative/parliamentary process by just publishing law. However, to have a truly democratic legislative/parliamentary process, you need to add participation and transparency. I mentioned earlier that a legislature/parliament can be thought of as a machine that takes the corpus of law at time T and produces a new corpus of law at time T+1. A democratic legislature/parliament does not simply announce the new corpus. It exposes how the new corpus was arrived at. I know of no more powerful way of doing that with technology than recording the moment-by-moment activity and allowing that activity to be "played back" and interpreted moment-by-moment.

Accuracy of aggregate publications

Legislatures/parliaments are awash with aggregate publications. (I discussed these in an earlier post.) Bills pull in ("transclude" in the geek vernacular) statute. Journals pull in votes. Statute volumes pull in statute sections, case annotations etc. From an accuracy perspective, it is very important to be able to look at a Bill or a Journal and ask "What units of text actually got aggregated here?". It is very important to be able to ask "What votes were actually in the repository at the time that these 6 were added into the journal?" The ability to reach into the corpus as it looked at a particular point in time allows this to be done accurately. In other words, time can be locked down so that the publication is correct as of the time the report was run.

Accuracy of real-time displays

Most legislatures/parliaments produce formal outputs at timescales established in the early days of printing e.g. 24 hour turnaround times necessitated by the need to typeset and then print to paper – generally offsite. In the last 20-30 years, IT has made it possible to shrink these timescales and most legislatures/parliaments operate some form of "real-time" information dissemination such as live audio/video, live bill status screens and so on. Even those that do not provide real-time information find that a visitors' gallery is all it takes these days for worldwide information dissemination – at the speed of light – via Twitter and blogs etc.

I think of these "real time" displays as very fast aggregate publications. In fact, from an engineering perspective, they are built the same way in KLISS. So, just as it is vital to know what votes were in the system when it is decided that 6 can go into the journal, it is vital to know what all the actions were against HB1234 in the system at the time that 12 were identified to be listed on the bill status screen. Again, the ability to reach into the corpus as it looked at a particular point in time allows this to be done accurately. In other words, time can be locked down so that the publication is correct as of the time the report was run.

Legislative intent

Legislative intent is really a subtopic of Democratic transparency. It is important when establishing intent, to have the full context under which decisions were made. Again, I know of no more powerful technological mechanism for supporting this than the ability to re-create a legislature/parliament as it looked at the moment a decision was taken. Obviously, the more context we can pour into the repository of content, the better the ability to re-create the context becomes. More on that point in a moment...

History in the context of its creation

Finally, there is preservation of history as it is being created. Since the dawn of written law, people have been writing laws down and others have afterwards been poring over the writings trying to "fill in the blanks" to understand the why of it all, to get inside the minds of those who participated in the making of legislative history. In this day and age, there is no need for us to leave blanks in the record. Recording Bismarck's sausage machine purely by its final outputs (laws, journals etc.) is somewhat like that. Why not record everything that is non-confidential as it happens? Come to think of it, why not facilitate reporting what is going on, as it is going on, rather than at the end of the day or the end of the session?

At first blush, all this might seem too much of a technological leap. I contend that it is not. It is a *lot* of work, but no great new technological breakthroughs are required. The reason I so contend is that we can combine Web technologies with the delta-oriented algorithms of source code control systems to provide an excellent basis for the sort of legislative corpus management system required to achieve the above. Here is what is needed:

A repository that "snapshots" the entire state of the system at every change (revision) and allows for the repository to be re-constituted as it looked at any previous revision.
A RESTian interface to that repository that includes revision numbers so that point-in-time views of individual assets or the entire repository as it was at a point in time, can be retrieved.

That is, essentially, how content is persisted and then accessed in KLISS.

The entire name space of document assets is exposed as URIs.
The URIs include revision numbers, allowing all previous versions of any asset, since the dawn of time, to be retrieved.
Given an arbitrary timestamp, it is possible to extract the complete repository, as it looked at that timestamp.
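Those three properties can be sketched together. The sketch stores only the changes made by each revision and rebuilds a snapshot by replaying them, loosely in the spirit of source code control systems; it does not reproduce any particular delta algorithm. The `DeltaRepository` class and the `/rev/<revision>/<path>` URI shape are assumptions for illustration, since the post does not show the real KLISS URI syntax.

```python
import bisect


class DeltaRepository:
    """Toy time machine that stores only what each revision changed."""

    def __init__(self):
        self._deltas = []      # _deltas[i] holds the changes made by revision i + 1
        self._timestamps = []  # commit time of each revision, in ascending order

    def commit(self, timestamp, changes):
        """Record {path: content} at an ISO timestamp; a None content deletes."""
        if self._timestamps and timestamp < self._timestamps[-1]:
            raise ValueError("timestamps must not go backwards")
        self._deltas.append(dict(changes))
        self._timestamps.append(timestamp)
        return len(self._deltas)  # the new revision number

    def snapshot(self, revision):
        """Rebuild the whole repository as it stood at the given revision."""
        state = {}
        for delta in self._deltas[:revision]:
            for path, content in delta.items():
                if content is None:
                    state.pop(path, None)
                else:
                    state[path] = content
        return state

    def revision_at(self, timestamp):
        """Latest revision committed at or before the timestamp (0 if none)."""
        return bisect.bisect_right(self._timestamps, timestamp)

    def get(self, uri):
        # Hypothetical URI shape: /rev/<revision>/<path>
        _, marker, revision, path = uri.split("/", 3)
        if marker != "rev":
            raise ValueError(f"not a revision URI: {uri!r}")
        return self.snapshot(int(revision))["/" + path]


repo = DeltaRepository()
repo.commit("2010-01-01T09:00", {"/bills/hb2154": "as introduced"})
repo.commit("2010-01-02T10:00", {"/bills/hb2154": "as amended",
                                 "/journal/day-1": "HB2154 amended"})

print(repo.get("/rev/1/bills/hb2154"))
noon = repo.revision_at("2010-01-01T12:00")
print(noon, repo.snapshot(noon))
```

Asking for the repository "as of noon" resolves to a revision number first, after which every asset is read at that same revision.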

Now you might be thinking that this time machine-like view into the history of a legislature/parliament can only contain some of the context required to fully understand what happened, who did what and, ultimately, help determine why things happened the way they did. That is true but the context it contains is significantly extended by the fact that in KLISS, pretty much *everything* is a document (or a folder of documents). Chambers are modeled as folders/documents. Members are folders/documents. Committees are folders/documents. Why? Because then, they are first class members of the "time machine". I can re-create the movements of bills through committees by looking back in time at how my committee folders (which are exposed as URIs) changed with respect to time. I can recreate what the entire state of play was at the time a vote was called. Moreover, if I wish, I can replay the events (which of course are all URI posts) in order to re-run history and watch it unfold in front of our eyes...
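As a rough illustration of re-creating a bill's movements from folder history, the sketch below scans a list of transactions in revision order and notes each time the bill turns up in a new folder. The transaction format, folder names and `trace_bill` function are all hypothetical.

```python
def trace_bill(transactions, bill):
    """Return [(revision, folder)] for each folder the bill moves into."""
    history = []
    for revision, changes in enumerate(transactions, start=1):
        for path, content in changes.items():
            folder, _, name = path.rpartition("/")
            if name != bill or content is None:
                continue  # a different asset, or the bill leaving a folder
            if not history or history[-1][1] != folder:
                history.append((revision, folder))
    return history


# Each transaction maps paths to content; None marks removal from a folder.
transactions = [
    {"/house/introduced/hb2154": "text"},
    {"/house/introduced/hb2154": None,
     "/house/committees/agriculture/hb2154": "text"},
    {"/house/committees/agriculture/hb2154": "amended text"},
    {"/house/committees/agriculture/hb2154": None,
     "/house/floor/hb2154": "amended text"},
]

for revision, folder in trace_bill(transactions, "hb2154"):
    print(revision, folder)
```

An edit that leaves the bill in the same folder (revision 3 here) changes its content but not its location, so it adds nothing to the trace.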

In summary, in KLISS, names are really really important. We make names as rigid as we can within the confines of what is technologically feasible. We map all non-confidential names directly into URI-space and we add the time dimension to each URI to allow not only the retrieval of an individual asset at a point in time but *all* assets as they stood at that point in time. We do this at a technical level by leveraging the delta algorithms commonplace today in good source code control systems, combined with HTTP and a time-oriented twist on a RESTian interface.

Next up: event generation in KLISS and its role in enabling real-time telemetry as well as notification. Also, how this fits with the eventual consistency model used in KLISS.

Sean McGrath
