Saturday, June 19, 2010

KLISS: The importance of naming things well

Last time, in this series on KLISS, I talked a little about the useful overlaps between paradigms/tools/techniques for managing corpora of source code and corpora of law. I mentioned that when I personally look at law from my engineering perspective, I see the same sorts of things I see when I look at source code, namely highly structured, densely inter-linked, temporally bound units of text.

There is a temptation – one I intend to avoid – to jump at this point into concerns about the units of text themselves and in particular, to worry about what format the units should be stored in. In other words, to worry about syntax. Should the law be HTML? Should it be Docbook? Should it be <insert name of word-processor or DTP package here>? I don't want to go there today. Not because the question is not important. It is *very* important. But there are bigger, more important questions that need to be addressed first. First amongst them is the question of naming. Yes, as trivial as it sounds, I want to talk about naming things.

Phil Karlton once said that there are two hard problems in computer science: cache invalidation and naming things. Anybody who has written any software knows the effort that goes into naming things. Files have names. Files live in folders that have names. Files contribute to modules that have names. Modules are made up of classes, methods, functions, variables which...yes...all have names. Functions/methods consist of statements that either create new names or reference existing names or other functions/methods, modules... Names everywhere.

Law is similar. Bills have names. Statute titles have names. Committees have names. Agencies have names. Parliamentary procedures have names. Voting Members have names. Bills refer to statute by name. Statute refers to statute by name. Committees refer to Bills by name. Journals refer to Committees by name. Calendars refer to Committees by name... Names everywhere.

I labor this point because names are the vehicle through which the dense inter-linkages are expressed: both in source code and in law. It is not possible, in my opinion, to have an information model in either domain without a detailed conceptualization of naming. You could call it a "naming convention" and that would be fine but I prefer to call it a "referencing model" because so much of the value in a naming convention comes from its use to reference – to pick out – information objects.

So how, historically, has law gone about "picking out" information objects like bills and statute? How (in its much shorter history) has software gone about "picking out" information objects like functions and modules?

Three examples from law, each with a short explanation about what I find interesting about it from a referencing model perspective:

  • United States v. Lane, 474 U.S. 438 (1986): Picks out a particular unit of text (in this example case law) by providing a set of attributes that includes a timestamp. No other context required.
  • HB 2130 approved on final action: Extra context required in order to pick out a unit of text (in this case a house bill), because numbers like "2130" are re-used every legislative biennium.
  • K.S.A. 74-8905 and amendments thereto: Picks out a unit of text (in this example, a statute) but implicitly adds "as it looks today" by adding "and amendments thereto".

Notice how time is critical in all three examples in order to pick out a definitive unit of text. The first locks down time explicitly with a timestamp. The second cannot be used to yield a unit of text without further context, i.e. what Biennium (and indeed, what Legislature) is being referred to here? The third one picks out a unit of text but allows for the unit of text to change, depending on when you de-reference this reference. If you "look" tomorrow, 74-8905 might say something different from what it says today.
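A toy sketch makes the three styles concrete. Assuming a corpus keyed by name with timestamped revisions (the names, dates and texts below are invented for illustration, not real citations):

```python
from bisect import bisect_right

# Toy corpus: each named unit of text is a chronologically ordered
# list of (timestamp, text) revisions. All content is hypothetical.
corpus = {
    "74-8905": [(1990, "original section text"),
                (2005, "text as amended")],
}

def resolve_at(name, timestamp):
    """Style 1: name plus explicit timestamp picks out one definitive
    unit of text. No other context required."""
    revisions = corpus[name]
    times = [t for t, _ in revisions]
    i = bisect_right(times, timestamp) - 1
    if i < 0:
        raise LookupError(f"{name} did not exist at {timestamp}")
    return revisions[i][1]

def resolve_with_context(name, context):
    """Style 2: a bare name like 'HB 2130' is ambiguous until external
    context (which biennium?) supplies the missing timestamp."""
    return resolve_at(name, context["biennium"])

def resolve_current(name, now):
    """Style 3: 'and amendments thereto' -- what you get depends on
    when you de-reference."""
    return resolve_at(name, now)
```

De-referencing the same name with different "now" values yields different units of text, which is exactly the point of the third style.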

Each of these referencing approaches can be found in software too:
  • from string import regex: Yields a unit of text but without knowing what version of Python is installed, we cannot be sure what is in it.
  • java -jar poi-3.6-20091214.jar: Yields a unit of text unambiguously by virtue of the version and timestamp information included in the name of the jar file. (Extra surety is provided by an MD5 hash value so that we can know that our poi-3.6-20091214.jar is the same as that published by the developers.)
  • google docs edit -- title "Shopping list": Picks out a unit of text (the Google docs application) but allows the application to vary. In other words, what you get when you de-reference is the application as it exists right now. It might be different tomorrow.
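The "extra surety" in the jar example is just a digest comparison. A minimal sketch of that check, using Python's standard hashlib (the file path and digest are whatever the publisher ships, nothing here is specific to POI):

```python
import hashlib

def md5_of(path):
    """Compute the MD5 digest of a file, read incrementally so large
    artifacts do not need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, published_digest):
    """True if our copy is byte-for-byte the one the developers shipped."""
    return md5_of(path) == published_digest
```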

Naming things is just plain hard. If you did not believe that before now, I hope I have helped convince you. Picking a unit of text unambiguously out of the ether and keeping its semantics in exact accordance with the intent of the original creator is a deep, deep problem in many walks of life. Two of which are law and computer science. Doing the problem justice would require a very long detour into semiotics, semantics, pragmatics, linguistics, epistemology and situation theory to name a few. Although fascinating stuff if it floats your boat (it floats mine although, like law, I am strictly a layman in this field), we will limit the discussion to the smallest amount of language theory necessary for me to explain how KLISS works with respect to naming things. Namely (!), what is known as the descriptivist theory of names and in particular Kripke's concept of rigid designation.

The problem of naming things is as old as human communication and remains "unsolved" to this day. When I say it is unsolved, I mean that we spend most of our time as humans referring to things ambiguously and we use context and probabilities to disambiguate. If I say "Python" (there, I just said it!) you will probably think "Python the programming language" because of the context in which you read this text. You will not (I suspect) immediately think "Python the snake" but you won't completely rule it out either. It is just more likely that I'm referring to – picking out - the programming language. Similarly "HB2145" is ambiguous without more context but if you read about HB2145 in the Journal of the House in the great state of Tumbolia in 2010, you will likely conclude that it refers to – picks out – HB2145 in Tumbolia in the 2010 legislative session. In fact, if there is no other surrounding context you may conclude that the unit of text being referred to is HB2145 as introduced – as distinct from as amended by committee or floor action for example.

Bertrand Russell and Gottlob Frege are two of the philosophers who thought about the problem of naming things and were (very broadly speaking) of the opinion that names were really query expressions in disguise, i.e. a name like "HB2145" is really a short code - an alias - for the full name which is something like "HB2145 as introduced in the Tumbolia state legislature, 2010".

Kripke (very broadly speaking again!) disagreed. The details need not concern us here. Suffice it to say that Kripke coined the term "rigid designator" to mean a name that picks out the same thing in all possible worlds.

In all possible worlds...What a great thing to have! If you read my earlier KLISS post about the worryingly quantum mechanical nature of digital data you will see why I find the notion of a rigid designator so appealing.

If I had a rigid designator for each unit of text in my corpus of law (or source code):
  • I would not need any other context to get at the unit of text (the "referent" as it is known in the vernacular)
  • I would not need to worry about who is referencing, when they are referencing, where they are doing the referencing from etc. The same unit of text will be yielded every time.
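One way software gets close to this ideal is content-addressing, as systems like Git do: derive the name from the bytes of the text itself, so the same name picks out the same referent for every referrer, everywhere, always. A minimal sketch (the store and the sample text are invented for illustration):

```python
import hashlib

# A toy content-addressed store: the name *is* a digest of the bytes,
# so it cannot pick out anything other than exactly this text.
_store = {}

def rigid_name(text):
    """Derive a name from the content: same text, same name, always."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def publish(text):
    name = rigid_name(text)
    _store[name] = text
    return name

def dereference(name):
    # No who/when/where needed: the name alone yields the referent.
    return _store[name]
```

A name like this is as close to a rigid designator as software gets: it designates one fixed unit of text, and a different text simply cannot answer to it.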

That sounds just perfect for legislative informatics! Next time, I'll talk about how we incorporate rigid designation into KLISS.

For now, let me finish by mentioning a conversation I had with Bertrand Russell once. I think I have the link here...try this or maybe this...

Do both links bring you to the same place? Are both links the same? :-)

Next up Rigid Designation in KLISS.

Friday, June 18, 2010

WordCup = WorldCup - France; Mexico++

Given what happened, my empathy response to this is somewhat muted.

Thursday, June 17, 2010

KLISS: Law as source code

Over the past couple of days I have received some comments - and some pushback - about my assertion that law is basically source code, so I'd like to explain what I mean. As it happens, explaining that is also a good way for me to start to explain the Legislative Enterprise Architecture that underpins KLISS, so here goes.

When I look at a corpus of law being worked by a legislature/parliament I see...
  • text, lots and lots of text, organized into units of various sizes: sections, bills, titles, chapters, volumes, codes, re-statements etc.
  • The units of text are highly stylized, idiomatized, structured forms of natural language.
  • The units of text are highly inter-linked : both explicitly and implicitly. Sections are assembled to produce statute volumes, bills are assembled to produce session laws etc. Bills cite to statutes. Journals cite bills. Bills cite bills...
  • The units of text have rigorous temporal constraints. I.e. a bill that refers to a statute is referring to a statute as it was at a point in time. An explanation of a vote on a bill is an explanation of a vote as it looked at a particular point in time.
  • The law making process consists of taking the corpus of law as it looked at some time T, making some modifications and promulgating a new corpus of law at some future time T+1. That new corpus is then the basis for the next iteration of modifications.

When I look at a corpus of source code I see...
  • text, lots and lots of text, organized into units of various sizes: modules, components, libraries, objects, services etc.
  • The units of text are highly stylized, idiomatized, structured forms of natural language.
  • The units of text are highly inter-linked : both explicitly and implicitly. Modules are assembled to produce components, components are assembled to produce libraries etc. Source files cite (import and cross-link to) other source files. Header files cite (import and cross-link to) header files. Components cite(instantiate) other components...
  • The units of text have rigorous temporal constraints. I.e. a module that refers to a library is referring to a library as it was at a point in time e.g. version 8.2. A source code comment explaining an API call is written with respect to how the API looked at a particular point in time.
  • The software making process consists of taking the corpus of source as it looked at some time T, making some modifications and promulgating a new corpus - a build - at some future time T+1. That new corpus (build) is then the basis for the next iteration of modifications to the source code.

What we have here are two communities that work with insanely large, complex corpora of text that must be rigorously managed and changed with the utmost care, precision and transparency of intent. Yet, the software community has a much greater set of tools at its disposal to help out.

How do programmers manage their corpus of text - their source code? In a database? No (at least not in the sense that the word "database" is generally used). Instead they use *source code control systems*. What do these things do? Well, the good ones (and there are many) do the following things:

  • Keep scrupulous records of who changed what and when and why
  • Allow the corpus to be viewed as it was at any previous point in time (revision control)
  • Allow the production of "builds" that take the corpus at a point in time and generate internally consistent "products" in which all dependencies are resolved
  • Allow multiple users to collaborate, folding in their work in a highly controlled way, avoiding "lost updates" and textual conflicts.

The above could be used as an overview of everything from DARCS to Mercurial to GIT to SVN - and that is just some of the open source tools. It is, to my mind, exactly the sort of feature set that the management of legal texts requires at its foundational storage and abstraction level. Right down at the bottom of the persistence model of KLISS is a storage layer that provides those features. On top of it, there is a property/value metadata store for fast retrieval based on facets, an open-ended event framework for change notification and the whole thing is packaged up as a RESTian interface so that the myriad of client applications, from bill drafting to statute publication to journal production to committee meeting management...do not have to even think about it. But I digress. More on that stuff later...
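The first two features in the list above can be sketched in a few lines. This is an illustration of the idea only, not the actual KLISS storage layer (all names and commit messages are invented):

```python
import itertools

class TinyRepository:
    """A toy append-only store: every change records who/when(-ish)/why,
    and the corpus can be viewed as it was after any earlier revision."""

    def __init__(self):
        self._log = []              # (rev, who, why, {path: text})
        self._rev = itertools.count(1)

    def commit(self, who, why, changes):
        """Record a change with its scrupulous who/why metadata."""
        rev = next(self._rev)
        self._log.append((rev, who, why, dict(changes)))
        return rev

    def view(self, at_rev):
        """The whole corpus as it looked at revision at_rev."""
        corpus = {}
        for rev, _, _, changes in self._log:
            if rev > at_rev:
                break
            corpus.update(changes)
        return corpus

    def history(self):
        return [(rev, who, why) for rev, who, why, _ in self._log]
```

Because the log is append-only, earlier states of the corpus are never destroyed by later changes, which is precisely the property legal texts demand.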

The natural language, textual level of law is my focus. I'm not attempting to make computers "understand" law by turning it into propositional calculus or some such. I'm happy to leave that quest to the strong AI community. The textual focus is why I prefer to say that "law is source code" rather than to say that "law is an operating system" because when you "execute" the law, it does not behave like most software applications. Specifically, a given set of inputs will not necessarily produce the same outputs because of the human dimensions (e.g. juries), the ongoing feedback loop of Stare decisis, the scope for once "good law" to become "bad law" by being overturned in higher courts and so on.

I believe there is much that those who manage corpora of law can learn from how software developers have met the challenge of managing corpora of source code. There are many differences and complicating factors in law for sure (and I'll be addressing many of them in the KLISS posts ahead) but at a textual level - the level at which Bismarck's Sausage Machine largely works - there is a very significant degree of overlap in my opinion. An overlap that can be and should be leveraged.

It will not happen overnight but now seems to me like an excellent time to start. All the stars are aligned: the open government directive, linked data, law.gov, the semantic web initiative, cloud computing, eBooks, revision control systems, text analytics etc. etc. The pressures are building too. The law itself is in danger in my opinion: even if open access and the many paywalls were not a problem, there is a significant authenticity issue that needs to be addressed. In an electronic age, with more and more law "born digital" the old certainties about authenticity and accuracy are rapidly fading. A replacement paradigm simply must be found. More on this topic later on when I get to talking about the KEEP (Kansas Enterprise Electronic Preservation) project I am working on now along with KLISS.

Next up: the importance of names.

For music loving, soccer loving people everywhere

This is too funny

Wednesday, June 16, 2010

It's Bloomsday but...

So it is Bloomsday again and I'm sure the usual suspects are gathering in Davy Byrnes to partake of a Gorgonzola cheese sandwich and a wee drop of best Burgundy.

On this day every year I prefer to ponder Finnegans Wake - not because of its literary value particularly - but because it pushes some of my nerd buttons. If you have been reading the KLISS series of posts those buttons will be known to you.

Take a look at this page for example.
  • It is the first page of Finnegans Wake. Notice the lowercase "r" it starts with. That is because the sentence actually starts at the very end of the book here. The entire novel is a loop.
  • The book contains a huge number of references to other books and other context which must be understood in order to fully understand the book (thank goodness for literary scholars who dig it all out so that lay readers don't have to). In literary analysis - just as in legal analysis - the content itself is often not king. The context is king. Without the context, the meaning might not be there. In primary law, it rarely is. Primary law tells you what the law says but caselaw tells you what it means.
  • Like all self-respecting books on "everything", it includes references to itself.
  • It is brimful of references to time, time experiments and even quantum mechanical noodlings on the lack of a bottom to atomic decomposition.
  • Note the line numbers down the side. These were not in Joyce's books but scholars add them in order to be able to do fine-grained citation and synoptic analysis. I worry a lot about line/page numbers:-)

Tuesday, June 15, 2010

We are hiring Django developers

Propylon are hiring Django developers to work out of our soon-to-be-opened office in KU, West Campus, Lawrence. If you are a seasoned Django/python developer and you know your way around some proper subset of the buzz phrases below, and can start soon, please contact me.

ODF, XML, EAD, METS, Premis, X.509, DOM, JMS, ActiveMQ, Jython, SOAP/WSDL, PKI, PDF/A, MySQL, XSLT, XQuery, JSON, TDD, SCRUM, NOSQL, REST.

Note that we work primarily in document-centric problem spaces. We are not your classic 3-tiered, RDB-oriented development shop. If document-centric problems do not float your boat, it is probably best not to apply. We *want* our developers to be happy and that can only be if they like their work. As a sort of litmus test, I suggest you read the KLISS series of posts. KLISS is representative of the sort of work we do. If the problems outlined at the start of that series float your boat...great.

Monday, June 14, 2010

XML in legislature/parliament environments: content aggregation and its role in content modeling and information architectures

Last time in this KLISS series I talked about the subtle inter-twingularities between content and presentation and how, in legal/regulatory environments, the quantum-like uncertainties concerning what you will see when you observe a document through software-tinted glasses are enough to keep you awake at night. This is especially true if you are trying to apply the XML standard model with its penchant for dismissing "rendering" as a merely "down-stream" activity with no desirable or necessary impact on the "up-stream" content architecture. Would that it were so....

Today I want to close out my outlining of the problem areas with some discussion about content aggregation and its role in content modeling and information architectures in legislatures/parliaments. The best way to approach the problem I want to highlight is by reference to the XML standard model. Consider the time-honored concept of "document analysis". In this phase of any XML project, you get your hands on documents produced by the entity under analysis and then you classify and decompose them. By "classification" I mean that some type system is constructed into which all the documents "fit". For example you might proceed like this:
  • The customer thinks of all the documents in this pile as "Bills"
  • Bills always have a number and a short title and a long title and one or more sponsors.
  • Ok, let us declare "Bill-ness" to be present in any document that has:

    • The word "Bill" in its long title, and
    • A unique(-ish) number, and
    • a short title, and
    • a long title, and
    • one or more sponsors.

  • Hmm...now when we look closely at all these bills, they appear to sub-divide into sub-types. There are money bills, bills that change statute, bills that don't, bills that substitute for other bills etc. etc.
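The classification rule sketched above translates directly into a predicate. A toy version, assuming documents arrive as plain dicts of extracted attributes (all field names here are hypothetical):

```python
def has_bill_ness(doc):
    """The naive 'Bill-ness' test from the bullets above: the word
    'Bill' in the long title, plus a number, a short title and at
    least one sponsor. 'doc' is a hypothetical attribute dict."""
    return ("bill" in doc.get("long_title", "").lower()
            and bool(doc.get("number"))
            and bool(doc.get("short_title"))
            and bool(doc.get("sponsors")))
```

Even this tiny predicate hints at the trouble ahead: the sub-types (money bills, substitute bills...) do not fall out of surface attributes like these, which is where the single dominant decomposition starts to creak.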

There are two primary problems with this "bring out your dead trees" approach in legislatures/parliaments from my perspective.
  • It pre-supposes an acceptable dominant de-composition for each document type
  • It does not take into account the multiple levels of transclusion that can take place during production of the document. Making that transclusion process explicit in the model is oftentimes the key to deriving value from XML-based systems in legislatures/parliaments in my experience.

Let us take each of these in turn.

Pre-supposing an acceptable dominant de-composition for each document type


In many content architecture puzzles, there comes a point where the words in the documents are just that: words. In legislatures/parliaments however, the words are incredibly important because they often speak about the workflows that the documents themselves have already gone through or are about to go through.
Put a bill in front of a layperson and after the classic structured stuff at the front, they will just see lots of words. Put the same bill in front of an attorney steeped in the lore of the law and of the law making process, and they see a completely different information vista. They see the formulaic language signaling a repeal. They know that this is a rescission bill, not because it says so, but because of its impact on the general fund. They know that the various enactment dates have been crafted to ensure that conflicting states are avoided. etc. etc.
Now, how many of the things I just mentioned can be or should be tags in the data model? Here is what I have found in my experiences with legislatures/parliaments:
  • Everything is a candidate for being tagged because nobody wants to risk leaving something out.
  • Communities of interest invariably have different, irreconcilable data models in their world views. A bill really does look different to a drafting attorney.
  • All communities share the belief that their model is the most important and should be foremost in the model.
  • The entity doing the modeling is rarely empowered to disappoint anybody in the room and therefore, all possible world-views are mashed into a single tag soup in which essentially every possible element can occur in every possible context.

This problem is especially serious when the entity charged with doing the modeling is not incentivized to produce a model that will actually work in practice. A common example is the all-too-common pattern of having entity A create the information models and a separate entity B build the actual system. If entity A has no incentive to tackle the tyranny of the dominant decomposition they are unlikely to do so.

Even if a dominant decomposition is agreed upon and the worst excesses of tag-soup avoided, information models often degrade towards tag soup over time. A schema goes into production and a new element is required or perhaps an existing element is required in a new context. The easiest way to accommodate this whilst guaranteeing backwards compatibility is to loosen the content constraints. Do this a few times and your once constrained, hierarchical, validation-enforcing schema has suffered information entropy death.

Grammar-based schemas are not statistical in nature and that is one of their great weaknesses in my opinion. Elements do not all occur with equal probability. Far from it. In fact, many of the truly document-oriented corpora I have looked at have power law distributions for their elements.

The practical upshot of this is that regardless of how long you spend and how many stake-holders you get into the room and how much money you pay for your schema-to-end-all-schemas, 20% of your information elements will account for 80% of your elements-as-used.
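That 20/80 claim is easy to check on any corpus by counting element occurrences. A sketch, with an invented tag distribution standing in for a real document set:

```python
from collections import Counter

def tag_usage_share(tag_stream, top_fraction=0.2):
    """What fraction of all tag occurrences the most-used
    top_fraction of *distinct* tags accounts for."""
    counts = Counter(tag_stream)
    ordered = [n for _, n in counts.most_common()]
    k = max(1, int(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical corpus: a handful of presentational tags dominate,
# while the "semantic" tags barely appear.
tags = (["para"] * 400 + ["heading"] * 120 + ["bold"] * 90 +
        ["italic"] * 60 + ["repeal"] * 5 + ["sponsor"] * 3 +
        ["short-title"] * 2 + ["long-title"] * 2 +
        ["enacting-clause"] * 1 + ["effective-date"] * 1)
print(round(tag_usage_share(tags), 2))  # → 0.76
```

Running this on a real schema's instance documents is a sobering exercise before the next modeling workshop.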

Finally on this point, it never ceases to amaze me how much of that 80% - i.e. the tags actually used - consists of paragraph, heading, bold, italic tags i.e. tags carrying essentially zero classical XML semantics!....Hold that thought. I will be returning to it later in this series.

Insufficient modelling of transclusions


The second problem I have with the classic document analysis in legislatures/parliaments relates to the critically important area of transclusions. Legislatures/parliaments are simply rife with these things! Bills contain statute sections but the statute sections must also stand alone. Amendment lists contain amendments but the amendments themselves must also stand alone. Journals contain votes but the votes themselves must stand alone. Final calendars contain calendars but the calendars must stand alone...

...The fun really starts when you follow the content back to its creation. Bills contain statute sections and the sections stand alone. Fine. Where did the statute sections come from? They came from the bills that enacted them. Ok, what was in the bills? The statute being amended. Ok. Where did that come from? The bills...

Take another example. The Bill Status tells us that something interesting happened to the bill on page 12 of yesterday's journal. Where did the page citation come from? The journal. What was in the journal? The bill, plus other things. When was the bill status updated? As soon as the action happened on the bill. When was the journal produced? 24 hours later. When did the page number get into the Bill Status? A few hours after that.

Final example: something happens on the chamber floor that changes the status of a bill. The event is recorded so that the Bill Status can be updated. But the event must also be recorded in the journal so it is entered in there too. Maybe the change to the bill was that it was referred to a committee. That means that the state of that committee needs to change, which will cause a change to its meeting schedule, which will result in meeting minutes, which will result in messages to the chamber, which will result in entries in the journal...

Around and around it goes. Where it stops...actually it never stops. Legislatures/parliaments are the biggest Hermeneutic circles I have ever encountered. No field of human endeavour – with the possible exception of software engineering (yes, I will be coming back to that) – is more worthy of using Escher's Drawing Hands as its emblem.
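At the textual level, these circular flows are transclusions that can eventually cite back to their own origin. A toy resolver that expands transclusions while refusing to chase a circle forever (document names and contents are invented; "{{name}}" is an assumed transclusion marker, not a real KLISS syntax):

```python
import re

# Hypothetical documents; {{name}} marks a transclusion.
docs = {
    "journal": "Actions of the day: {{vote-17}}",
    "vote-17": "HB2145 referred to {{committee-ways}}",
    "committee-ways": "Committee on Ways and Means",
    # The circular case described above:
    "bill-status": "see {{journal-entry}}",
    "journal-entry": "see {{bill-status}}",
}

def resolve(name, path=()):
    """Expand transclusions recursively, tracking the path of
    documents already being expanded so a circle is detected
    rather than looped over forever."""
    if name in path:
        return f"[circular reference to {name}]"
    return re.sub(r"\{\{(.+?)\}\}",
                  lambda m: resolve(m.group(1), path + (name,)),
                  docs[name])
```

The cycle-detection branch is not a corner case here; in a legislature it is the normal shape of the data.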

"So what?" is perhaps what you are thinking. So the information is complex, the flows are complex and they feed back on themselves. I am dragging you through this because I firmly believe that fully engaging with these complexities is what Legislative Enterprise Architecture needs to be all about if true value is to be derived from legislative IT projects. It is possible to use these feedback loops to generate efficiencies and reduce errors and increase transparency and improve service to Members and do all those things...but you won't get there if you ignore them.

What you will get instead are silos. If you do not take the time to look at all the information flows and all the feedback loops you get silos. I have seen a goodly number of fully XML compliant, state-of-the-art document management systems that are good old-fashioned silos.
  • It is not unusual for journal systems to be developed independently of bill drafting systems even though bills flow into journals.
  • It is not unusual for bill drafting systems to be developed independently of statute codification systems even though bills are the starting point for codification and the codifications end up being the source of statute for new bills.
  • It is not unusual for journal systems to be developed independently of bill status systems even though the journal and the bill status systems need to agree on what happens to bills.


I hope I have convinced you that the overlaps and inter-relationships between the data items in legislatures/parliaments are many and deep. This is not a problem that can be solved by throwing tags around like pixie dust. This is not a domain in which serious value can be extracted from computerization projects without fully engaging with the domain based on what it really is, not on what we might like it to be. Bismarck's sausage machine is not hidden from us but we cannot understand it unless we are willing to look deep into it...

When you do so, what you see is a machine fuelled by its own feedback loops; brimming with time-based citations and time-based transclusions; replete with subtle inter-twinglements between content and presentation; overflowing with cascading event streams. The complexity can be overpowering at first but after you have seen a few of them up close, the patterns emerge and opportunities to leverage the patterns with technology present themselves.

In the next post, I want to start looking at these patterns.