Friday, September 10, 2010

Sustainable data.gov initiatives in 4 easy steps

(Note: this post is largely directed at government agencies, and businesses working with them, on data.gov projects.)

At the recent Gov 2.0 Summit, Ellen Miller expressed the concern that the data transparency initiative of the Obama administration has stalled. Wandering the halls at the conference, I heard some assenting voices. Concerns that there is more style than substance. Concerns about the number of data sets, the accuracy of the data, the freshness of the data and so on.

Having said that, I heard significantly more positives than negatives about the entire data.gov project. The enthusiasm was often palpable over the two day event. The vibe I got from most folks there was that this is a journey, not a destination. Openness and transparency are the result of an on-going process, not a one-off act.

These folks know that you cannot simply put up a data dump in some machine readable format and call your openness project "done". At least at the level of the CIOs and CTOs, my belief is that there is a widespread appreciation that there is more to it than that. It is just not that simple. It will take time. It will take work, but a good start is half the work and that, in my opinion, is what we have right now: a good start.

I have been involved in a number of open data initiatives over the years in a variety of countries. I have seen everything from runaway successes to abject failures and everything in between. In this post, I would like to focus on 4 areas that my experiences lead me to believe are critical to convert a good start into a great success story in open data projects.

1 - Put some of your own business processes downstream of your own data feeds


The father of lateral thinking, Edward de Bono, was once asked to advise on how best to ensure a factory did not pollute a river. De Bono's solution was brilliantly simple. Ensure that the factory takes its clean water *downstream* from the discharge point. This simple device put the factory owners on the receiving end of whatever they were outputting into the river. The application of this concept to data.gov projects is very simple. To ensure that your organization remains focused on the quality of the data it is pushing out, make sure that your own internal systems consume it.

That simple feedback loop will likely have a very positive impact on data quality and on data timeliness.

2 - Break out of the paper-oriented publishing mindset


For much of the lifetime of most government agencies, paper has been the primary means of data dissemination. Producing a paper publication is expensive. Fixing a mistake after 100,000 copies have been printed is very expensive. Distribution is time consuming and expensive...

This has resulted – quite understandably – in a deeply ingrained "get it right first time" publishing mentality. The unavoidable by-product of that mindset is latency. You check, you double check, then you check again...all the while the information itself is sliding further and further from freshness. Data that is absolutely perfect - but 6 months too late to be useful - just doesn't cut it in the Internet age of instantaneous publishing.

I am not for a minute suggesting that the solution is to push out bad data. I am however suggesting that the perfect is the enemy of the good here. Publish your data as soon as it is in reasonable shape. Timestamp your data into "builds" so that your customers know what date/time they are looking at with respect to data quality. Leave the previous builds online so that your customers can find out for themselves what has changed from release to release. Boldly announce on your website that the data is subject to ongoing improvement and correction. Create a quality statement. When errors are found – by you or by your consumers – they can be fixed with very little cost and fixed very quickly. This is what makes the electronic medium utterly different from the paper medium. Actively welcome data fixes. Perhaps provide bug bounties in the same way that Don Knuth does for his books. Harness Linus's Law, the maxim Eric Raymond named for Linus Torvalds, that "given enough eyeballs, all bugs are shallow" to shake bugs out of your data. If you have implemented point 1 above and you are downstream of your own data feeds, you will benefit too!
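To make the "builds" idea concrete, here is a minimal sketch of what comparing two timestamped releases might look like. Everything in it is an assumption made for illustration: the function names make_build and diff_builds, the record keys and the values are all hypothetical, and a real agency feed would have its own formats and tooling.

```python
from datetime import datetime, timezone


def make_build(records, built_at):
    """Wrap a snapshot of records with the timestamp of the build."""
    return {"built_at": built_at.isoformat(), "records": dict(records)}


def diff_builds(old, new):
    """List the record keys added, removed or changed between two builds."""
    old_r, new_r = old["records"], new["records"]
    return {
        "added": sorted(k for k in new_r if k not in old_r),
        "removed": sorted(k for k in old_r if k not in new_r),
        "changed": sorted(k for k in new_r if k in old_r and new_r[k] != old_r[k]),
    }


# Two hypothetical releases of the same data set, one week apart.
build_1 = make_build(
    {"site-001": 12.5, "site-002": 8.0},
    datetime(2010, 9, 1, tzinfo=timezone.utc),
)
build_2 = make_build(
    {"site-001": 12.7, "site-003": 4.1},
    datetime(2010, 9, 8, tzinfo=timezone.utc),
)

changes = diff_builds(build_1, build_2)
print(build_1["built_at"], "->", build_2["built_at"])
print(changes)
```

Because the earlier build stays online, any consumer can run this sort of comparison for themselves rather than relying on the publisher to describe what changed.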

3 - Make sure you understand the value exchange


Whenever A is doing something that will benefit B but A is taking on the costs, there must be a value exchange for the arrangement to be sustainable. Value exchange comes in many forms:
- An entity A may provide data because it has been mandated to do it. The value exchange here is that the powers-that-be will smile upon entity A.
- An entity A may provide data out of a sense of civic duty. The value exchange here is that A actively wants to do it and receives gratification – internally or from peers – from the activity.
- An entity A may provide data because entity B will return the favor.
- And so on.

One of the great challenges of the public sector all over the world is that inter-agency data exchanges tend to put costs and benefits into different silos of money. If agency A has data that agency B wants, why would agency A spend resources/money doing something that will benefit B? The private sector often has similar value exchange problems that get addressed through internal cross-billing, i.e. entity A sees value in providing data to B because B will "pay" for it in the internal economy.

If that sort of cross-billing is not practical – and in many public sector environments, it is not – there are a number of alternatives. One is reciprocal point-to-point value exchange, i.e. A does work to provide data to B, but in return B does work to provide data that A wants. Another – and more powerful model in my opinion – is a data pool model. Instead of creating bi-lateral data exchange agreements, all agencies contribute to a "pool" of data on a sort of "give a penny, take a penny" basis, i.e. feel free to take data but be prepared to be asked by the other members of the pool to provide data too.

In scenarios where citizens or the private sector are the consumers of data, the value exchange is more complex to compute as it involves less tangible concepts like customer satisfaction. Having said that, the Web is a wonderful medium for forming feedback loops. Unlike in the paper world, agencies can cheaply and easily get good intelligence about their data from, for example, an electronic "thumbs up/down" voting system.

The bottom line is that value exchanges come in all shapes and sizes but, in order to be sustainable, I believe a data.gov project must know what its value exchange is.

4 - Understand the Government to Citizen dividend that comes from good agency-to-agency data exchange


In the last point, I have purposely emphasized the agency-to-agency side of data.gov projects. Some may find that odd. Surely the emphasis of data.gov projects should be openness and transparency and service to citizens?

I could not agree more, but I believe that the best way to serve citizens and businesses alike is to make sure that agency-to-agency data exchange functions effectively too.

Think of it this way: how many forms have you filled in with information you previously provided to some other agency? We all know we need an answer for the "we have a form for that" phenomenon but I believe the right answer is oftentimes not "we have an app for that" but rather "there is no app, and no form, for that data because it is no longer necessary for you to send it to us at all".

Remember: The best Government form is the form that disappears in a puff of logic caused by good agency-to-agency data integration.

In summary


1 - Put yourself downstream of your own data.gov initiative
2 - Break out of the paper-oriented "it must be perfect" mindset
3 - Make sure you understand the value exchanges. If you cannot identify one that makes sense, the initiative will most likely founder at some point and probably sooner than you imagine
4 - When government agencies get their agency-to-agency data exchange house in order, better government-to-citizen and government-to-business data exchange is the result.

Thursday, September 09, 2010

The Web of Data and the Strata conference

I'm a grizzled document-oriented guy but I'm not blind to the amazing potential of numerical data on the Web. I do not think it is an exaggeration to say that in years to come, data volumes on the web-o-data will be, in order of size: multi-media data, numerical data and then text. Text will bring up the rear. A distant third behind numerical data, which in turn will be some distance behind multimedia data.

That is what I think the volume graph will look like but in terms of business value, I suspect a very different graph will emerge: numerical data - text data - multi-media data. In that order.

In blunt, simple terms, there is serious money in numbers and number crunching. As more and more numerical data becomes available on the web and is joined by telemetry systems (e.g. smart-grid) generating vast new stores of numerical data, we are going to see an explosion of new applications. I had the good fortune to be involved in the early days of Timetric and they have now been joined by a slew of companies working on innovative new applications in this space.

At the Gov 2.0 conference that has just ended, I had the opportunity to talk to my fellow survivor of the gestation of XML, Edd Dumbill of O'Reilly, who is involved in the Strata conference. Edd really gets it and I look forward to seeing what he pulls together for the Strata conference. Exciting times.

Wednesday, September 08, 2010

The Semantic Web is not a data format

Surfing around today, I get the impression that some folk believe that the Semantic Web is a data format question. In my opinion, it isn't. It is an inference algorithm question. Data is just fuel to the engine. If we get sufficient value-add through the inference algorithms - the engines - the data format questions will fall like so many skittles. Deciding on a data format is, compared to the problem of creating useful inference engines, trivial.

Of course, to create an environment where clever inference algorithms can be incubated, you need a web of data but that is the petri dish for this grand experiment - not the experiment itself.

When I characterize the effort as an "experiment" I mean that it is not yet clear (at least to me) if the Semantic Web will usher in a new class of algorithms that provide significantly better inference value-add over the algorithmic approaches of the weak/strong AI community of the eighties. E.g. Forward chaining, Backward chaining, Fuzzy logic, Bayesian inference, Blackboard algorithms, Neural nets, probabilistic automata etc.

If it does, then great! The Semantic Web will be a new thing in the world of computer science. If it doesn't, the absolute *worst* that can happen is that we end up with a great big Web of machine readable data because of all the data format debates :-)

Even if the algorithms end up staying much as they were in the Eighties, we will see more interesting outputs when they are applied today because of the richness and the volume of data becoming available on the Web. However, that does not constitute a new leap forward in computer science. In my opinion, this is the sticking point for many who are dubious about the brouhaha surrounding the Semantic Web.

I've never met anybody who thinks a web of machine readable data is a bad idea. I have met people who think the web-o-data *is* the semantic web. I have also met people who think that the semantic web is all about the inference performed over the data.

Of course, there are many out there who characterize the Semantic Web differently and one of the great sources of debate at the moment is that people find themselves passing each other at 30,000 feet because they do not have a shared conceptual model of what critical terms like "web of data", "semantics", RDF, SPARQL, deductive/inductive logic etc. mean.

Part of the problem no doubt is that many approaches to machine readable semantics involve the creation of declarative syntaxes for use in inference engines. These data formats are really "config files" for inference engines as opposed to discrete facts (such as RDF triples) to be processed by inference engines. Ontologies are a classic example.
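The distinction between facts and engine "config" can be made concrete with a toy forward chainer. This is a minimal sketch under stated assumptions, not how any real Semantic Web toolkit works: the triples are plain Python tuples rather than real RDF, the predicate names "type" and "subClassOf" only loosely echo RDFS, and the identifiers (hb2001, Statute and so on) are invented for illustration.

```python
def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new triples are derived."""
    known = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(known)
        if derived <= known:  # fixed point reached: nothing new was inferred
            return known
        known |= derived


def subclass_transitivity(triples):
    """If A subClassOf B and B subClassOf C, then A subClassOf C."""
    return {
        (a, "subClassOf", c)
        for (a, p1, b) in triples if p1 == "subClassOf"
        for (b2, p2, c) in triples if p2 == "subClassOf" and b2 == b
    }


def type_inheritance(triples):
    """If X type B and B subClassOf C, then X type C."""
    return {
        (x, "type", c)
        for (x, p1, b) in triples if p1 == "type"
        for (b2, p2, c) in triples if p2 == "subClassOf" and b2 == b
    }


# Discrete facts: the "fuel".
facts = {
    ("Statute", "subClassOf", "LegalDocument"),
    ("LegalDocument", "subClassOf", "Document"),
    ("hb2001", "type", "Statute"),
}

# The rules play the part of the engine's "config".
closure = forward_chain(facts, [subclass_transitivity, type_inheritance])
for triple in sorted(closure - facts):
    print(triple)
```

The facts never change shape here. All of the inferred triples come from the rules handed to the engine, which is why it helps to keep the two kinds of "data" apart when debating formats.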

My personal opinion: if the Semantic Web proponents were to stand up and say "Hey, there was all this amazing computer science done in the Eighties but there was never a rich enough set of machine readable facts for it to flourish...Let's give it another go!", I'd be shouting from the rooftops in support.

However, I tend not to hear that. Perhaps it's the circles I move in? Most of what I hear is "The Semantic Web is a brand new thing on this earth. Come join the party!"

The CompSci major in me has trouble with that characterization. It's not universal but it does seem quite pervasive.

Yes, it is ironic that the stumbling block for the semantic web is establishing the semantics of "semantics" :-)

Yes, I derive too much pleasure from that. It goes with the territory.

Monday, September 06, 2010

History in the context of its creation

Tim O'Reilly's Twitter feed pointed me at this great piece on historiography.

I just love the 12 volume set of the evolution of a single Wikipedia entry. In KLISS we take a very historiographic approach to eDemocracy.

The primary difference between the way we do it and the Wikipedia model is that we record each change - delta - as a delta against the entire repository of content: not just against the record modified. Put another way, we don't version documents. We version repositories of documents.

In legislative systems, this is very important because of the dense inter-linkages between chunks of content. To fully preserve history in the context of its creation, you need to make sure that all references are "point-in-time" too. I.e. if you jump back into the history of some asset X and it references asset Y, you need to be able to follow the link to Y and see what it looked like *at the same point in history*.
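A toy sketch may help show why repository-level versioning gives point-in-time links for free. It is emphatically not the KLISS implementation: the Repository class, its method names and the asset names are all made up for illustration, and it copies a full snapshot per revision where a real system would store deltas.

```python
class Repository:
    """Toy repository in which every change yields a new revision of the whole repository."""

    def __init__(self):
        self._revisions = [{}]  # revision 0 is the empty repository

    def commit(self, path, content):
        """Apply one change and return the number of the new repository revision."""
        snapshot = dict(self._revisions[-1])  # full copy, for simplicity only
        snapshot[path] = content
        self._revisions.append(snapshot)
        return len(self._revisions) - 1

    def read(self, path, revision):
        """Read an asset as it stood at the given repository revision."""
        return self._revisions[revision][path]


repo = Repository()
r1 = repo.commit("statute-y", "Statute Y, original text")
r2 = repo.commit("bill-x", "Bill X, which refers to statute-y")
r3 = repo.commit("statute-y", "Statute Y, amended text")

# Following the link from bill-x to statute-y at the revision where
# bill-x was written shows statute-y as it was then, not as it is now.
print(repo.read("statute-y", r2))
print(repo.read("statute-y", r3))
```

Since a revision number identifies the state of every asset at once, a reference from X to Y needs no version of its own. Reading Y at the same revision as X is enough.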

Obviously, this is only practical for repositories with plausible ACID semantics. I.e. each modification is a transaction. It would be great if the universe was structured in a way that allowed transaction boundaries for the Web as a whole but of course, that is not the way the Universe is at all :-) - And I'm not for a minute suggesting we should even try!

Having said that, versioning repositories is a darned sight more useful than versioning documents in many problem domains : certainly law and thus eDemocracy which is my primary interest i.e. facilitating access, transparency, legislative intent, e-Discovery, forensic accounting and the like.

The fact that supporting these functions entails a fantastic historical record - a record that future historians will likely make great use of - is, um, a happy accident of the history we are currently writing.