Monday, January 29, 2007

Mixed Content : Trying to understand the JSON thing

Doug Crockford on JSON. I see a lot of this JSON v. XML stuff going on...most of the time, mixed content doesn't get considered in the discussions. I don't see it mentioned in Dave's writing but I could be wrong...

Anyway, I think MIXED CONTENT is verrrry significant and gets to the heart of the difference between XML and ... most other data representation languages.

Here is XML's sweet spot (using square brackets to keep everything un-mungable by the angle-bracket chewers in my current tool chain):

[p]We [b]wish [i]you[/i] and yours[/b] a happy XMas.[/p]

XML, of course, can also do this kind of data (again using square brackets for simplicity):

[person][first]Sean[/first][second]McGrath[/second][/person]

There are a trillion and one ways of representing the latter; most programming languages do it out-of-the-box. Here is one way to do it in Python:

[{'first':'Sean','second':'McGrath'}]

Representing this latter type of data - fielded data - directly in a programming language syntax bypasses a bunch of "XML Situps" that would otherwise be required to marshal it in and out. Such techniques have their place. I use them all the time.
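As a rough sketch of what skipping the situps looks like, here is that same record going out to JSON text and back using Python's standard json module. The variable names are mine, purely for illustration.

```python
import json

# The fielded record from above, as a plain Python structure.
people = [{'first': 'Sean', 'second': 'McGrath'}]

# Marshal out to text and back in again, no XML involved.
wire = json.dumps(people)
restored = json.loads(wire)

print(wire)
print(restored[0]['first'])
```

The round trip hands back an equal structure, and a field lookup is just a dictionary access.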

Handling the former though, requires what XML provides : mixed content.
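To make the difference concrete, here is a small sketch using Python's standard xml.etree.ElementTree (with real angle brackets this time, since the markup sits safely inside a string). The variable names are illustrative only. A naive tag-to-value mapping keeps just one fragment of the sentence; the rest lives in the text and tail slots around the child elements.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p>We <b>wish <i>you</i> and yours</b> a happy XMas.</p>')

# Treating the paragraph as fielded data keeps only one fragment of it.
naive = {child.tag: child.text for child in p}

# The full sentence is spread across text and tail slots, in document order.
b = p.find('b')
i = b.find('i')
pieces = [p.text, b.text, i.text, i.tail, b.tail]

# itertext() walks those same slots and stitches the sentence back together.
sentence = ''.join(p.itertext())

print(naive)
print(pieces)
print(sentence)
```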

If you absolutely, totally, never, ever will need mixed content then there are sane alternatives to XML. There always have been, from humble CSV up to fancier JSON/Python/Ruby direct data expression languages. If you want to use XML but long for an API that mirrors the simplicity of the XML subset you are using, use something that takes an ElementTree-style world view.
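For the fielded person record, an ElementTree-style API gets close to plain field access. A minimal sketch, assuming Python's standard xml.etree.ElementTree and the person example from earlier (angle brackets restored inside the string):

```python
import xml.etree.ElementTree as ET

person = ET.fromstring(
    '<person><first>Sean</first><second>McGrath</second></person>'
)

# One call per field, with no node-type bookkeeping.
first = person.findtext('first')
second = person.findtext('second')

# Or lift the whole record into a dictionary in one go.
record = {child.tag: child.text for child in person}

print(first, second)
print(record)
```

This only works so neatly because every element here holds either child elements or text, never both.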

A huge chunk of the world doesn't need mixed content or even know what it is. They are the folks who look at the XML APIs and wonder "why is this so difficult?", "where is the get_field_value() function?"...

I think we missed a trick early on in the XML days. It's too late now, I suspect. We should have provided some way for an XML document to indicate data-centric content as opposed to document-centric content. That way, tools could switch to "obvious" field-oriented APIs (like RAX) for data-centric applications without losing the powerful enabler of a single unifying syntax for open data representations.
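Absent any such indicator, about the best a tool can do is inspect the document after the fact. Here is one possible heuristic, sketched with Python's standard xml.etree.ElementTree. The function name has_mixed_content is my own invention, not part of any standard or library, and the whitespace rule it uses is an assumption about what counts as "real" text.

```python
import xml.etree.ElementTree as ET

def has_mixed_content(root):
    """Return True if any element holds both child elements and real text.

    Hypothetical helper: whitespace-only text between children is ignored.
    """
    for elem in root.iter():
        children = list(elem)
        if not children:
            continue  # leaf element: text only, so not mixed
        if elem.text and elem.text.strip():
            return True  # text before the first child
        if any(child.tail and child.tail.strip() for child in children):
            return True  # text after one of the children
    return False

data_doc = ET.fromstring(
    '<person><first>Sean</first><second>McGrath</second></person>'
)
prose_doc = ET.fromstring(
    '<p>We <b>wish <i>you</i> and yours</b> a happy XMas.</p>'
)

print(has_mixed_content(data_doc))
print(has_mixed_content(prose_doc))
```

A tool could use a check like this to decide whether a field-oriented view is safe, though it has to read the whole document first, which is exactly what an up-front declaration would have avoided.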

It has always been a source of worry that folks with perfectly good relational data sets have felt compelled by buzz-pressure to put their content into XML - very little gain in the general case.

However, it has also always been a worry that a significant portion of XML's users think it is too complex because they do not, day-to-day, have to handle the mixed content case.

XML is, and always was, a document-centric data representation language. See Mixed Content Myopia.

Now it can be argued that mixed content XML can be finessed into a field-oriented world view of sorts like this:

[p][text]We [/text][b][text]wish [/text][i][text]you[/text][/i][text] and yours[/text][/b][text] a happy XMas.[/text][/p]

True but (a) it sure is hard on the eyes. Maybe that's not a killer problem in these tool-centric days but (b) is a doozy:-

With the text element trick, there is no way to *name* individual text nodes uniquely. get_field_value() doesn't make sense. All in all, it just ain't worth it.
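Here is the naming problem in miniature, again sketched with Python's standard xml.etree.ElementTree and the wrapped paragraph from above (angle brackets restored inside the string). Variable names are illustrative.

```python
import xml.etree.ElementTree as ET

wrapped = ET.fromstring(
    '<p><text>We </text><b><text>wish </text><i><text>you</text></i>'
    '<text> and yours</text></b><text> a happy XMas.</text></p>'
)

# Asking for "the text field" of p silently returns just the first one.
first_only = wrapped.findtext('text')

# There are really two text children at this level, told apart only by position.
top_level = [t.text for t in wrapped.findall('text')]

# The same ambiguity repeats inside b.
inside_b = [t.text for t in wrapped.find('b').findall('text')]

print(repr(first_only))
print(top_level)
print(inside_b)
```

Every piece of text shares the same field name, so position is the only thing distinguishing them, which is mixed content again in a more verbose disguise.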

Hence, mixed content. Hence XML. If you need mixed content you really need it. If you don't need it, sometimes you cannot even conceptualise the problem it solves. And yes, mixed content totally complicates the lives of those who are using XML for data-centric applications.

The "standard" APIs of DOM and SAX handle the general case. They are extremely sub-optimal for the very common data-centric case. We have no current standard way for a document to say which of the two cases it is.
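A small sketch of what that looks like from the DOM side, using Python's standard xml.dom.minidom with the two examples from earlier (angle brackets restored inside the strings). Variable names are illustrative.

```python
from xml.dom import minidom

# Data-centric case: fetching one field means stepping through a text node.
doc = minidom.parseString(
    '<person><first>Sean</first><second>McGrath</second></person>'
)
first = doc.getElementsByTagName('first')[0].firstChild.data

# Document-centric case: the same node model earns its keep, because
# text nodes and element nodes genuinely interleave.
mixed = minidom.parseString(
    '<p>We <b>wish <i>you</i> and yours</b> a happy XMas.</p>'
)
kinds = [node.nodeName for node in mixed.documentElement.childNodes]

print(first)
print(kinds)
```

The machinery that makes the second case workable is the same machinery that makes the first case feel like hard work.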

It would be a shame if this resulted in a "fork" in the road with fielded data, yet again, going off on its own trajectory with document-centric data staying on the XML road. Too much good stuff to be lost that way.

The nut that needs to be cracked to stop this happening is the Mixed Content Case nut.