Sunday, February 22, 2009

Draconian measures - again

Avery Pennarun is pondering XML's contribution to the genre of Programmers and Sadomasochism.

For those interested, his post is a useful springboard into this subject.

The use case that sits ill with Postel's Law is the one where I'm sending you the XML message needed to change the parameters of the [nuclear reactor/medical monitor/industrial process/zillion dollar fund transfer exchange] and you get duff XML. Should you be liberal in what you accept?


Philip Taylor said...

If the message is that important to you, you should include a message digest that will detect any unintentional modifications. Otherwise a random bit flip might change your $1,000,000 bank transaction to $2,000,000, and XML's draconian syntax checking won't help at all. Once you've got the guarantees a message digest provides, you don't need the guarantees that XML syntax well-formedness provides.
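Taylor's digest argument can be sketched in a few lines of Python. The key, message, and function names below are made up for illustration; the point is only that an HMAC over the raw bytes catches corruption that well-formedness checking cannot.

```python
import hashlib
import hmac

# Hypothetical shared key; in practice this would be negotiated securely.
SECRET_KEY = b"shared-secret"

def sign(message: bytes) -> str:
    """Return a hex digest to send alongside the message."""
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def verify(message: bytes, digest: str) -> bool:
    """Recompute the digest and compare in constant time."""
    return hmac.compare_digest(sign(message), digest)

msg = b'<transfer amount="1000000" currency="USD"/>'
tag = sign(msg)

assert verify(msg, tag)
# A flipped digit ($1,000,000 -> $2,000,000) is still perfectly
# well-formed XML, but the digest check rejects it:
assert not verify(b'<transfer amount="2000000" currency="USD"/>', tag)
```

Note that the tampered message would sail through any XML parser, draconian or not, which is exactly the point.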

In cases where you don't care that much about integrity, strict input checking often introduces denial of service vulnerabilities. The comments there include some examples where inserting certain characters into a wiki page prevented other users from reading it, because their browser's XML parser found the trivial error and aborted. If a major commercial web site allowed user comments and had the same subtle bug, a user could effectively shut down the site by posting bogus comments. It seems a bad idea to choose a technology that claims that kind of fragility as a feature.
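The failure mode described here is easy to reproduce with any strict XML parser. A minimal sketch, where a stray control character (illegal in XML 1.0) stands in for the wiki-page bug:

```python
import xml.etree.ElementTree as ET

# One bad comment among many good ones; \x0b is not a legal XML character.
page = ("<comments>"
        "<comment>perfectly fine</comment>"
        "<comment>bad \x0b char</comment>"
        "</comments>")

try:
    ET.fromstring(page)
    readable = True
except ET.ParseError:
    # The draconian parser aborts, so *every* comment on the page
    # becomes unreadable, not just the bad one.
    readable = False

assert readable is False
```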

Paddy3118 said...

Avery is just adding to the problem by advocating sloppy practices. He should instead be thinking of his reader as a junk-format-to-XML converter and asking his producer to send valid XML next time.

- Paddy.

kgaughan said...

I wish people would stop misunderstanding Postel's Law like this. The idea of the robustness principle is that when you produce something, you shouldn't assume that consumers will be able to cope with edge cases, uncommonly used constructs, &c.; and that when you consume something, you should understand as many valid inputs as possible, validate against malicious inputs, and fail gracefully when you get something you don't understand.

Anonymous said...

I have to admit that I naturally agree with (what is apparently, according to Mark Pilgrim, only) Tim Bray's position on this.

Reading Mark's piece I was worried that various important, respected and, more importantly, generally thoughtful people seemed to be arguing against it so I felt that I was missing something.

I felt a bit better when I could easily poke holes in the arguments presented by Pennarun, and better still when the chattering classes of ycombinator and reddit (that he links to in his update) seem to be in general agreement with me.

I still have a nagging suspicion I'm missing something from the 'be liberal in what you accept' side and Mark Pilgrim's older blog posts on the topic -- linked from ycombinator -- seem to be offline (to me at least). And anyway the post of Mark's linked from this post seemed a bit of a personal attack.

Could anyone point me to a more reasoned argument, either against XML's fragility when asked to parse non-XML data, or arguing that Postel's law is actually valid in light of the 21st-century internet?

I took a stab myself by Googling "Postel's Law" and "Evolutionary Stable Strategy" together and found this:

kgaughan said...

@Anonymous I think what you might be missing is the element of graceful failure. The problem with XML isn't things like the fact that it requires quotes around attributes. In fact, if it allowed you to omit them and still had the very same error handling model, it'd still be draconian. That's Pennarun's fallacy.

Sometimes draconian error handling is what you want, sometimes it's not, but if a format allows you to choose, it should provide a well-defined model for graceful failure. That's the part of the robustness principle that people keep forgetting.
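A toy sketch of that distinction (the function and its recovery rule are hypothetical, not any real parser's API): the lenient mode is useful precisely because its failure behaviour is documented and deterministic, rather than silently undefined.

```python
def parse_number(text: str, strict: bool = True):
    """Parse a decimal string, returning (value, warnings)."""
    digits = "".join(ch for ch in text if ch.isdigit())
    if digits != text:
        if strict:
            # Draconian mode: any deviation is fatal.
            raise ValueError(f"not a number: {text!r}")
        # Graceful failure: a *defined* recovery rule (drop non-digits)
        # plus an explicit warning, so behaviour is predictable.
        return (int(digits) if digits else None,
                ["non-digit characters dropped"])
    return int(text), []

assert parse_number("42") == (42, [])
assert parse_number("4x2", strict=False) == (42, ["non-digit characters dropped"])
```

Both modes are legitimate; what would be illegitimate is a lenient mode where two implementations may recover differently from the same input.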

Bijan Parsia said...

To put in slogan form:

Be conservative in what you send,
be liberal in what you accept,
don't be naive about what you send or accept.

(Just as you shouldn't blindly trust well-formedness: for critical things you accept, even very conservative sends may need per-transaction measures.)

srittau said...

I think the best example of why draconian parsers are a good idea is actually the web and HTML. HTML parsers have always been quite liberal in what they accept, so writing an HTML parser that copes with what's out there today is hard. If web browsers had rejected invalid HTML early on, this would have been much easier.

So, Avery wrote a lenient XML parser that can cope with missing quotes around attribute values. Now suppose I am writing a very simple consumer that's only interested in the attribute values of certain tags. With syntactically correct XML this is easy to achieve with regular expressions. But if I have to cope with possibly invalid XML, I have to think about all the various edge cases and factor them in, complicating my code and possibly missing certain cases.
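To make srittau's point concrete, here is a hypothetical naive extractor (tag name, pattern, and URLs invented for illustration). The one-line regex is only sound because well-formed XML guarantees that attribute values are quoted:

```python
import re

# Naive sketch: pull the href attribute out of <link> tags,
# relying on the XML guarantee that attribute values are quoted.
pattern = re.compile(r'<link\s+href="([^"]*)"')

valid = '<doc><link href="http://example.com/a"/></doc>'
assert pattern.findall(valid) == ["http://example.com/a"]

# Against Avery-style "XML" with unquoted attributes, the same
# pattern silently misses the value: the edge case feared above.
sloppy = '<doc><link href=http://example.com/a /></doc>'
assert pattern.findall(sloppy) == []
```

The quick-and-dirty consumer doesn't crash on the sloppy input; it quietly returns nothing, which is arguably worse.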

I think there are two fallacies in Avery's example. Had Bob used a draconian parser to test his XML from the beginning, there would not have been a problem today. Not having a draconian parser is what created the problem in the first place. And option 2 sounds great in theory, but in practice nobody guarantees that your permissive parser actually works with Bob's particular variant of "XML". Maybe Bob used an attribute value without quoting it, consisting of two words, while your parser assumes that the space separates two attributes, so the second word is assumed to be another attribute without a value.
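That last ambiguity is easy to demonstrate. A hypothetical permissive parser that splits a tag body on whitespace (names and input invented here) sees two attributes where Bob intended one:

```python
# Bob intended: name="John Smith", but left the quotes off.
tag_body = 'name=John Smith'

def naive_parse(body: str) -> dict:
    """A toy permissive attribute parser that splits on whitespace."""
    attrs = {}
    for token in body.split():
        key, _, value = token.partition("=")
        attrs[key] = value
    return attrs

parsed = naive_parse(tag_body)
# The space is taken as an attribute separator, so "Smith" becomes
# a second, valueless attribute rather than part of the name.
assert parsed == {"name": "John", "Smith": ""}
```

Two permissive parsers could just as plausibly recover differently here, which is the interoperability problem in a nutshell.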