RDFa and the DOM

by Edward O’Connor on 2 December 2009

Some thoughts on philosophical differences in the markup standards world.

One of the nicest aspects of the HTML5 spec is that it defines HTML in terms of the DOM; the ‘classic’ and XML serializations are just different ways to write down the same document. This is a good thing. I wrote back in April about how, as a web developer[…] I don’t have to care if the browser used an XML parser or an HTML parser, because I can write code that works on the DOM that comes out at the other end. This applies both to browser-based scripting and to outside-of-the-browser tools that process web content.
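To make that concrete, here is a minimal sketch in Python, using only the standard library. It is an illustration under stated assumptions, not how any browser works: TreeBuildingHTMLParser, parse_html and link_targets are names invented for this example, the XML input carries no namespace to keep tag names simple, and html.parser is a lenient tokenizer rather than a conforming HTML5 parser. The point is only that link_targets never learns which parser produced the tree it is handed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# The same small document written down two ways (both strings are made up).
XML_SOURCE = '<p>See <a href="http://example.com/">this page</a>.</p>'
HTML_SOURCE = "<P>See <A HREF='http://example.com/'>this page</A>.</P>"


class TreeBuildingHTMLParser(HTMLParser):
    """Hypothetical helper: feeds HTML tokenizer events into an ElementTree builder.

    It only copes with well-nested input; it does not implement the HTML5
    tree-construction rules.
    """

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # html.parser has already lowercased the tag and attribute names.
        self._builder.start(tag, {name: value or "" for name, value in attrs})

    def handle_endtag(self, tag):
        self._builder.end(tag)

    def handle_data(self, data):
        self._builder.data(data)

    def finish(self):
        self.close()                   # flush any buffered text
        return self._builder.close()   # returns the root element


def parse_html(source):
    parser = TreeBuildingHTMLParser()
    parser.feed(source)
    return parser.finish()


def link_targets(root):
    """Works on the tree alone; it has no idea which parser built it."""
    return [a.get("href") for a in root.iter("a")]


tree_from_xml = ET.fromstring(XML_SOURCE)
tree_from_html = parse_html(HTML_SOURCE)

print(link_targets(tree_from_xml))
print(link_targets(tree_from_html))
```

Both calls should report the same single link, even though one input used uppercase tags and single-quoted attributes and went through a different parser.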

Other technologies intended to be Web standards, such as RDFa, use a different model. RDFa was defined in terms of text and not the DOM or the XML Infoset. This wasn’t a problem when RDFa was constrained to live in an XML world, but now that Manu et al. are specifying how to process RDFa in HTML, this basic disconnect is rearing its ugly head. Consider this exchange between Shane McCarron and Henri Sivonen on the RDF in XHTML mailing list. Shane wrote (in reply to Maciej):

I think that your suggested text is correct when talking about the DOM and Infoset processing. But the processing rules in section 5.5 are not written from a DOM or Infoset perspective—at least not exclusively nor intentionally. We really, really, really were talking about the syntax and then the extraction of data from structures that conform to that syntax.

Henri replied (emphasis mine):

I think it’s a fundamental spec writing error to specify RDFa processing in terms of syntax as opposed to defining it in terms of the data structure abstractions that HTML parsers and XML processors output.

An HTML parser or an XML processor sees the bits that come from the wire. Assuming that you intended an RDFa processor to be layered on top of an HTML parser or an XML processor, the RDFa processor never gets to see the bits on the wire. It gets to see the output of the HTML parser or the XML processor. Therefore, it’s wrong to define the behavior of the RDFa processor in terms of bits on the wire and it would be correct to define it in terms of the output data structure abstractions of HTML parsers (namespace-aware DOM) or XML processors (the Infoset)[…]

Alternatively, if an RDFa processor were defined to operate on the bits on the wire, RDFa shouldn’t give the impression that it’s layered on top of XML or HTML. Instead, it should define everything from the bits on the wire up and conspicuously warn implementors that they shouldn’t try to use off-the-shelf XML processors or HTML parsers. (But that would be fundamentally bad, too.)

This isn’t just a difference of opinion between some HTML5 folk and some RDFa folk. There’s a broader disconnect at work. Recall Hixie’s lunch with the TAG back in June 2007 (emphasis mine):

In other news, I met the TAG for lunch today. It was an interesting experience. I still don’t really understand what they’re doing. At one point I asked about one of their documents and pointed out that for most specs things have to be defined in terms of document models, not the actual byte streams coming over the wire. For example, for HTML you have to define what happens when an author creates a table element using the DOM APIs and then moves a p element into it — what does that represent. There’s no source byte stream, it’s all scripted. The members of the TAG seemed to think that was a little more complicated than they wanted to deal with. That was sort of strange to me, since I somewhat consider that to be the only interesting case (in the HTML5 spec, the byte stream, if any, is converted to a DOM before any of the things that they were talking about are examined). Oh well.
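The scripted case Hixie describes is easy to reproduce outside a browser. The following is a rough sketch using Python's xml.dom.minidom as a stand-in for a browser DOM (an assumption made for illustration; child_tags is a helper invented here). The tree is built and rearranged entirely through DOM calls, so there is no source text for a syntax-based processing rule to refer to.

```python
from xml.dom.minidom import getDOMImplementation

# Create an empty document whose root element is <body>; nothing is parsed.
doc = getDOMImplementation().createDocument(None, "body", None)
body = doc.documentElement

# Build a paragraph and a table purely through DOM calls.
p = doc.createElement("p")
p.appendChild(doc.createTextNode("No bytes were parsed to make this."))
body.appendChild(p)

table = doc.createElement("table")
body.appendChild(table)

# Appending a node that already has a parent moves it: p leaves body
# and becomes a child of table.
table.appendChild(p)


def child_tags(node):
    """Hypothetical helper: tag names of a node's element children."""
    return [c.tagName for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


print(child_tags(body))
print(child_tags(table))
```

Any spec that wants to say what this table "represents" has only the tree to go on.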

I should note that Manu Sporny, after discussing the matter in detail with Henri, agreed to produce an HTML5+RDFa draft with Infoset processing details. I had written most of the above several months ago, before he started this work. Thanks for the update, Manu.

Language formalism & error recovery

There’s an almost complete overlap between the folk who like to talk about things in terms of syntax, and the folk who prefer draconian (or unspecified) error handling over the HTML5 approach of specifying how to pull a DOM out of any bytestream. I think this has to do with a reliance on language formalisms.

XML as a language is formally defined as the set of documents which conform to its syntax rules. If a document fails to conform to said rules, it isn’t XML, full stop. On this view, it’s pretty much nonsensical to talk about error recovery because, if it’s XML, there are by definition no errors to recover from.

I come at things from a different viewpoint. It seems to me that a document that is well-formed XML except for one unescaped ampersand is still XML; it’s just not well-formed XML. On this view, specifying sensible error recovery options for such documents is pretty obviously the way to go.

This is the approach taken by HTML5: if a bytestream is served as text/html, then it is HTML. It might not be valid HTML, it might not even resemble HTML all that much, but it is HTML nevertheless and should be processed as such.
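The two attitudes are easy to see side by side. Below is a small sketch using Python's standard library, with xml.etree.ElementTree standing in for a draconian XML processor and html.parser standing in for a forgiving one (it is lenient, but it is not an implementation of the HTML5 parsing algorithm). SOURCE, strict_text, TextCollector and forgiving_text are names made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Well-formed XML except for one unescaped ampersand.
SOURCE = "<p>Fish & chips</p>"


def strict_text(source):
    """Draconian handling: any well-formedness error means no result at all."""
    try:
        return ET.fromstring(source).text
    except ET.ParseError:
        return None


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text content the lenient parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def forgiving_text(source):
    """Recovering handling: a stray ampersand is treated as literal text."""
    collector = TextCollector()
    collector.feed(source)
    collector.close()
    return "".join(collector.chunks)


print(strict_text(SOURCE))
print(forgiving_text(SOURCE))
```

The strict path yields nothing for this input, while the forgiving path still recovers the text the author plainly meant.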

There’s a similar philosophical difference in the Lisp world. Common Lisp is defined in terms of its own object model; Lisp programs are trees of Lisp data. The Lisp printer and reader, both programmable, provide serialization and deserialization to the characters-on-disk format programmers actually edit. Scheme, on the other hand, is defined in terms of textual syntax (see R5RS §7 for the gory details). This is just one of the ways that Scheme favors its Algol heritage over its Lisp ancestry.

To dig my grave deeper with more Lisp/markup analogizing: (some) Schemers complain about the size and complexity of the Common Lisp spec in much the same way, and for many of the same reasons, as people complain about the size of the HTML5 spec today. See KMP’s “dpANS Common Lisp: Don’t Judge a Spec by its Cover!” Of course, Scheme is a minimalist, academic language with almost no industry adoption (think XHTML2), whereas Common Lisp is a “batteries-included” language for practical development (think HTML5).

Joe Marshall once said that arguing over Lisp and Scheme is like arguing over Guinness or Murphy’s when everyone else is drinking Bud and Miller, so I should probably stop now.