Markup Madness, Part Two: Who’s Afraid of the XML Web?

Tuesday October 09th 2007, 5:24 pm
Filed under: Firefox, World Wide Web
Posted By: Matt

In part one of my apparently very occasional series on markup, we looked at the web’s equivalent of the Odd Couple: HTML and XHTML. Like Oscar Madison, HTML leaves its tags strewn all over the place, expecting the parser to clean up after it. (Hey, your mother doesn’t work here, HTML.) XHTML is a neat freak in the mold of Felix Unger, with its tags vetted against a document schema and nested just so. While the migration path to XML-based markup on the web might be fraught with difficulties, the motivation for taking it should be clear.

Nonetheless, some commenters questioned the value of an XML-based web. Even more controversial was the idea that browsers should reject pages containing ill-formed or invalid markup. Surely displaying something must be better than displaying nothing at all, right?

Actually the XML folks were adamant that XML should follow in SGML’s footsteps by rejecting bad markup. And in fact, the question of whether an XML web has value is the same as that of whether to try to milk something meaningful out of misbehaving pages. As soon as we accept markup that isn’t well-formed or valid, we can be sure that authors won’t bother to fix their mistakes. Don’t believe me? Surf to 99% of web pages and view source. The good news is that if major browser vendors enforced correct markup, authors would rapidly get a clue, just as they have been scrambling to fix pages that don’t play well in Firefox as the latter’s market share has grown.
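
To make the difference in error handling concrete, here is a minimal sketch in Python, using only the standard library. It is an illustration, not a description of how any browser works: `xml.etree.ElementTree` stands in for a strict XML processor, and `html.parser` is a lenient tokenizer rather than a full tag-soup tree builder of the kind found in Gecko or WebKit. The sample markup and the helper names (`parses_as_xml`, `lenient_start_tags`) are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical ill-formed fragment: <b> and <i> overlap, and <p> is never closed.
BAD_MARKUP = "<p><b>bold <i>both</b> italic</i>"
# The same content with the tags properly nested and closed.
GOOD_MARKUP = "<p><b>bold <i>both</i></b> italic</p>"


def parses_as_xml(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # A strict processor stops at the first well-formedness error.
        return False


class TagCollector(HTMLParser):
    """Lenient tokenizer: records every start tag it sees and never rejects input."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def lenient_start_tags(markup):
    """Feed the markup to the lenient tokenizer and return the start tags it saw."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags
```

Here `parses_as_xml(BAD_MARKUP)` returns `False` and `parses_as_xml(GOOD_MARKUP)` returns `True`, while `lenient_start_tags(BAD_MARKUP)` reports `['p', 'b', 'i']` without complaint. The lenient path gives the author no signal that anything is wrong, which is exactly the incentive problem described above.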

The advantages of XML on the web are numerous. Simple text processing à la Perl can be used for more tasks. DOM processing is less confusing because strange invisible tags like tbody don’t get inserted by the parser. You can mix and match XHTML and other XML vocabularies like MathML and SVG.
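
As a rough sketch of what mixing vocabularies buys you, the Python snippet below parses a small XHTML document with embedded MathML and separates the elements by namespace. The two namespace URIs are the standard ones for XHTML and MathML; the sample document and the helper name `local_names_in` are assumptions invented for illustration, and a real application would likely use a fuller DOM or XPath API.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical compound document: XHTML with an inline MathML formula.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>Area of a circle:</p>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <mi>A</mi><mo>=</mo><mi>&#960;</mi><msup><mi>r</mi><mn>2</mn></msup>
    </math>
  </body>
</html>"""


def local_names_in(doc, namespace):
    """Return the local names of all elements in the given namespace, in document order."""
    root = ET.fromstring(doc)
    # ElementTree spells qualified names as "{namespace-uri}localname".
    prefix = "{" + namespace + "}"
    return [el.tag[len(prefix):] for el in root.iter() if el.tag.startswith(prefix)]
```

Calling `local_names_in(DOC, XHTML_NS)` yields `['html', 'body', 'p']`, and `local_names_in(DOC, MATHML_NS)` yields `['math', 'mi', 'mo', 'mi', 'msup', 'mi', 'mn']`. Because every element carries its namespace, a generic XML tool can pick out the math without knowing anything about XHTML.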

By far the most significant argument for XML on the web, however, is the complexity of existing HTML processors, a direct result of the compatibility hacks required to deal with naughty content. One might argue that good, robust HTML processors like Gecko, WebKit and Presto make this issue moot. After all, if I can use a black box to pass me a nice clean DOM, why worry about what manner of malformed mutant markup it vacuumed up to do so?

The discussion around my previous post (along with a post by Robert O’Callahan with enough meat to keep a motivated student of browser technology busy for days) makes the answer abundantly clear: there’s plenty of scope for improving existing user agents, in terms of performance, security, hackability and footprint (static and dynamic). And the cruft that the current generation of HTML engines must accumulate to deal with real-world web markup is a huge barrier to progress.

The original XML working group aimed explicitly to keep the standard simple enough that the average CompSci grad student could write a complete parser in a weekend. Granted, they didn’t quite succeed in their goal, but an XML processor is still orders of magnitude less complex than its HTML equivalent, and that matters.


10 Comments »

  1. “The good news is that if major browser vendors enforced correct markup, authors would rapidly get a clue, just as they have been scrambling to fix pages that don’t play well in Firefox as the latter’s market share has grown.”

    The problem with this argument is that major browser vendors would simply never choose to break 97%+ of existing content, let alone simultaneously, and even if they did, it would not have the effect that you describe. Any browser vendor who did refuse to render “broken” HTML would find that no user would be prepared to use their browser. Even if somehow all vendors decided to simultaneously release super-strict versions of their browsers, the net effect would be users sticking with their current web-compatible browser rather than taking an upgrade that would break the vast majority of the web. No doubt, in time, many sites would be fixed to adhere to the new requirements, but there would still be a great deal of valuable legacy content that would work better in a less strict browser. What are the odds that every browser vendor would stick with the coalition when, by loosening up a bit, they could capture a huge chunk of the market? Indeed, such loosening up need not be intentional; simple bugs in popular UAs can become depended on to the extent that the buggy behaviour must be copied by other UAs. Therein lies the heart of the problem; requiring strictness in content handling is requiring everyone to maintain an unstable equilibrium. It may sound good in theory but it can’t happen in practice.

    “By far the most significant argument for XML on the web, however, is the complexity of existing HTML processors, a direct result of the compatibility hacks required to deal with naughty content.”

    Hopefully that argument has been substantially weakened now that the WHATWG has invested significant effort in documenting the behaviour required by a web-compatible HTML parser. That takes most of the effort out of implementing an HTML parser because you don’t have to come up with your own scheme to deal with the hard issues like residual style. Indeed the general feeling amongst people familiar with both the WHATWG spec and the XML spec is that it is not significantly harder to implement the WHATWG spec than the XML spec (XML has a lot of complexity to do with the DTD, entities in the internal subset, etc.), nor need the result be significantly less performant. Of course an HTML parser still has to do unintuitive things at times, like magically inserting tbody elements, but that is apparently not so difficult to understand that it has significantly affected the popularity of HTML.

    Comment by jgraham — 10/9/2007 @ 7:20 pm

  2. jgraham - I argued that an XML web has advantages over an HTML web, but I purposely avoided the question of how to get from here to there. :-) It’s a fascinating question, and you raise very substantial issues. Perhaps the rise of RSS holds a clue: suddenly a lot of web content is being delivered as valid XML in a kind of parallel web. The question is: how could we extend this trend so that more and more content is available as XML? In the long run, this might lead to a tipping point where content authors can no longer assume that their messy invalid HTML will be viewable by most visitors.

    Regarding WHATWG: I totally agree, this is a great effort and certainly a highly pragmatic approach to a really thorny issue. A true XML web would still be better but even to a dreamer like me that’s years away, so a better HTML web would be a great way to tide us over in the meantime.

    Comment by Matt — 10/9/2007 @ 7:28 pm

  3. jgraham already pointed out the Prisoner’s Dilemma facing browser vendors trying to gain market share. Cooperate with the purity police while IE continues to defect? You lose.

    But I think there are severe usability problems with XML apart from this “how to get there” issue. Micah Dubinko and Oliver Steele have blogged about just the problems with namespace usability, and we see this all the time in SVG content, with E4X users, etc.

    What’s more, I contend that the Web is and will remain a human-crafted artifact, not mostly machine produced in its hypertext content, therefore error correction must be part of normative specs for its main text-like content languages.

    Finally, I dispute both claims in this paragraph: “The good news is that if major browser vendors enforced correct markup, authors would rapidly get a clue, just as they have been scrambling to fix pages that don’t play well in Firefox as the latter’s market share has grown.”

    Too many pages still don’t work with Firefox, and often the page author (a consultant to some bigdumbcompany.com) is long-gone. Sure, this may be a variation on the Prisoner’s Dilemma, but I wanted to point out that your hopeful statement here is over-optimistic, in our experience. The WHATWG work may pay off in a few years and change my reading of reality, but we’re not there yet.

    /be

    Comment by Brendan Eich — 10/9/2007 @ 9:23 pm

  4. Brendan - certainly any vision of an XML web is taking the long view. This isn’t going to happen any time soon.

    The problems with XML namespaces wouldn’t be too hard to fix. This isn’t to say that XML is perfect, but most of its flaws are man-made, not inevitable.

    I’m not sure that I agree about the web continuing to be hand-crafted. More and more pages are coming out of publishing engines: Wordpress, Facebook and Amazon to name a diverse sample. And I would tend to believe that where people are publishing markup by hand, they’ll use some sort of WYSIWYG tool. I know that usable HTML WYSIWYG editors have been a long time coming, but someone has to get it right eventually.

    Comment by Matt — 10/9/2007 @ 10:18 pm

  5. I am seeing people interpret “enforce correct markup” to mean render a blank page with an error message for bad HTML. People seem to think that supporting XML means tossing out HTML. It doesn’t have to be this way. Support both, completely, and if XML really does have a lot of advantages over HTML, the web will naturally become XML because the authors will take the path of least resistance and/or greatest advantage. These rewards must be significant, and significant to authors, not browser manufacturers. They might come from the merits of XML itself, which I am not sure is enough, or more likely from the authoring tools (NVU, etc.) and additional features (SVG, XSLT) that require XML. Unfortunately, many of the rewards are being held hostage by IE.

    IE needs to be fixed to play nicely with HTML served as XML, or XHTML. The other browsers seem to have their MIME types in order, but until the big one does, we won’t see much change. Maybe IE8?

    Comment by mawrya — 10/9/2007 @ 10:34 pm

  6. Matt, I find it interesting that your example of “valid XML” is RSS. Some hard numbers:

    1) About 7% of the feeds Google Reader runs into are not well-formed XML. See . Feed readers apparently vary in how they handle the problem. Firefox doesn’t enforce encoding validity (a bug); not sure about the rest. IE7 at least planned to enforce well-formedness, apparently: see . Not sure whether they stuck by that.

    2) A very large number of (popular) RSS feeds are not served with an XML MIME type. This has necessitated that Firefox sniff _all_ incoming content to determine whether it might be an RSS feed. Dedicated RSS readers ignore the MIME type altogether (treat everything as a feed), which is how we got to this mess.

    If this is the future XML web, I want out. ;)

    Comment by Boris — 10/10/2007 @ 5:27 am

  7. It looks like your blog software eats URIs? Let’s try the HTML markup approach, I guess:

    Google data

    Microsoft reference

    It would be nice if I could avoid having to type that icky HTML, though….

    Comment by Boris — 10/10/2007 @ 5:30 am

  8. WYSIWYG will save us, oh boy.

    First, there are lots of WYSIWYG editors that create invalid markup (some popular SVG ones led to a bugzilla bug asking us to impute xmlns= settings just to interoperate — this while SVG has tiny “web content market share”!).

    Second, so long as the web is alive in the sense that new combinations (mashups) of existing and new content can be cheaply created, whether or not tools are in the loop, you will get copy/paste injection of markup that violates well-formedness; and at the margins you will get (and should want!) hand-tweaking.

    I’ll believe the WYSIWYG utopia when I see it, which may be the same as saying when I kick the bucket (not saying whether it will be heaven or hell ;-).

    /be

    Comment by Brendan Eich — 10/10/2007 @ 6:32 am

  9. Everyone on MySpace wouldn’t be able to browse their own web pages.

    They outnumber XML freaks 1,000,000 to 1. Please get a grip! Open your eyes to reality!

    XML was a nice idea, but I’m a Python coder and I still find XML to be verbose, ill-defined, and mostly useless.

    .csv is the future, baby.

    monk.e.boy

    Comment by monk.e.boy — 10/10/2007 @ 12:20 pm

  10. Boris - fair points about RSS, but I still think it’s a big step in the right direction when compared with the quality of the average HTML on the web.

    Brendan - well you’d have to agree that sites like Facebook (for profile pages) and Amazon (for reviews, comments, lists, etc.), among many others, hide the complexity of authoring HTML from average users. Perhaps we’ll never have a really effective WYSIWYG HTML editor, because WYSIWYG is simply poorly adapted to authoring markup (I don’t believe this to be the case, but I can certainly see the argument), but one future I don’t see happening is the average Joe or Jane authoring HTML by handcrafting tags.

    Comment by Matt — 10/11/2007 @ 10:21 am

This work is licensed under a Creative Commons License