a question of semantics

While the overall discussion is populated by far more knowledgeable persons than myself, I’m going to poke my nose into it anyway.

The discussion revolves around issues with current syndication formats (see also: content distribution services), namely RSS and RDF.

Since I can’t say that I fully grokked all the nonsense about what brought everyone to this juncture, my first response was, “WTF?!”. This was swiftly followed by, “There’s no better time to find out.” I have since spent the last few days wading through books, specs, and expert opinions on the subject. While data input does not equate to data absorption, and certainly doesn’t elevate me to expert status, I did stay at a Holiday Inn last night.

My thoughts are as follows:

Anil makes a nice point but…

…it’s ultimately unworkable. While it certainly is possible [1], it adds a whole new layer of complexity for people who, as Mark Pilgrim put it so tactfully, “…can’t even spell XHTML.” The average blogger’s efforts, and this is not meant to be demeaning, are concentrated elsewhere. (X)HTML authoring is not his/her primary concern. In most cases, the bulk of the authoring has been transferred to popular blogging tools, tools which also currently author and generate these syndication formats without asking the blogger to lift a finger. To ask the average blogger instead to accept and shoulder this responsibility (and it would require this), as well as confine him/herself to a given format, is, in my opinion, an unrealistic expectation. You think your feeds are broken now? Wait until you start scraping all those non-validating XHTML pages.

The tools themselves should be taking on more responsibility.

CMSs/Blogging tools are evolving daily, so why aren’t people looking to them to properly generate their syndication formats? Why should the average blogger have to worry about whether or not his/her blogging tool of choice is generating a proper XML/RSS/RDF document? A better question than some of the ones being asked is, if we can scrape an XHTML page and transform it into RSS/RDF, why aren’t the tools scraping the content prior to output, and turning any (X)HTML in an entry into valid XML/RSS/RDF? The only requirement put upon them at the moment is to meet structural validity. It is up to the user to make sure his/her output is well-formed enough to be useful and either stripped of (X)HTML elements, or properly marked as CDATA [2].
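As a rough illustration of what a tool could do on the blogger's behalf before output, the sketch below escapes the plain-text fields of an entry and wraps its (X)HTML body in a CDATA section. The function names (`escapeXml`, `toCdata`, `buildItem`) and the shape of the `entry` object are hypothetical and not taken from any particular blogging tool.

```javascript
// Escape the five characters that are special in XML text and attributes.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Wrap markup in a CDATA section. A literal "]]>" inside the content would
// end the section early, so it is split across two adjacent sections.
function toCdata(markup) {
  return "<![CDATA[" + String(markup).split("]]>").join("]]]]><![CDATA[>") + "]]>";
}

// Build one RSS <item>. The entry shape (title, link, html) is an assumption
// made for this example.
function buildItem(entry) {
  return [
    "<item>",
    "  <title>" + escapeXml(entry.title) + "</title>",
    "  <link>" + escapeXml(entry.link) + "</link>",
    "  <description>" + toCdata(entry.html) + "</description>",
    "</item>"
  ].join("\n");
}
```

This only keeps the feed itself well-formed; it does nothing to repair sloppy markup inside the entry, which still arrives at the aggregator exactly as the blogger typed it.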

Why not create a syndication interface, where a valid document can be created on the fly, based on criteria input by the user? A couple of quick questions regarding what data is to be output and in what format, and the user never has to see an RSS/RDF template. In fact, taking the whole idea to its logical conclusion, why not build a CMS/Blogging tool that is based entirely on XML? A tool that requires the user to input only data, no (X)HTML. It would require a complete rethink of the typical blogging interface, but ultimately it opens the door to transformation of data into whatever format desired.

Why is everyone stuck on using RSS/RDF?

RSS seems to be doing what it was never designed to do. It’s been adapted, adopted/co-opted, hacked at, and extended, all the while having to remain an ultra-liberal framework. It was ideally suited to its original purpose of syndication: syndication in the form of headlines and brief descriptions of the latest topics from a website. Weblogs have pretty much blown this concept out of the water, and it’s been a scramble to bring RSS up to the task. What RSS has the unenviable task of doing now is finding ways to completely repurpose content in all its varying degrees. My question is, why all the focus on RSS?

“Blogging” is so all-encompassing in scope that it’s hard to define what exactly it is, let alone create a framework that breaks it down into bite-sized chunks of data. For the most part, the real meat and potatoes of the blog, the entry, is stuffed inside either the description element (which seems to me so semantically incorrect it’s funny), or in content:encoded where it’s treated as the bastard child of the document, since no one really seems to know what else to do with all this mixed content. I keep wondering why no one seems to be tackling the job of transforming it when/where necessary. After all, we’re after metadata here, aren’t we? I would have thought something along the lines of TEI would be ideal as the basis for encoding with an eye towards syndication, since blogs, if they are anything, are a narrative. It certainly seems more appropriate to me to think about properly encoding the data first, and then syndicating it second. The current arguments seem to be rallying around RSS because it’s this great syndication format that everyone is already using, but am I wrong when I say it’s just a format? I could syndicate my ass if it was an application of XML that was understood by an aggregator. All it takes is agreeing on a common format (convention) for communication. [ed. note: Speaking of aggregators, the HPANA confounded us when we went browsing for aggregators. I’m happy they took the time to point out that IE5.5 has flaws, but we feel that time might have been better spent perfecting the site’s CSS.]

Metadata, Auto-Discovery and Aggregators

There’s some interesting stuff going on at the WMDI, and I’ve been thinking that marking up your weblog with metadata like this is an interesting possibility for aggregators to auto-discover feeds and subscribe while searching. I don’t see the point in encoding more data than an application would need to discover where it is and where it needs to go to get data. Search engines could be directed to content in the same manner. Instead of scraping entire sites, they could be pointed to XML feeds via metadata where they could soak in raw data without extraneous formatting. Similarly, the content of Flash sites could be indexed and accessed as an XML feed by non-Flash-enabled applications.
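To make the auto-discovery idea a little more concrete, here is a minimal sketch of how an aggregator might pull feed locations out of a page's metadata. It assumes the `<link rel="alternate">` convention for pointing at feeds, which is one existing approach and not necessarily what the WMDI proposes. The function names and the list of feed types are choices made for this example, and the regular expressions are a simplification that a real aggregator would replace with a proper parser.

```javascript
// MIME types treated as feeds in this sketch (an assumption).
var FEED_TYPES = ["application/rss+xml", "application/rdf+xml"];

// Collect quoted attributes from a single tag into a plain object.
function parseAttributes(tag) {
  var attrs = {};
  var re = /([a-zA-Z:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  var match;
  while ((match = re.exec(tag)) !== null) {
    attrs[match[1].toLowerCase()] = match[2] !== undefined ? match[2] : match[3];
  }
  return attrs;
}

// Return the feeds advertised by <link rel="alternate"> elements in a page.
function findFeedLinks(html) {
  var tags = html.match(/<link\b[^>]*>/gi) || [];
  var feeds = [];
  for (var i = 0; i < tags.length; i++) {
    var attrs = parseAttributes(tags[i]);
    var rels = (attrs.rel || "").toLowerCase().split(/\s+/);
    var type = (attrs.type || "").toLowerCase();
    if (rels.indexOf("alternate") !== -1 && FEED_TYPES.indexOf(type) !== -1 && attrs.href) {
      feeds.push({ href: attrs.href, type: type, title: attrs.title || "" });
    }
  }
  return feeds;
}
```

The point is that the application reads one small, predictable piece of metadata and then goes straight to the feed, without having to scrape the rest of the page.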

Worlds within worlds I tell ya’. Then again, I’m sure the people that dreamt this stuff up see it much more clearly than I do.

Related Reading:

XHTML For Syndication





  1. Site Summaries in XHTML/HyperRDF – an idea first introduced by Dan Connolly on the RDF Interest group mailing list, Tue, 21 Mar 2000
  2. The Next Logical Step for RSS – Timothy Appnel

I am the survey and so can you

The survey for people who make websites

Master and Slave Considered Harmful

Let’s play a little game I like to call Senseless Acts of Political Correctness.

LOS ANGELES, California (Reuters) — Los Angeles officials have asked that manufacturers, suppliers and contractors stop using the terms “master” and “slave” on computer equipment, saying such terms are unacceptable and offensive.

Now you try.

ppk spanks ALA and other picture postcards

In a scathing article, ppk criticizes both ALA and Christian Heilmann for the ALA article JavaScript Image Replacement.

On a lighter note, in case you haven’t seen it—Amazing 3D sidewalk paintings [via mefi via 37signals].

the Matrix Unplugged

Everything that has a beginning has an end. Well, this ending was deeply disappointing.

I was prepared to see the best of the trilogy. What I saw was filler that didn’t answer my questions, didn’t tie up what I saw as loose ends, and didn’t serve the storyline of the main characters, which were the ones I had a vested interest in.

Minor spoilage ahead

Morpheus needn’t have bothered showing up for Revolutions. This character, this main character was about as useful to the story as the woman in the red dress from the first picture. He was there because he had to be. Because it wouldn’t have been the Matrix without Morpheus.

As for Neo and Trinity—I didn’t care for where their stories went either. To my mind it was an injustice to who these characters are, and the paths that brought them here. Actually, that goes for all the main characters. It was a travesty what Revolutions did with all the main characters. There was more screen-time and plot devoted to minor and newly-introduced characters than there was to the main four.

There’s even a scene in Revolutions that’s right out of The Perfect Storm. I was flabbergasted.

Unless you’re a die-hard, and really need to see this movie, expatriate yourself from the Matrix.

God, I’m so disappointed.


We likey the new Quirksmode. Thank you PPK.

pull quotes w/the DOM

Just a little DOM experiment in pulling a section of text out of a paragraph and making it a pull quote. What it does is use the EM element with a particular class to flag the section of text you want pulled, and then it formats it within a BLOCKQUOTE and inserts it before the paragraph it was pulled from.
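For readers who want the gist without opening the experiment, the approach described above can be sketched roughly as follows. This is a reconstruction from the description and not the original script: the function name `addPullQuotes`, the `pullquote` class given to the BLOCKQUOTE, and the choice to pass the document in as a parameter are all assumptions.

```javascript
// Copy the text of every flagged EM into a BLOCKQUOTE placed before the
// paragraph that contains it. Returns the number of pull quotes created.
function addPullQuotes(doc, flagClass) {
  var ems = doc.getElementsByTagName("em");
  var created = 0;
  for (var i = 0; i < ems.length; i++) {
    var em = ems[i];
    // Match the flag class as a whole word within the class attribute.
    if ((" " + (em.className || "") + " ").indexOf(" " + flagClass + " ") === -1) {
      continue;
    }
    // Walk up the tree to the enclosing paragraph.
    var para = em.parentNode;
    while (para && String(para.nodeName).toLowerCase() !== "p") {
      para = para.parentNode;
    }
    if (!para || !para.parentNode) {
      continue;
    }
    var quote = doc.createElement("blockquote");
    quote.className = "pullquote"; // hypothetical class name, used for styling
    quote.appendChild(doc.createTextNode(em.textContent));
    para.parentNode.insertBefore(quote, para);
    created++;
  }
  return created;
}
```

In a page it would be called once the document has loaded, for example `addPullQuotes(document, "pull")`, with the CSS doing the rest of the formatting. Since the quoted text stays in the paragraph, the page reads the same with scripting turned off.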

Come on baby light my FIR

So there I was staring the FIR method in the face and asking, “Why Ziploc® the image in the CSS?”

It didn’t seem like reusable code to me. For every piece of content or text you wanted to use the technique on, you needed to add another declaration in your CSS. It becomes sort of impractical to use on dynamic sites, where content changes frequently. Don’t get me wrong, I love the idea from the standpoint of the issues it addresses, and the cleverness with which it does it, but the appeal wears off if I can only use it statically. I mean, I can’t really set this up for all H2 elements on a page, and I can’t apply it to a whole class of elements. Even the Revised Image Replacement techniques, culled at mezzoblue, don’t allow for dynamically achieving this effect. So what’s a codemonkey to do? Well, recode of course!

I went back to the beginning and asked, “What is the goal here? What points are we looking to address?” Well, we’re looking to achieve the following:

  1. Enhance the graphic presentation of pages through the use of images
  2. Keep the content of pages accessible to assistive browsing software, text-only browsers, as well as browsers with images disabled
  3. Promote the proper indexing of content by search robots, and other such data aggregators
  4. Keep the code/markup to a minimum, and avoid adding semantically ambiguous markup

I kept thinking, “Why not use an IMG element for an image?” After all, we’re trying to establish a technique for adding images to a page, while allowing the actual content, the text, to remain accessible. Why remove the IMG element from the document, only to code the image in elsewhere? Why not start with both, and then pseudo-eliminate the text? Since the code for placing images in pages already exists in (X)HTML, why not use it, and just think differently about its purpose for being there?

The purpose, in this case, is largely decorative. It conveys information to those that can view it, but it is supported by textual content, for those that can’t, or don’t wish to view images. It’s no longer the sole responsibility of the image to convey all the information. It now does so jointly, and doesn’t need to be defined in the same way a separate image might. By this, I mean alt or title text. The image needn’t be accessible, since it is there as a decorative enhancement, and is backed up by text, so treat it like any other decorative image and leave the alt attribute empty.

Point of fact, eventually, we are all going to be coding accessible image alternatives in a very similar fashion when we switch from using the IMG element, to the OBJECT element. [ed note—Yes, we know this is the second Mark Pilgrim document we’ve linked to. It couldn’t be helped.]

So instead of adding SPAN elements, and display:none to hide the text, I opted to simply cover it up with the image; similar to the Gilder/Levin method. [ed—We swear we only just noticed the similarity!]

There are two main flaws with the cover-up method. The first problem arises when using transparent images. It’s sort of difficult to hide something behind a pane of glass isn’t it? While I believe it’s possible to work around this in many situations, there will be those times when it simply won’t be workable.

Second, the possibility exists that the text you are trying to hide is longer than the image you are trying to hide it with. Again, this is a situation that can be worked around, but there is always that exception where it will be impractical. The caveat to using any method is to first make sure the implementation fits the needs of the project.

I sent my little mock-up to Dave Shea, just to see what he thought, and though he thought it was interesting, he didn’t feel it added much over using just an image. The reality is that’s true, but it’s true of the entire concept. However, the basis here is still the text, not the image. And this method uses a semantically correct element to insert an image into the document, still uses CSS to alter the presentation, and allows you to use it on any element or class of elements.

I know it’s just a cheesy little mock-up, but it may be a springboard to some other ideas. I may add some other examples and documentation to that page over the weekend.

Opera’s big mistake

Update! Turns out I spoke too soon. This problem occurred in version 7.20 build 3087. After finding this little problem last Thursday, I submitted a bug report to Opera. The latest version, 7.20 build 3144, does not experience this problem. I’d rather see them rev the version than silently change the build number, since they do not specify the build number on their download page. Anyway, this pretty much negates anything I subsequently had to say in this entry, so feel free to skip over it.

At least that’s the way I see it.

With the release of version 7.20, Opera has made a fundamental change in the way that their browser treats XHTML documents, and in my opinion the changes they have made are incorrect, and are ultimately changes that will hurt them, rather than help them.

I first noticed something was weird, after my recent upgrade at work. A script I had been working on for a website suddenly stopped working. I was pretty sure it worked prior to the upgrade, and after checking it on my home PC, I confirmed that it had. Now the trick was figuring out why it stopped working in this latest version of Opera. Was there something wrong with the code? I’m no JavaScript expert, but I was pretty sure that the code was valid. The script was based in the DOM, but wasn’t doing anything extensive, or highly manipulative. [ed—High levels of programming are out of our league.]

Validation is always a first check, and after confirming that, since I was working with XHTML, I decided to remove the XHTML headers to see if there was something about jumping into strict mode that was messing things up. Sure enough, the minute I removed the XHTML headers, the script worked again. I then even managed to narrow the problem down to the URI declared in the xmlns attribute of the html element. If I removed just the URI, the script worked again. Now I was off to Opera’s website to see what changed.

Referring to their specs page, if you scroll down to the section titled XML namespaces you will see the following:

The XHTML namespace (http://www.w3.org/1999/xhtml) triggers XHTML handling in Opera.

Well what the hell does that mean? If they already support DOCTYPE switching, why do they need to trigger XHTML handling by way of the namespace? Especially since according to the XHTML spec, a namespace declaration in the form of the xmlns attribute is required for strict XHTML document conformance. It’s not as though I can leave that out. That aside, what does Opera mean when they say XHTML handling?

Again, referring to their specs, their support for XHTML is extensive, although for some odd reason they state that they don’t support the script element. Was I asleep when they deprecated the script element in XHTML? As far as I can tell, the script element is part of the XHTML specification, and Opera should support it. But I digress.

All in all I didn’t really find anything at the Opera website that told me specifically what had changed. I just came away with the general sense that they were treating XHTML pages very strictly, and like XML, so it was back to my script to see if I could debug what Opera was doing.

Where I found my script was failing was when I was looking for nodeNames. I was doing a pretty standard evaluation against a nodeName and I was getting nothing. In reality, it wasn’t that I was getting nothing, it was that Opera wasn’t finding a match.

In this example page, Opera (v7.20) fails to evaluate to true for the five P elements on the page. The piece of code that no longer works is the following:

if (obj.childNodes[j].nodeName.toLowerCase() == 'p') {
	// Do something with the matching P element
}

Why? Because Opera is no longer returning just the nodeName in XHTML. It is returning the nodeName plus its namespace prefix (example page 2). So instead of returning just p it returns html:p. Now for the first mistake. I’ll grant you that technically speaking, treating the document as XML, returning the namespace prefix is probably correct; however, HTML is not the namespace of the document or the element in question. XHTML would have been the proper prefix to return. Under XHTML, these elements are not HTML elements, they are XHTML elements. XHTML is a reformulation of HTML in XML. XHTML is not XML with HTML elements in it. So right there, Opera is wrong.
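For anyone bitten by the same behaviour, one defensive workaround is to compare only the part of the nodeName after any prefix. The sketch below is not the script from the example page; the helper names `localNameOf` and `isElementNamed` are made up for illustration.

```javascript
// Strip an optional namespace prefix ("html:p" becomes "p") and lowercase
// the result, so "P", "p" and "html:p" all compare equal.
function localNameOf(nodeName) {
  var name = String(nodeName);
  var colon = name.lastIndexOf(":");
  return (colon === -1 ? name : name.slice(colon + 1)).toLowerCase();
}

// True when the node is an element (nodeType 1) with the given local name.
function isElementNamed(node, name) {
  return node.nodeType === 1 && localNameOf(node.nodeName) === name.toLowerCase();
}
```

The failing test would then read `if (isElementNamed(obj.childNodes[j], 'p'))`, which behaves the same whether or not the browser reports a prefix. It works around the symptom only and says nothing about which prefix, if any, the browser ought to return.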

Next, returning the prefix is basically incorrect for XHTML, since XHTML currently does not support multiple namespaces. And is this DOM 2, or are we only talking about DOM 1 here?

Lastly, with the vast move to XHTML, unless designers and developers are aware of this change, Opera is going to be breaking scripts all over the web, and designers/developers are not going to be too happy about rewriting their code just to support Opera.

Am I incorrect in my assessment? Is Opera approaching this correctly, and is it the thinking behind the scripting that’s wrong? How about some of the DOM gurus chiming in. By all means, somebody correct me if I’m wrong, ’cause otherwise I’m just going to dismiss Opera as a viable browser again.

Form styling with CSS and the DOM

So to make up for lost time I thought I’d share another little thing I’ve been playing with. [ed—No, it’s not that!] It’s a CSS/DOM enhanced form.

The form is basically styled with CSS, and then I use the DOM for a couple of subtle enhancements. What the DOM script does is look for all INPUT and TEXTAREA elements on the page, and then set the default background color, as well as a background color for when the element has focus. It filters for text input fields and leaves button types alone. It will also set the initial value for the field, assuming you have a value in the title attribute. For accessibility reasons it’s best to set the initial value anyway, and if you do, the title attribute should match. When the element has focus, the value of the field will clear, and unless you change the value, it will reset to its default (the title attribute).
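The behaviour described above might look something like the following. It is a sketch written from the description and not the script behind the demo form: the function names, the `colors` object, and the decision to treat only `type="text"` inputs as text fields are assumptions.

```javascript
// Text inputs and textareas qualify; buttons, checkboxes and the like do not.
function isTextField(el) {
  var tag = String(el.nodeName).toLowerCase();
  if (tag === "textarea") return true;
  if (tag !== "input") return false;
  return String(el.type || "text").toLowerCase() === "text";
}

// Set the colors and the default value, then swap them on focus and blur.
function enhanceField(el, colors) {
  var hint = el.title || "";
  el.style.backgroundColor = colors.normal;
  if (hint && !el.value) el.value = hint;
  el.onfocus = function () {
    el.style.backgroundColor = colors.focus;
    if (el.value === hint) el.value = ""; // clear only the untouched default
  };
  el.onblur = function () {
    el.style.backgroundColor = colors.normal;
    if (el.value === "") el.value = hint; // nothing typed, restore the default
  };
}

// Enhance every text field in the document; returns how many were touched.
function enhanceForms(doc, colors) {
  var tags = ["input", "textarea"];
  var count = 0;
  for (var t = 0; t < tags.length; t++) {
    var els = doc.getElementsByTagName(tags[t]);
    for (var i = 0; i < els.length; i++) {
      if (isTextField(els[i])) {
        enhanceField(els[i], colors);
        count++;
      }
    }
  }
  return count;
}
```

A page would call something like `enhanceForms(document, { normal: "#fff", focus: "#ffc" })` once it has loaded; the two color values here are placeholders.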

It’s just a little DOM fun, and it could probably be cleaned up some, and actually made a bit more modular, but hey, I never promised you a rose garden. Feel free to play, tweak, use, whatever. Usual disclaimer applies, meaning it works on all my Win browsers. Mac users, you’re on your own.

CSS table-ish list

So I was trying to work out a CSS version of some information that was previously formatted in a table. This CSS formatted list is what I came up with so far. [Ed note—Although coded using fixed widths, it can be modified to be variable width. And don’t think we don’t know how to do it!]

The CSS works in all my Win browsers (IE, Moz, Op, Fb, NN4.x excluded), which means it probably fails miserably on Mac. I had originally coded the CSS slightly differently, but it failed in IE5.5, so a couple of tweaks and boom, one more browser nobody uses anymore.

It doesn’t look like much less code to me than the table version, but I do think it’s more semantic, has more going for it from an accessibility standpoint, and I hope y’all can make some use of it.

Update! So I re-checked the page in Opera on my home PC and everything looked fine. It’s virtually identical to my work PC as far as OS and browser versions, so I can’t understand why I saw a difference at work.

In any case, I didn’t update the page, so by way of compensation I added the variable width version. [ed—Like that was hard!] I wouldn’t necessarily say that it stays crispy in milk, but it’s good down to a width of about 400px.

Update 2! It turns out that my work PC had Opera v7.11 installed, while my home PC had v7.03. Opera 7.11 showed a little glitch in the presentation, but an upgrade to 7.20 fixed it. Yay for Opera, and yay for my browser-addled brain.