mnot’s blog

“Design depends largely on constraints.” — Charles Eames

Tuesday, 14 April 2009

Counting the ways that rev="canonical" hurts the Web

I had a lovely holiday weekend in Canberra with the family, without Web access. Perhaps I’ll blog about that soon — Canberra being in my opinion one of the nicest overlooked cities in the world — but that will have to wait. Going offline for a few days always brings a certain dread of what one’s inbox will hold when you get back, and this one was no exception.

That’s because while I was watching the kids rolling down the grass slope on top of Parliament House, rev="canonical" started to gain some serious momentum, billing itself as a way to shorten URLs that “doesn’t hurt the Internet.” In my opinion, this is an interesting idea with a very unfortunate execution that’s bad for the Web, and I’m going to enumerate the reasons here.

1. Misapplied Trust

If a resource with URL A has a rev="canonical" link to URL B, A is essentially saying that it’s the canonical URL for B. In other words, anybody who uses that information is trusting A to make assertions on behalf of B. A naive consumer of these links will allow A to put words in B’s mouth no matter what their real relationship is; http://evil.attacker.org/ can say that it’s the canonical link for http://innocent.bystander.com/.

Or, more subtly, http://example.edu/~user1/ can say that they’re the canonical link for http://example.edu/~user2/. The important thing to note here is that A isn’t asserting what its relationship to B is; it’s asserting what B’s relationship to A is, which it may or may not have the right to do.
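One way a cautious consumer might guard against this is to ignore a rev="canonical" claim unless the target confirms it with its own rel="canonical" link pointing back. The Python sketch below assumes exactly that policy; the function name and the confirmation rule are illustrative and not part of the rev="canonical" proposal, and fetching and parsing the target page is left to the caller.

```python
def accept_rev_canonical(claimant_url, target_url, target_canonical_urls):
    """Return True only if the target itself names the claimant as canonical.

    claimant_url: page A, which carries rev="canonical" pointing at target_url.
    target_url: page B, the resource A is making a claim about.
    target_canonical_urls: hrefs of rel="canonical" links found on B itself
        (hypothetical input; the caller is assumed to have fetched B).
    """
    if claimant_url == target_url:
        # A resource may always make statements about itself.
        return True
    # Otherwise, honour A's claim only when B agrees with it.
    return claimant_url in target_canonical_urls


# The attacker's claim is rejected because the bystander never confirmed it.
print(accept_rev_canonical(
    "http://evil.attacker.org/",
    "http://innocent.bystander.com/",
    [],
))
```

A naive consumer is the same function with the last line replaced by `return True`, which is precisely what lets A put words in B’s mouth.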

An easy answer to this is that “we only are using canonical to mean that it’s a short link” — but the point is that the canonical link relation already has a de facto meaning, and it’s not being used for that purpose. Reusing canonical for this purpose only dilutes its semantics, reducing its value.

2. Rev is a Trap

#1 scratches at the surface of a much deeper problem — that the rev mechanism is very powerful and very tricky, because while it doesn’t change the semantics of a link relation, it does change the relationships between the parties, with many consequences that aren’t obvious. Compounding this confusion is the single-letter difference between rev and rel; people often use them interchangeably.

99% of the time, rev gets people into trouble. That is why it never really took off, and why both HTML5 and my Link draft have deprecated it. Using rel and a separate relation is much clearer and much less prone to misinterpretation.
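To make the role reversal concrete, here is a small Python sketch that turns a link into a statement of the form "subject has relation target" and records which page made the statement. The type and function names are made up for illustration; the only thing taken from the discussion above is that rev swaps the two ends of the link without changing the relation itself.

```python
from typing import NamedTuple


class LinkStatement(NamedTuple):
    subject: str      # the resource the statement is about
    relation: str     # e.g. "canonical"
    target: str       # the resource named as <relation> of the subject
    asserted_by: str  # the page whose markup contained the link


def normalize(page_url, attribute, relation, href):
    """Express a rel or rev link found on page_url as a LinkStatement."""
    attribute = attribute.lower()
    if attribute == "rel":
        # rel: the page describes itself.
        return LinkStatement(page_url, relation, href, page_url)
    if attribute == "rev":
        # rev: same relation, but the subject is now the *other* resource.
        return LinkStatement(href, relation, page_url, page_url)
    raise ValueError(f"expected 'rel' or 'rev', got {attribute!r}")


def speaks_for_itself(statement):
    """True when the page making the statement is also its subject."""
    return statement.subject == statement.asserted_by
```

With rel, `speaks_for_itself` is always true; with rev it is false whenever the two URLs differ, which is the non-obvious consequence that tends to trip people up.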

3. Unilateral Action

Finally, rev="canonical" has been launched with a Web site, a blog, and a Slashdot article, but AFAICT with zero discussion within the communities that care about this (HTML5, HTTPbis), and without coordination with the people who defined* the canonical link relation.

Launching a new library, service or Open Source project with these sorts of Web 2.0 marketing techniques is pretty much business as usual these days, so it’s understandable that the same techniques have been used here.

However, it’s important to understand that protocol and markup elements aren’t a standalone project — they’re very much the shared commons that keep us communicating with each other, instead of past each other. By unilaterally repurposing the semantics of an existing element, the already shaky agreement that our computers have when talking to each other just got shakier, with another special case.

Some Suggestions (in both directions)

OK, enough pointing out what’s wrong. The idea of rev="canonical" is a good one; the only thing that really needs to change is the syntax. Something as simple as rel="shorturl" should do the trick, i.e., allowing URL A to assert that it’s also available through URL B, which is shorter than A.
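As a rough illustration of how little a consumer would need, the following Python sketch pulls rel="shorturl" links out of a page using only the standard library. The relation name "shorturl" is just the suggestion made above, not a registered value, and the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class ShortUrlFinder(HTMLParser):
    """Collect href values of <link> and <a> elements whose rel includes 'shorturl'."""

    def __init__(self):
        super().__init__()
        self.short_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attributes = dict(attrs)
        # rel holds a space-separated list of tokens, compared case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "shorturl" in rel_tokens and href:
            self.short_urls.append(href)


def find_short_urls(html_text):
    """Return every short URL that the page advertises for itself."""
    finder = ShortUrlFinder()
    finder.feed(html_text)
    finder.close()
    return finder.short_urls
```

Because the page is only making a statement about itself ("I am also available at B"), the consumer doesn’t have to decide whether A is entitled to speak for anyone else.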

It does appear that some people have made that suggestion, but because the discussion has been spread across Twitter, at least one Google Group and countless blogs, it’s impossible to tell what the real state of things is. I’ve seen at least one example of someone not agreeing with the rev="canonical" approach, and as a result starting a new group to discuss an alternative, to “come to consensus.” The problem, of course, is that that’s the consensus of a very highly self-selective group, and not representative of a wider community. This is where reusing established infrastructure such as the IETF APPS-discuss list or the W3C www-talk list would come in handy.

To be fair, the means of extending the Web in this fashion aren’t readily apparent to those who aren’t part of the process, so it’s not surprising that they just went and tried to do it. We’re trying to fix this somewhat for links in the link draft, but I’m sure it could do a better job. Any suggestions are welcome, either to me directly or on the HTTPbis list.

Stepping back, I think this sort of thing is going to happen more often, not less. Microsoft and Netscape unilaterally extended the Web with MARQUEE and BLINK, and it was ugly, but the impact wasn’t nearly as bad as countless Web developers all extending the Web in their own way could be. The onus is clearly upon organisations like the W3C and IETF to make themselves as transparent and approachable to developers as possible, so that the latent experience and expertise in them can be drawn upon by these innovators, instead of being seen as either irrelevant or impediments.

* Disclaimer: I work for one of them, but have nothing to do with that department; I found out about canonical after they announced it.


Filed under: HTTP Standards Web

30 Comments

DeWitt Clinton said:

I came to the same conclusion myself about rev="canonical".

How do you feel about rel="alternate" type="text/html", plus the link header extension? If aggregators look and they find a short-enough alternate URL for the resource they can use it, otherwise the links will just be ignored and no harm done.

Thoughts?

-DeWitt

Wednesday, April 15 2009 at 1:27 AM +10:00

jim said:

Maybe I'm missing something, but I think your first point is wrong. Rev="canonical" doesn't work across domains. From google: "Google currently will take canonicalization suggestions into account across subdomains (or within a domain), but not across domains. So site owners can suggest www.example.com vs. example.com vs. help.example.com, but not example.com vs. example-widgets.com."

Wednesday, April 15 2009 at 1:44 AM +10:00

Colin G said:

@jim,

Look at mnot's second example: "http://example.edu/~user1/ can say that they’re the canonical link for http://example.edu/~user2/". It is within the same domain, so it would be accepted. Yet user1 may not have the right to make assertions for user2.

Wednesday, April 15 2009 at 2:02 AM +10:00

Philipp Lenssen said:

Jim, for what it's worth, the quote from Google you're citing is in reference to rel="canonical", not rev="canonical"...

Wednesday, April 15 2009 at 2:05 AM +10:00

DeWitt Clinton said:

One last quick thought: if rel="self" makes its way from Atom to HTML5, then that would be preferable over rel="alternate", as the resource itself should be identical at both URLs. Both are preferable (imho) over a new link relation value just for changing the length of the URL.

Wednesday, April 15 2009 at 2:14 AM +10:00

Paul Hammond said:

I'm not a huge fan of the rev="canonical" syntax either - a rel="shorter" syntax seems clearer and more explicit to me.

Still, it feels like you're being unfair to the people involved when you complain about the lack of IETF/W3C involvement. From what I remember, Atom started out as a weblog post (http://www.intertwingly.net/blog/1472.html) and wiki. A new mailing list was set up, tools were built and a full spec was published long before the project moved under the IETF.

What was different then?

Wednesday, April 15 2009 at 3:05 AM +10:00

Ian Hickson said:

mnot: Entirely agreed. This kind of thing is why I'm skeptical of the proposals to allow people to unilaterally extend HTML, too, though the people pushing for that don't seem to agree. :-(

Wednesday, April 15 2009 at 3:52 AM +10:00

Sam Johnston said:

G'day,

Good to see some "real" standards heads getting involved. As the person who kicked off the shortlink Google Group I can assure you that the purpose was to get these guys in the same room to get consensus between themselves before submitting something to IETF/IANA/W3C/WHATWG/etc. I'm not getting very far though - it seems the main guys are mates and they're still very intent on flogging the 'rev="canonical"' dead horse.

I think rel="shortlink" would do the trick (short_url/short uri/shorturl/shorturi/etc. is too confusing) and also think there's some value in having a dedicated link relation, leaving rel="self" for what it was intended for (e.g. giving a link to that specific resource, complete with search, category, highlighting and other cruft that should be kept out of rel="canonical").

I've whipped up something of a shortlink specification which explains exactly how the thing should work... feedback welcome here, there, privately or in the group.

Cheers,

Sam

Wednesday, April 15 2009 at 4:40 AM +10:00

DeWitt Clinton said:

@Sam - the arguments in favor of rel="shortlink" are understood, but be prepared to meet with some healthy resistance with the introduction of *any* new rel value when an old one will do. Especially in the case of a new rel value that exclusively serves the needs of something that a fair number of "real standards heads" are probably highly skeptical of in the first place (i.e., URL shorteners).

Wednesday, April 15 2009 at 4:47 AM +10:00

Sam Ruby said:

Ian, if by "people" you are including me, I think you are mischaracterizing the argument. Rather than rehashing, here are two assertions:

1) Whether or not you, me, the IETF, the W3C or anybody else "allows" somebody to unilaterally extend HTML is immaterial. People will do so if they find that it serves a purpose.

2) The current WHATWG draft already enables a number of extensions: http://wiki.whatwg.org/wiki/FAQ#HTML5_should_support_a_way_for_anyone_to_invent_new_elements.21

In this specific case, I think that the evidence shows that the extensibility provided by rel is more than sufficient for this need, and that there are issues with rev that should be seriously considered. Kudos.

Something that should be considered for the HTML draft: Mark indicated that both his draft and HTML5 deprecate the rev attribute; in actuality, it is only his draft that does so, and HTML5 omits the attribute entirely. I think that the intended audience would be better served if this attribute were documented and deprecated, and the rationale for doing so were included in the HTML5 spec.

Wednesday, April 15 2009 at 4:48 AM +10:00

Sam Johnston said:

@DeWitt: That's fine but I'm not sure "self" does meet the need, at least not without losing other (possibly important) functionality.

As for short URLs, there is certainly a need for them (as evidenced by the sheer number). Outside of arbitrary limitations imposed by microblogging you have mobile Internet to think about (e.g. SMS) and of course the constraints of the physical world (e.g. any time you need to manually enter a URL, like when it's printed or spoken).

I'd argue that specifying different links for self, canonical and shortlink in the same document makes sense in some instances, for example:

self: http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
canonical: http://www.example.com/product.php?item=swedish-fish
shortlink: http://www.example.com/swedish-fish

Sam

Wednesday, April 15 2009 at 5:12 AM +10:00
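Sam's three-way split could be consumed along the following lines. This Python sketch assumes the page's link relations have already been collected into a dictionary; the function name and the preference order (shortlink, then canonical, then self) are illustrative choices, not something the thread settled on.

```python
def url_for_sharing(links, page_url):
    """Pick the URL to hand out, preferring the most purpose-built one.

    links: mapping of relation name to href, as declared by the page itself.
    page_url: the URL the page was actually fetched from, used as a last resort.
    """
    for relation in ("shortlink", "canonical", "self"):
        href = links.get(relation)
        if href:
            return href
    return page_url


links = {
    "self": "http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678",
    "canonical": "http://www.example.com/product.php?item=swedish-fish",
    "shortlink": "http://www.example.com/swedish-fish",
}
print(url_for_sharing(links, links["self"]))
```

A search engine would make a different choice from the same data (it wants "canonical"), which is the argument for keeping the three relations distinct.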

kellan said:

As folks have addressed, the idea of these sorts of conversations emerging in the IETF/W3C seems pretty silly. Rough consensus and running code, and all that.

That said, as I've stated, I'm not particularly committed to rev="canonical". Much like Les I'm more interested in the idea. RevCanonical has done its job of being both interesting and catchy. (I've got hard quantifiable data that if the proposal had been say, rel="short_url" we wouldn't be having any conversation at all)

Lastly I think any variation of rel="alternate" & type="text/html" is problematic, and a header based solution is unlikely to get adopted.

rel="possessor_of_a_shorter_uri" would be fine, or one of many variations proposed.

Wednesday, April 15 2009 at 6:17 AM +10:00

DeWitt Clinton said:

@Kellan - I'm worried that a non-header based solution is even more unlikely to be adopted. Parsing HTML may be out of reach for a site like Twitter, which has to process 1000's of URLs per second.

Actually, even making HEAD requests might be out for them as well. Best case fetch-and-parse time is still on the order of 1s (so 1000 parallel fetches and parses going on at all times), and worst case is closer to 10s or even a timeout. And since those requests will queue up before the message can be delivered, I'm not sure how we can reasonably expect sites like Twitter to use rev="canonical".

Wednesday, April 15 2009 at 6:57 AM +10:00
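The header-based variant DeWitt describes earlier in the thread (rel="alternate" with type="text/html", carried in a Link header) can be checked without touching the HTML body at all. The Python sketch below is a deliberately simplified Link header reader, assuming well-formed input with no commas or semicolons inside quoted parameter values; the function names and the length threshold are invented for the example.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s",;]+))*)'
)
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s",;]+))')


def parse_link_header(value):
    """Return a list of (target, params) pairs from a Link header value."""
    links = []
    for link_match in _LINK_RE.finditer(value):
        target, raw_params = link_match.group(1), link_match.group(2)
        params = {}
        for param_match in _PARAM_RE.finditer(raw_params):
            name = param_match.group(1).lower()
            quoted, bare = param_match.group(2), param_match.group(3)
            # Keep the first occurrence of each parameter name.
            params.setdefault(name, quoted if quoted is not None else bare)
        links.append((target, params))
    return links


def find_short_alternate(header_value, max_length):
    """Return the first HTML alternate whose URL fits in max_length, else None."""
    for target, params in parse_link_header(header_value):
        rel_tokens = params.get("rel", "").lower().split()
        is_html = params.get("type", "").lower() == "text/html"
        if "alternate" in rel_tokens and is_html and len(target) <= max_length:
            return target
    return None
```

If no short-enough alternate is advertised, the function returns None and the consumer simply carries on as before, which matches the "ignored and no harm done" behaviour DeWitt had in mind.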

l.m.orchard said:

@DeWitt: On the other hand, a header-based solution is out of reach for a good number of content publishers who can't configure custom headers.

The prior art for the HTML parsing, though, is feed auto-detection. I'm willing to bet it's not much worse than using a 3rd party shortener API.

@Kellan: What you said, with my only addition being that I'm actually interested in generalizing the scope of this beast beyond just shorter URLs to allowing publishers to provide alternate URLs that satisfy a range of criteria (including shortness, readability, mobile device entry, etc)

Beyond that, I don't have more enthusiasm for rev="canonical" if something else proves more useful. If anything, I like that it's more general than rel="short.*"

Wednesday, April 15 2009 at 8:07 AM +10:00

Julian said:

@DeWitt: Why would Twitter need to parse the HTML? On Twitter, short URLs are used; they return the original link in a Location header.

Wednesday, April 15 2009 at 8:27 AM +10:00

Sam Ruby said:

re: I'm willing to bet it's not much worse than using a 3rd party shortener API.

I'll take that bet. :-)

Twitter doesn't support an arbitrary 3rd party shortener API, they support a specific 3rd party shortener API. In fact, for all I know they have a special purpose interface, but even if they don't, the general purpose interface produces data that is (literally) trivial to parse. Here, let me demonstrate:

http://tinyurl.com/api-create.php?url=http://intertwingly.net/blog/

One thing I've often wondered: if my "tweet" already fits in 140 characters, why does twitter bother to shorten my URI in the first place?

Wednesday, April 15 2009 at 8:31 AM +10:00

sethaurus said:

REL is to `GOTO` as REV is to `COMEFROM`.

Wednesday, April 15 2009 at 11:34 AM +10:00

l.m.orchard said:

@Sam: Hmm. This is why I don't play cards for money. Or at all, really.

Wednesday, April 15 2009 at 1:35 PM +10:00

Mark Nottingham said:

DeWitt - I think a new relation is probably justified; 'self' is different, it points to "this." Something with "short" in it is probably appropriate. OTOH if people really want to use an existing relation, 'alternate' is probably the best -- after all, they can count the characters in the URL themselves to figure out how short it is...

Paul - it's not that I want W3C/IETF involvement, per se, it's that it needs to be discussed in an open way where people who have both the interest and expertise can comment. This was very telling in some of the Twitter discussions I saw, where it was quite clear that the design of rev="canonical" was pretty much a shot in the dark. FWIW, I didn't intend to be any harder on them than on the IETF and W3C for not being approachable...

The difference with Atom, BTW, was that it started with discussions, and there was an active effort to make sure the right people turned up; it didn't start with implementations first, questions later.

Kellan - are you aware of the irony of your first statement, considering where the phrase 'rough consensus and running code' originated?

Interesting and catchy are great marketing attributes, not technical ones.

Wednesday, April 15 2009 at 1:50 PM +10:00

DeWitt Clinton said:

@l.m.orchard - but feed URL autodiscovery doesn't happen at the scale or speed of incoming Twitter messages. I would love to hear one of their engineers weigh in about whether they can fetch and parse every URL they see in a tweet to look for link tags. In real time.

Sam Pullara suggested YQL (which is very neat!) -- but can even YQL handle the 1000's of qps of fetch/parses that Twitter would require?

My thesis here is that if Twitter doesn't implement it, we're probably not solving the problem in the right way.

Wednesday, April 15 2009 at 2:26 PM +10:00

kellan said:

@mnot I've even heard David Clark say it.

Wednesday, April 15 2009 at 3:23 PM +10:00

Sam Johnston said:

@mnot: "alternate" refers to the content itself, not the link to it. The same can be said for adjectives like "short" and "shorter". Further, "short_url/short url/short-url/shorturl/shorturi/etc." are too confusing which is why I settled on rel="shortlink" (which is gaining in popularity). There's a bunch of information in its Google Code wiki.

Sam

Friday, April 17 2009 at 12:31 AM +10:00

Bill de hÓra said:

"One thing I've often wondered: if my "tweet" already fits in 140 characters, why does twitter bother to shorten my URI in the first place?"

I assume because there's no decision logic. Being code, it would have to bother to check.

Friday, April 17 2009 at 5:29 AM +10:00

Erik Vold said:

@mnot In set theory, the term "canonical" identifies an element as representative of a set. Therefore rev=canonical would mean the canonical represents the href, not what you said in #1. Furthermore the page would have rel=canonical if it was not the canonical, and different processes of canonicalization (c14n) can then be performed, such as just trust or trust+verify.

rev=canonical would not have to be used for every url in the canonical's set, the publisher would only need to point out the rev=canonical's that may be of interest to the user, and the user can pick a short one.

The difference between rev and rel is obvious, and you don't give people enough credit for the long term. The diff is night vs day, black vs white, it's obvious, and the internet is still a baby, give it time to understand. Don't confuse publishers by having @rev in html 4 then deprecate it in html 5, that causes needless confusion, because people will I don't know.. argue about rel equivalents like rel=short* and end up with values that are less than the original value (rev=canonical). And are you seriously bringing up the one character issue? you are a programmer correct? that's #2 dealt with. I also want to say that with @rev isn't the inevitable spiral of rel values obvious?

For #3 I point to exhibit A the foolish deprecation of @rev and exhibit b the foolish response to rev=canonical as evidence that those writing the standards are going off track..

The early rev=canonical adopters were correct, the opposition rel=short* adopters were incorrect, and in a state of confusion. IMO.

@l.m.orchard very cool ideas!

Tuesday, April 21 2009 at 10:23 AM +10:00

Mark Nottingham said:

I think I'm going to just let that last comment stand on its own merits.

Tuesday, April 21 2009 at 2:45 PM +10:00

Erik Vold said:

I am interested in hearing counterarguments.

Tuesday, April 21 2009 at 3:00 PM +10:00

Erik Vold said:

I have written a more complete rebuttal here: http://erikvold.com/blog/index.cfm/2009/4/21/rev_canonical_good

Tuesday, April 21 2009 at 4:16 PM +10:00

Sam Johnston said:

@Erik: Les has since dropped his support for rev=canonical. He's got some useful feedback and good ideas around giving users choice (e.g. http://arst.ch/pg vs http://arstechnica.com/gameboy) but in reality the results are always short (shorter than the canonical URL anyway).

I think Mark is right to let your comment "stand on its own merits"... claiming that @rev=canonical constitutes "evidence that those writing the standards are going off track" is utterly ridiculous and rather offensive given he happens to be one of those people (who do you suppose wrote the ID we're using to expose these links over HTTP?).

Anyway just in case there was any doubt that using rev=canonical (or rel=short[_- ]?ur[il] for that matter) is harmful I'm giving warnings for both at http://rel-shortlink.appspot.com/.

Sam

Tuesday, April 21 2009 at 8:52 PM +10:00

Erik Vold said:

@Sam well if what I said is considered offensive to anyone, I would say they are far too sensitive to be on the internet, and they should ignore me.

Wednesday, April 22 2009 at 2:01 AM +10:00

Tab Atkins Jr. said:

@Erik Vold

Okay, you want a rebuttal. Here goes:

1. Using unrelated concepts to try to prove things doesn't work. @rev=canonical has absolutely nothing to do with the set-theoretic idea of canonicalization. It's simply supposed to be the reverse of @rel=canonical. The set-theoretic idea may of course come into play when defining @rel=canonical. In this case, it doesn't, at least in the way you are defining it. @rel=canonical simply says that the page you are currently viewing, no matter what craziness may be in the actual url you followed, has a canonical url of "foo". It's not a relation from member to set, but rather from member to member (where the set is "urls that can summon up an essentially identical page").

2. The difference between @rel and @rev may be obvious, but it's not intuitive, nor is it reliably used in practice. Actual studies of @rev use show that it is *massively* misused or badly used.[1] As well, @rev's overall use is minuscule compared to @rel. The confusion stemming from the tiny number of authors who actually use @rev (and the even tinier number who use it *correctly*) is nothing compared to the confusion *saved* by removing it from the language. Any place @rev appears you can substitute @rel with an opposite meaning; @rev is purely an attribute of convenience, and experimental evidence shows that this convenience is far outweighed by the inconvenience caused by authors using it incorrectly. @rev is polluted to uselessness right now, and there's no indication that anyone cares enough to make an effort to unpollute it.

3. @rel=canonical has some very particular implications that are *not* safe to reverse. It is an assertion by url A that url B is its canonical form; it's asserting that you can use url B and be sure that you'll get essentially the same resource. This is fine wrt trust, because url B isn't hurt by other resources claiming it as their canonical form; if the assertion is fully trusted, url A just gets redirected to url B, hiding any distinct resource that would have been pointed to by A. When you reverse the relationship, though, you get url A asserting that *it* is the canonical form for another url B. It's making an assertion about another resource, and there's no way to tell that url A has the authority to make such a pronouncement. @rel=shorturl, while functionally similar to @rev=canonical, is semantically distinct - it's url A offering a suggestion that you may use the shorter url B to get to it. The resource pointed to by url A (and hopefully B) can then use @rel=canonical to define its own canonical url C, which may be distinct from A or B.

[1]: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-November/017353.html

Wednesday, May 20 2009 at 12:49 PM +10:00

Creative Commons