RSS, Subscribers and Business Models (oh, my!)

Sunday, 25 May 2003

Tim Bray thinks out loud about mechanisms to allow RSS subscribers to be counted. His poison of choice is adding a query component to the URI in the Referer header.

I don’t think that this is such a hot approach, because it breaks the semantics of the Referer header; instead of being “where I got this link”, the Referer becomes something closer to the User-Agent header, with user-specific information stuffed in. This messes up log analysis programs and causes yet another special-case headache to deal with.

Also, “invisibly” counting subscribers/eyeballs/whatever isn’t unique to RSS; people have tried to do this on the Web since day one, and some have even convinced themselves that they can do it. Solving it in a one-off way isn’t good.

But let’s brush that aside; I object to the notion that it’s necessary for RSS to have a business model. It’s already doing quite nicely without one. HTML certainly didn’t do too badly. If the benefits that the format brings aren’t compelling enough on their own, I don’t think a business model will help!

(Yes, I know that he’s writing about weblogs in particular, but in context he seems to be talking about RSS in general; RSS > weblogs.)

So what if people really, really want to do this? Easy fix. Send a different HTTP header with the hash information in it.

Heck, there’s one already defined: RFC2616, section 14.22 defines the From header, which seems to be tailor-made for this purpose. I even think there’s enough semantic slack in there to allow using a hash, as long as it’s also syntactically an e-mail address.

This way, Referer continues to tell where the link was sourced from (in an aggregator’s case, Referer shouldn’t be present when fetching RSS), and User-Agent stays specific to the agent, not the user.

For example:

GET / HTTP/1.1  
Host: frobnoz.example.org  
User-Agent: TheAggregator/1.0  
From: dfasdfawef@asddafaefafe.asfd

The e-mail address is generated by hashing the userinfo and each segment of the hostname. It’s not a valid e-mail address, but it is syntactically correct, and serves the desired function.
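
As a rough sketch of that hashing step (not anything an aggregator actually ships), something like the following Python would do. The function names, the choice of SHA-256, and the 10-character truncation are all my own assumptions for illustration; the only idea taken from above is hashing the userinfo and each hostname segment separately.

```python
import hashlib


def _short_hash(text: str, length: int = 10) -> str:
    """Hex SHA-256 digest of text, truncated to `length` characters.

    SHA-256 and the truncation length are assumptions, not a proposal.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def hashed_from(address: str) -> str:
    """Hash the userinfo and each hostname segment of an e-mail address."""
    userinfo, _, hostname = address.rpartition("@")
    if not userinfo or not hostname:
        raise ValueError(f"not an e-mail address: {address!r}")
    # Host names are case-insensitive, so normalise before hashing.
    segments = [_short_hash(part) for part in hostname.lower().split(".")]
    return f"{_short_hash(userinfo)}@{'.'.join(segments)}"


if __name__ == "__main__":
    # The result keeps the user@host.tld shape, so it fits in a From header.
    print("From:", hashed_from("jane@example.org"))
```

The same input always produces the same value, so a server can count distinct From values without ever seeing the real address.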

Another approach would be to define a new header that is solely “hash of user identity”, just as FOAF whitelisting does it. Maybe From-Hash?

Aggregator vendors, go forth and propagate… the From header, that is.

Update: Brent suggested the User-Agent header, and Tim likes it because UA tends to show up in Web server logs more often.

The only reason that this would be a problem, I think, is where you’re using a Web hosting provider who doesn’t allow you to customize the logs. We’ve heard this argument in various contexts (caching headers, previous discussions regarding the use of referrer), but it always boils down to this: hosting providers are the least common denominator of the Web, and therefore every protocol mechanism must stoop to their level. The Combined Logfile Format is your god; kneel before it!

I think this is short-sighted. Tim motivates this whole thing by saying that there are business cases which need subscriber counting; if that’s the case, won’t Web hosting providers respond to market forces by changing one line in their server configurations?

And what side effects will putting this information in UA have? As discussed above, I think this is a general problem, and therefore the solution should be generally applicable to the Web.

If this is true and UA is overloaded yet again, there’s all sorts of potential mischief. What about Web logfile analysers that compile UA statistics - will they give each user their own bucket, making their statistics meaningless? What about those magic UA detection libraries for server-side scripts - will they mis-identify the agent? What about the sites that ban per UA - will they accidentally catch these because they’re not aware of the new format?
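
To make the first of those worries concrete, here is a toy Python sketch of a log analyser that buckets requests by the exact User-Agent string. The function name and the sample strings are made up for illustration; real analysers are more elaborate, but many do start from the raw string.

```python
from collections import Counter


def agent_buckets(user_agents):
    """Count requests per exact User-Agent string, as a naive analyser might."""
    return Counter(user_agents)


# Three requests from the same (hypothetical) aggregator, without a user hash.
plain = ["TheAggregator/1.0"] * 3

# The same three requests, each carrying a different per-user hash.
tagged = [f"TheAggregator/1.0 (userhash={h})" for h in ("a1", "b2", "c3")]

if __name__ == "__main__":
    print(len(agent_buckets(plain)))   # one bucket for the agent
    print(len(agent_buckets(tagged)))  # one bucket per user
```

With the hash in the string, the "which agents visit my site" report degenerates into a list of individual users.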

Nobody knows the answers to these questions, because UA is just a big glob of semi-formatted data which only some very specialized heuristics can even attempt to match. Adding yet more data to such a format isn’t going to help; it’s only going to have unintended effects on all of the current uses of UA (of which there are plenty).

If this is really just for RSS, go ahead and use User-Agent; it’s not worth fighting about. I don’t think it is, though; a successful user-tracking mechanism will get used across the Web. If it’s going to happen on that scale, it should be done correctly, lest we shoot ourselves in the foot again.

Of course, I know that someone’s going to go ahead and do this anyway, so a couple of suggestions:

DON’T use XML inside an HTTP header; headers already have a syntax, and embedding XML in it brings no benefits and a lot of headaches. The syntax for UA is:

User-Agent      = "User-Agent" ":" 1*( product | comment )
product         = token ["/" product-version]
product-version = token
comment         = "(" *( ctext | quoted-pair | comment ) ")"
ctext           = <any TEXT excluding "(" and ")">
quoted-pair     = "\" CHAR

So it’s likely that the comment (delimited by parentheses) is the appropriate place. Perhaps something like

User-Agent: Foo/1.0 (Mozilla blah blah) (userhash=asfasdfdfadfa)

DO check with someone who’s written an HTTP header parser to make sure that the format you come up with is actually a legal HTTP header. Let’s avoid the mess that was Set-Cookie.
DO make sure it’s aligned with current practice (just one example) regarding U-A strings, so that it can be successfully combined with other agent strings.
For extra points, check with the existing browser capabilities community and see how loud they scream.
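
If nobody with a parser is handy, a small checker is not hard to sketch. The Python below is a minimal, hedged reading of the User-Agent grammar quoted above (products, versions, and nested comments, with whitespace allowed between elements); it is not a complete HTTP header parser, and the function names and the userhash= convention are assumptions taken from the example in this post rather than from any specification.

```python
import re

# token characters per RFC 2616 section 2.2: any CHAR except CTLs and separators
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def _comment_end(value, start):
    """Return the index just past the ')' closing the comment opened at start."""
    depth, pos = 0, start
    while pos < len(value):
        char = value[pos]
        if char == "\\":      # quoted-pair: skip the escaped character
            pos += 1
        elif char == "(":     # comments may nest
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                return pos + 1
        pos += 1
    raise ValueError("unterminated comment")


def parse_user_agent(value):
    """Split a User-Agent value into ('product', text) / ('comment', text) parts.

    Raises ValueError if the value does not match 1*( product | comment ).
    """
    parts, pos = [], 0
    while pos < len(value):
        if value[pos] in " \t":          # whitespace between elements
            pos += 1
        elif value[pos] == "(":
            end = _comment_end(value, pos)
            parts.append(("comment", value[pos + 1:end - 1]))
            pos = end
        else:
            match = TOKEN.match(value, pos)
            if not match:
                raise ValueError(f"unexpected character at offset {pos}")
            product, pos = match.group(), match.end()
            if value.startswith("/", pos):
                version = TOKEN.match(value, pos + 1)
                if not version:
                    raise ValueError(f"missing product-version at offset {pos}")
                product += "/" + version.group()
                pos = version.end()
            parts.append(("product", product))
    if not parts:
        raise ValueError("User-Agent needs at least one product or comment")
    return parts


def user_hash(value):
    """Return the hash from a '(userhash=...)' comment, or None if absent."""
    for kind, text in parse_user_agent(value):
        if kind == "comment" and text.startswith("userhash="):
            return text[len("userhash="):]
    return None


if __name__ == "__main__":
    print(parse_user_agent("Foo/1.0 (Mozilla blah blah) (userhash=asfasdfdfadfa)"))
    print(user_hash("Foo/1.0 (Mozilla blah blah) (userhash=asfasdfdfadfa)"))
```

Feeding it something like `Foo/1.0 <user hash="x"/>` raises a ValueError at the angle bracket, which is the XML-in-a-header problem in miniature.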

P.S. Tim, next time I make an objection, I’ll try harder. ;)

Mark Nottingham
