
Re: [syndication] robots.txt and rss



> Not true; the whole point of robots.txt is that I can say "if you identify
> yourself as the U-A Foo, don't go in this directory."

Yes Mark, robots.txt used in this manner would block any *User-Agent* carrying that
string.  And since copies of a given reader generally all share the same
vendor-supplied User-Agent, this would end up blocking any and all users running
that software.  Probably not a very friendly thing to be doing.  And before anyone
runs off thinking about how some reader programs abuse the Referer header,
robots.txt does nothing to support matching on it.
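
To make that concrete, here's a sketch of what such a rule looks like (the agent
name and path are made up for illustration):

    User-agent: FooReader
    Disallow: /

Every copy of that reader, run by every user, presents the same User-Agent
string, so that one rule shuts out all of them at once.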

> The problem is that robots.txt isn't any more refined than U-A and some
> primitive URI matching. What is needed is a way to introduce new criteria
> for matching, like time of day, IP address, and so forth, so that you can
> describe how you want the robot to behave, rather than just outright ban it.

Reinventing robots.txt seems like a misguided effort, especially when other
mechanisms already exist.  The conservation-of-effort principle suggests it's
easier to get the RSS client developers to follow the existing specs than it would
be to bastardize the robots.txt mechanism and force a whole other range of
developers to get involved.

The RSS spec already has indicators of how a reader should behave.  The RSS 1.0
module for syndication takes it a step further.  Getting the reader programs to
honor these would eliminate a GREAT many of these problems.  If a developer isn't
willing to support the spec, in what world would we ever hope to see them
supporting some twisting of the robots.txt mechanism?  I just don't see it as
being practical or worth the effort.
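
For the curious, these indicators are already in the published specs: RSS 0.91
defines <skipHours> and <skipDays>, and the RSS 1.0 syndication module adds
explicit update hints.  A minimal channel fragment (the values are just an
example):

    <channel xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
      ...
      <sy:updatePeriod>hourly</sy:updatePeriod>
      <sy:updateFrequency>1</sy:updateFrequency>
      <sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
    </channel>

A reader honoring this knows to poll at most once an hour, no robots.txt
contortions required.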

But beyond this effort, using HTTP header negotiation is also a good plan.
The "trouble" as I've seen it is "log watching fetishes".  Using content change
detection still adds a hit to a server log.  This seems to drive some folks to
the brink of insanity.  Ignoring the fact that the hit required little or no
bandwidth, the presence in the log is greatly misinterpreted.  I'd suggest to
folks that unless they're going to get into more sophisticated logging and
charting tools, they stop working themselves into a lather over this.
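
For anyone unfamiliar with what "HTTP header negotiation" means here, it's the
conditional GET defined in HTTP/1.1: the client sends back the validators it got
on its last fetch, and the server answers 304 with no body when nothing has
changed.  A sketch of the exchange (the date and entity tag are invented):

    GET /index.rdf HTTP/1.1
    Host: example.org
    If-Modified-Since: Sat, 05 Oct 2002 14:30:00 GMT
    If-None-Match: "abc123"

    HTTP/1.1 304 Not Modified

That 304 still lands in the access log, which is exactly the "hit" that gets
misread; the bytes actually transferred are close to nil.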

The next step of this insanity, I suppose, would be a blacklisting service for
abusive IP addresses or client apps.  God help us all.

> This is one of the use cases I had in mind for URISpace [1]. I doubt
> whether any robots.txt replacement will get much traction, however,
> because it's so widespread.
> 1. http://www.w3.org/TR/urispace

As you're no doubt aware, the level of developer understanding of caching and
proxying is very low.  It's unfortunate.  But a bit of digging reveals a TON of
work has already been done in these areas.  I'd strongly suggest others read up
on it.
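
As a starting point, the relevant machinery is just the standard HTTP/1.1
caching headers a server can attach to a feed response; a sketch, with
illustrative values:

    HTTP/1.1 200 OK
    Content-Type: text/xml
    Last-Modified: Sat, 05 Oct 2002 14:30:00 GMT
    ETag: "abc123"
    Expires: Sat, 05 Oct 2002 15:30:00 GMT
    Cache-Control: max-age=3600

Any well-behaved client or intermediary proxy can act on these today, without
anyone having to invent a new mechanism.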

-Bill Kearney