site-wide metadata discovery
There have been some nice comments about discoverability in general being broken, but they don't really help solve the problem. Here's an off-the-wall idea. What about adding functionality to a file that's already present, namely the robots.txt file? Since it's already tolerated in many cases, let's make it useful.
Rather than user-agent/disallow recordsets, it could use something like:
Site-Index:
Public-Feeds: myPublicFeeds.opml
According to the standard, unrecognized headers should be ignored, so this
shouldn't affect any "normal" robot/spider/crawler. But when an app came
along that did recognize this recordset, it could get the data it needs. No
new file name clutter, no link clutter. You could still use those if you
want, of course. :)
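To make the idea concrete, here's a rough sketch of how a feed-aware client might pull the Public-Feeds value out of robots.txt, while an ordinary crawler's parser just skips the lines it doesn't recognize. The field names are the ones proposed above; everything else is illustrative:

```python
def parse_robots(text):
    """Parse robots.txt into (field, value) pairs, one per line.

    Blank lines and comments are skipped. Unrecognized fields are
    kept, so an extension-aware client can look for them; a plain
    crawler would simply ignore them.
    """
    records = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        records.append((field.strip().lower(), value.strip()))
    return records

def public_feeds(text):
    """Return the first Public-Feeds value, or None if absent."""
    for field, value in parse_robots(text):
        if field == "public-feeds":
            return value
    return None

robots = """\
User-agent: *
Disallow: /private/

Site-Index:
Public-Feeds: myPublicFeeds.opml
"""
print(public_feeds(robots))  # myPublicFeeds.opml
```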
Of course, this doesn't help much if you're talking about folder-level data, since robots.txt exists only at the root of the domain. But at the very least, a client could read the root file to determine the file name, then look for that file name in the current folder.
For instance, if browsing example.com/folder, your browsing application of
choice reads example.com/robots.txt and finds that the public feeds are
stored in myPublicFeeds.opml, so it looks in
example.com/folder/myPublicFeeds.opml for the data. If you want to get data
below or above the current location, apply the same logic - traverse the
folder structure and get the named file.
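Under that assumption, the folder-level lookup amounts to joining the file name advertised at the root onto whatever folder is being browsed. A minimal sketch:

```python
from urllib.parse import urljoin

def feed_url_for(folder_url, feed_name):
    """Build the candidate feed location for the folder being browsed.

    feed_name is the value found in the root robots.txt (e.g. the
    Public-Feeds entry); the same name is then tried in the current
    folder.
    """
    if not folder_url.endswith("/"):
        folder_url += "/"
    return urljoin(folder_url, feed_name)

print(feed_url_for("http://example.com/folder", "myPublicFeeds.opml"))
# http://example.com/folder/myPublicFeeds.opml
```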
This might be preferred in some cases, where file names should be
standardized across the domain. In other situations, perhaps an alternative
to allow for differences in folders:
Site-Index: folder
Public-Feeds: myOtherFeeds.opml
Site-Index: another
Public-Feeds: evenMoreFeeds.opml
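One way to read that variant: each Site-Index line opens a block for a folder (an empty value standing for the site root), and the Public-Feeds line that follows applies to that folder. A sketch with those block semantics assumed:

```python
def parse_site_index(text):
    """Map each Site-Index folder to its Public-Feeds file name.

    An empty Site-Index value is taken to mean the site root ("").
    """
    feeds = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "site-index":
            current = value  # "" means the root
        elif field == "public-feeds" and current is not None:
            feeds[current] = value
    return feeds

robots = """\
Site-Index: folder
Public-Feeds: myOtherFeeds.opml
Site-Index: another
Public-Feeds: evenMoreFeeds.opml
"""
print(parse_site_index(robots))
# {'folder': 'myOtherFeeds.opml', 'another': 'evenMoreFeeds.opml'}
```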
Or even add include functionality to the file:
Site-Index: include folder/robots.txt
And let each subdivision of the domain create its own file, which gets
read into the "master" as the data is parsed. Naturally this could create a
whole bunch of crawling to get all the data, so this last idea might not be
the best - but it could be there for those who want the functionality at the
cost of the bandwidth/resources required. What's more, the include allows
for different file names in different folders. Only the top-level
robots.txt is "standardized", and that file is already there in most cases.
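The include variant could be sketched as a recursive parse with a depth cap, which bounds the extra crawling mentioned above. The fetcher here is a stand-in for an HTTP GET, and the "include" keyword handling is an assumption based on the example record:

```python
def collect_feeds(path, fetch, depth=3):
    """Gather Public-Feeds entries, following 'Site-Index: include'
    lines into sub-files, at most `depth` levels deep.

    `fetch` maps a path to robots.txt text (in real use, an HTTP GET).
    """
    if depth < 0:
        return []
    feeds = []
    for line in fetch(path).splitlines():
        if ":" not in line:
            continue
        field, value = (p.strip() for p in line.split(":", 1))
        field = field.lower()
        if field == "site-index" and value.lower().startswith("include "):
            sub = value.split(None, 1)[1]
            feeds.extend(collect_feeds(sub, fetch, depth - 1))
        elif field == "public-feeds":
            feeds.append(value)
    return feeds

# Stub "site": paths mapped to file contents, standing in for HTTP.
site = {
    "robots.txt": "Public-Feeds: myPublicFeeds.opml\n"
                  "Site-Index: include folder/robots.txt\n",
    "folder/robots.txt": "Public-Feeds: myOtherFeeds.opml\n",
}
print(collect_feeds("robots.txt", lambda p: site.get(p, "")))
# ['myPublicFeeds.opml', 'myOtherFeeds.opml']
```

The depth cap is the important part: without it, a careless (or malicious) chain of includes could send a client crawling indefinitely.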
If these are all really bad ideas, I blame Mark's medicine. :)
---
Chad Everett
yahoogroups@jayseae.cxliv.org