
Re: [syndication] Robot Discovery



Hey Julian,

It's definitely not new; it's the approach of robots.txt, as you
mention, as well as P3P's "well-known location", /w3c/p3p.xml. 
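
Just to make the pattern concrete, this is all a client has to do (a
rough Python sketch, nothing normative; the two paths are the only
fixed part):

import urllib.request
import urllib.error

WELL_KNOWN = ["/robots.txt", "/w3c/p3p.xml"]

def probe(site):
    # Try each well-known path; note the Content-Type of whatever answers.
    found = {}
    for path in WELL_KNOWN:
        url = site.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                found[path] = resp.headers.get("Content-Type")
        except urllib.error.URLError:
            pass  # 404, connection refused, etc.: not published here
    return found

print(probe("http://example.org"))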

The problem is that it's based on a site being the unit of interest;
i.e., sites that have a large number of users (geocities, etc.) don't
have a means of subdividing the site into appropriate chunks.
Conversely, an entity with a large number of sites (e.g., AT&T) might
need a centralised metadata registry (and metadata is the way to look
at this; e.g., rssfile=/news.rss). I spent some time working on a
general way to apply metadata to different parts of Web sites, which
ended up as URISpace[1] - I'd be curious to hear your thoughts about
it in this context.
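
The flavour of it, very roughly (this is not URISpace syntax -- see
[1] for that -- just the shape of the idea: bind metadata to regions
of the URI space rather than to the site as a whole):

BINDINGS = [
    ("/", {"rssfile": "/news.rss"}),
    ("/users/julian/", {"rssfile": "/users/julian/blog.rss"}),
]

def metadata_for(path):
    # Longest-prefix match: the most specific binding wins.
    best, best_len = {}, -1
    for prefix, meta in BINDINGS:
        if path.startswith(prefix) and len(prefix) > best_len:
            best, best_len = meta, len(prefix)
    return best

print(metadata_for("/users/julian/2001/10/03"))  # per-user feed
print(metadata_for("/press/index.html"))         # falls back to the site-wide one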

There's been a fair amount of work on resource and metadata discovery
in different contexts, but nothing Web-like has really shown up yet.
Discovery/Directories are squarely on the roadmap for the Web, but
they're still fairly far off, AFAIK.

Some people grumble that well-known locations 'bless' a particular
URI as special, which is (according to them) bad, but P3P seems to be
getting through the W3C despite this. Web Services also throw things
for a bit of a loop, especially with efforts like UDDI.

I need to clean the kitchen and go to bed, as I'm rambling now, but
generally, if architected as a generic site metadata binding
mechanism, this could be a good approach. It doesn't solve all of the
problems (and one issue is keeping the site and the metadata in sync,
if they're separate, as well as managing the metadata in general),
but it's a useful tool. 
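
For what it's worth, here's roughly how I picture a spider consuming
the discovery.xml you sketch below; the file name, element names and
sample entries are all invented, purely to illustrate:

import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<discovery>
  <entry href="/mainnews.rss"/>
  <entry href="/subscriptions.opml"/>
  <entry href="/feedlist.ocs"/>
</discovery>"""

def entries(xml_text):
    # Each entry is just a URL; the spider fetches it and sniffs the type
    # to decide whether to recurse (e.g. an OCS list pointing at more feeds).
    root = ET.fromstring(xml_text)
    return [e.get("href") for e in root.findall("entry")]

for href in entries(SAMPLE):
    print("would fetch and sniff:", href)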

There seem to be more ideas flying around here lately... will be
interesting to see where they stick.

Cheers and g'night (even tho it's morning there),


1. http://www.w3.org/TR/urispace.html


On Wed, Oct 03, 2001 at 07:38:57AM +0100, Julian Bond wrote:
> This is not thought through but bear with me. It's a response to one of
> the ideas floating round here about a consistent way to discover if a
> site publishes RSS. As we were talking about this it occurred to me that
> this problem is not limited to RSS. A site might well have many XML
> based files available. It might also publish many XML based web
> services. At the moment, the emphasis is all on aggregators and indexers
> trying to locate these and on builders promoting them. Perhaps a
> standard way for builders to publish their existence would turn this
> on its head.
> 
> Imagine a discovery.xml somewhat similar to a robots.txt. This would be
> a single file in the root of the website that listed all the xml
> available at that site. Each entry would consist of a single parameter,
> being the URL of the xml service, or the URL of a deeper list. A spider
> reading this would then have to look at each one to determine its type
> and perhaps use that to go off and look further. So we might have:- 
> discovery.xml => mainnews.rss
>               => subscriptions.opml
>               => sitemeta.dc
>               => feedlist.ocs => subcategory.rss
>               => servicelist.wsdl => getstockquote
> 
> Now like all standards, for this to work it would need very widespread
> implementation. I suspect there are plenty (i.e. >1) of potential formats
> already available. I can also see problems where the individual entries
> are not single files, but CGI scripts with multiple parameters.
> 
> Is this something new? Or am I just re-hashing work that's already under
> way?
> 
> -- 
> Julian Bond    email: julian_bond@voidstar.com
> CV/Resume:         http://www.voidstar.com/cv/
> WebLog:               http://www.voidstar.com/
> HomeURL:      http://www.shockwav.demon.co.uk/ 
> M: +44 (0)77 5907 2173  T: +44 (0)192 0412 433
> ICQ:33679568 tag:So many words, so little time

-- 
Mark Nottingham
http://www.mnot.net/