Re: [syndication] shared feed lists
* Dave Winer (dave@userland.com) [031014 08:54]:
> First let's take out the emotionally charged words, blindly, waste, clog up,
> etc.
>
> Do the math. I answered this question in the Q&A. I don't know how to answer
> it again without just repeating the answer.
>
> But let's try anyway. ;->
>
> Assume you look for a link to the directory file in the HTML of the home
> page of the site.
>
> To find the directory, you:
>
> 1. Read the index file.
>
> 2. Look for the link element.
>
> 3. Read the directory file it points to.
>
> In the approach I'm advocating you:
>
> 1. Read the directory file.
>
> Now please explain why the first approach is more efficient.
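For concreteness, here's the fetch-count difference in a minimal
Python sketch (the rel value "subscriptions" and the fixed filename
default.opml are my illustrative assumptions, not anything this
thread has standardized):

    import urllib.request
    from urllib.parse import urljoin
    from html.parser import HTMLParser

    class LinkFinder(HTMLParser):
        """Collect href values from <link rel="subscriptions"> tags."""
        def __init__(self):
            super().__init__()
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "link" and a.get("rel") == "subscriptions":
                self.hrefs.append(a.get("href"))

    def via_link_element(site):
        # Approach 1: fetch the home page, find the link element,
        # then fetch the directory file it points to (two fetches).
        page = urllib.request.urlopen(site).read().decode("utf-8", "replace")
        finder = LinkFinder()
        finder.feed(page)
        if finder.hrefs:
            return urllib.request.urlopen(urljoin(site, finder.hrefs[0])).read()
        return None

    def via_well_known_name(site):
        # Approach 2: one fetch of a fixed name under the root.
        return urllib.request.urlopen(site.rstrip("/") + "/default.opml").read()

Granted: the second is one fetch where the first is two. The fetch
count was never my objection, though.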
Let me pose a question which may help illuminate why I think some people
(including myself) find your proposal far less than satisfactory:
How did you, the searcher for feeds, reach the site in question?
I suggest that there are 3 possible answers to this question:
(a) you traversed a link from a document written in some parseable
structured format
(b) you're just sweeping through IP addresses trying to find what's
on the other side
(c) you were given a URL out-of-band (say in an email, over the
phone, or otherwise), i.e., not derived from parsing a structured
document.
I'm going to posit first that case (b) is not one which should be
driving our concerns here. Most users are not sweeping IP space for
info, and to those who are: tough. It should continue to be our goal
to architect information systems from which we can derive meaning, not
to build for the mindless slurping of data by those refusing to use the
powerful tools already widely deployed. By stepping outside the realm
of structured protocols and dialects they have chosen to forgo the
meaning that structure carries, and there's no pragmatic reason to
cater any further to their convenience.
Then I'm going to posit that case (c) will be in the minority, with
its own distinct use cases. It should be apparent that most web accesses
for the foreseeable future are going to be from case (a), due to the
relative difficulty in collecting addresses for case (c) document
retrieval. The difference in use case, however, is this: when accessing
a document from a URL obtained out-of-band, it is more than reasonable to
expect that the accessor has no idea what documents may lie at that
site, nor what contents they may contain. The accessor must retrieve at
least one document, possibly many, in order to determine what useful
information may lie on the other side. Some possibilities include:
- /
- robots.txt
- favicon.ico
- index.html
- default.opml
Of these five, the only document which can reasonably be expected to
exist in most cases is '/'. For the other four there are questions and
assumptions built in:
- Assumption: This site is likely to provide the sort of content I'm
after.
- Question: How should the URL be modified to maximize the likelihood
of finding this document?
- Question/Assumption: Does not finding the document at this URL mean
it's not there, or can I munge the URL and find it elsewhere?
- Assumption: Not finding this document means that {site doesn't
support foo, some default policy prevails, etc.}.
- Assumption: This URL represents a single "site" with some coherent
policy enforced/enabled by its owner.
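To make the munging question concrete: a tool hunting for a
well-known file from an out-of-band URL ends up generating guesses
like this (a Python sketch; the filename default.opml is the one
under debate, while the walk-up-the-path heuristic is just one
plausible guessing strategy I've made up):

    from urllib.parse import urlsplit, urlunsplit

    def candidate_urls(url, name="default.opml"):
        """Walk from the URL's directory up toward the server root,
        yielding guesses at where a well-known file might live."""
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        # Drop a trailing filename component, if any.
        if segments and "." in segments[-1]:
            segments = segments[:-1]
        while True:
            path = "/" + "/".join(segments + [name])
            yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))
            if not segments:
                break
            segments = segments[:-1]

For a URL like http://example.org/archives/2003/oct/index.html that
yields four candidates, of which as many as three are likely 404's
against somebody else's server.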
In short, when dealing with case (c) usage, the accessor of documents
doesn't know much and:
(1) is a robot blindly stumbling around scooping up everything in its
path, making sure that it doesn't chew up bandwidth unnecessarily
(think of this as the "robots.txt grandfather clause")
(2) will access '/' and look for structured information which will
tell the accessor where it's likely to find specific resources related
to the specific URL it accessed
(3) will stumble around in the dark rudely poking at likely non-existent
documents that may have only passing relevance to the resources at the
original URL.
Now, why a non-stumbling-robot would argue that we should be designing
for behavior (3) instead of behavior (2) is presumably a matter for the
neurologists, which is not my area of specialization.
Going back to our original 3 cases (a, b, and c): the remaining case of
concern, (a), comprises the bulk of non-stumbling-robot document
accesses. Accessors are coming in via a link from a structured
document. The only non-moronic thing to do is to load the document at
the end of the URL you pulled from the parsed document. If you find
resources in there that are of interest, then they are almost certainly
highly related to the document you arrived from.
If instead, no such resources are found:
- They probably don't exist: the author of the document in question is
the person best positioned to know where such resources are, and
since they didn't point to any within the document, it stands to
reason that none exist -- or:
- The author doesn't want you to know where such resources are located.
It is possible that any such resources are unrelated to the document
you're viewing.
- It would be moronic, in just the same way argued above, to then begin
drunkenly stumbling around the webserver racking up 404's to find
something likely not to be there -- or likely not to be relevant.
Dave, you're arguing that we design a protocol for robots or stumbling
morons, when everyone else agrees that a protocol usable even by the
below-average would suffice.
On an earlier Web where every site was (at least assumed to be)
monolithic, topical, and had a single owner, stuffing a file with a
known name somewhere around what a consensus agrees looks most like '/'
might have been a reasonable idea. I was sick of that version of the
Web back in 1997, and fortunately so were a lot of other people.
I've got documents on the Web now that were created before there was a
Web to put them on (literally). I threw out Commodore VIC-20 disks
three years ago, but I'm also bringing more and more data online and I'm
not creating a new "web site" every time I create a new archive of
information. Barring catastrophic data loss (knock on wood) I expect to
only add to my online collection, perhaps automatically reformatting
documents occasionally.
New document formats and standards are making it possible for us to
conceptually group documents and place things like RSS feeds at
arbitrary but useful points throughout the collection (it's my belief
that moving to syncato-like systems will expand that flexibility even
further).
Creating "lists of feeds" or "feeds of feeds" is an idea I think we can
all get behind; but if, 5 years from now, while loading in a new set of
video archives, I notice that 50% of my 404's are from tools implementing
this dumbass idea that Dave Winer pushed back in 2003, the 6amwakt [0]
is gonna roll and administer a well-deserved cockpunch.
[0] http://www.rickbradley.com/tour/
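For the record, consuming a list of feeds is the easy part once a
link you actually parsed has handed you its URL (a Python sketch,
assuming the list is OPML using the conventional type="rss"/xmlUrl
outline attributes):

    import urllib.request
    import xml.etree.ElementTree as ET

    def feed_urls(opml_url):
        """Return the feed URLs listed in an OPML subscription list."""
        doc = ET.parse(urllib.request.urlopen(opml_url))
        return [o.get("xmlUrl")
                for o in doc.iter("outline")
                if o.get("type") == "rss" and o.get("xmlUrl")]

Nothing in that requires the list to sit at a fixed name under '/'.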
(I'll wait for thread death before updating Dave's supporting
documentation)
Rick
--
http://www.rickbradley.com MUPRN: 424
| on the intake
random email haiku | manifold. If the gasket
| overhangs trim it.