RSS 0.91 - Missing Vital Metadata
While RSS 0.91 is extremely powerful, it strikes me as missing two
vital pieces of metadata:
1) Ordering method
2) Categorisation
Ordering method
---------------
RSS defines a list of items, or more specifically an ordered
sequence. But what is the ordering criterion?
Weblogs and news are ordered by time. Most current RSS channels fall
into this category.
Top 10 lists are ordered by a popularity measure. Some examples might
be "Letterman's top 10 reasons for ...", "top-selling CDs", "most
popular pages". There is a sprinkling of these channels.
Other lists are ordered by degree of match. For example the results
of a search might be presented in this manner.
To allow the encoding of this data, I propose the following:
<ordering>time</ordering>
Other values: none, top, match
A simple example: I gather several RSS streams about computer books.
Using this new <ordering> item, I can automatically distinguish "top
books" from "new books". I can merge multiple "new books" streams
together, removing duplicates. On the other hand, I can merge "top
books" streams together, weighting elements by duplication and order
within each stream.
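To make the merging concrete, here is a minimal Python sketch. It
assumes the proposed <ordering> element; the two feeds, their titles
and their links are invented placeholders. It merges two time-ordered
channels while removing duplicate items:

```python
import xml.etree.ElementTree as ET

# Two invented "new books" feeds carrying the proposed <ordering>
# element; titles and links are placeholders.
FEED_A = """<rss version="0.91"><channel>
<ordering>time</ordering>
<item><title>Book One</title><link>http://example.com/1</link></item>
<item><title>Book Two</title><link>http://example.com/2</link></item>
</channel></rss>"""

FEED_B = """<rss version="0.91"><channel>
<ordering>time</ordering>
<item><title>Book Two</title><link>http://example.com/2</link></item>
<item><title>Book Three</title><link>http://example.com/3</link></item>
</channel></rss>"""

def merge_time_ordered(*feeds):
    """Merge channels that declare <ordering>time</ordering>,
    dropping items whose <link> has already been seen."""
    seen, merged = set(), []
    for feed in feeds:
        channel = ET.fromstring(feed).find("channel")
        if channel.findtext("ordering") != "time":
            continue  # skip channels with a different ordering
        for item in channel.findall("item"):
            link = item.findtext("link")
            if link not in seen:
                seen.add(link)
                merged.append(item.findtext("title"))
    return merged

print(merge_time_ordered(FEED_A, FEED_B))
```

Merging "top" streams would instead weight each title by how many
feeds it appears in and its position within each feed.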
Categorisation
--------------
Content aggregators need to be able to categorise their content, or
risk providing extremely long lists of channels (like userland's :-( ).
How is a new user meant to select from a list of 2500 channels
presented as a flat list?
Unfortunately categorisation is EXTREMELY hard to do across a broad
range of subjects, in a way that suits most people.
Rather than define the one true categorisation schema and taxonomy, I
think we should permit the channel author some flexibility, but still
allow content aggregators some real meat to kickstart their
categorisation.
I propose the following new element, by way of an example for an RSS
channel associated with a book on encryption software:
<category>
<method>yahoo.com</method>
<value>Computers_and_Internet/Internet/World_Wide_Web/Security_and_Encryption</value>
<value>Business_and_Economy/Shopping_and_Services/Books/Booksellers/Computers/Internet/Titles/World_Wide_Web</value>
</category>
<category>
<method>dmoz.org</method>
<value>Computers/Security/Products_and_Tools/Cryptography/</value>
<value>Business/Industries/Publishing/Publishers/Nonfiction/Computers</value>
</category>
Notes:
1) You can have multiple <value> items in each <category>.
2) You can have multiple <category> items.
3) Users can define their own methods. Yahoo and DMOZ are
recommended, with DMOZ the more strongly recommended.
4) The <value> string is a list of "/"-separated values, from
broadest to most specific.
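As a rough sketch of how an aggregator might consume the proposed
(hypothetical) element, here is Python code that pulls the method and
the path segments out of the DMOZ example above:

```python
import xml.etree.ElementTree as ET

# One of the proposed (hypothetical) <category> blocks from above.
CATEGORY_XML = """<category>
<method>dmoz.org</method>
<value>Computers/Security/Products_and_Tools/Cryptography/</value>
<value>Business/Industries/Publishing/Publishers/Nonfiction/Computers</value>
</category>"""

def extract_category(xml_text):
    """Return (method, list of paths), each path a list of segments
    ordered from broadest to most specific."""
    cat = ET.fromstring(xml_text)
    method = cat.findtext("method")
    paths = [[seg for seg in value.text.split("/") if seg]
             for value in cat.findall("value")]
    return method, paths

method, paths = extract_category(CATEGORY_XML)
print(method)    # dmoz.org
print(paths[0])  # broadest-to-narrowest segments of the first path
```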
Now the pedantic among you will probably disagree with me as to the
ideal place in Yahoo and DMOZ to categorise this content. But this is
missing the point. The point is that armed with the above data, the
job of classifying this RSS document in any new category tree is made
vastly simpler.
Even if my category tree does not align precisely with Yahoo or DMOZ,
there is going to be some overlap. And the <value> string contains
some good keywords, which I can disambiguate using WordNet or similar
to automatically align with my own arbitrary hierarchy.
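Extracting those keywords from a hypothetical <value> path is
straightforward; this sketch shows only the keyword step and leaves
the WordNet (or similar) alignment out:

```python
# A sketch only: turn a hypothetical <value> path into plain lowercase
# keywords that a later WordNet-style alignment step could consume.
def path_keywords(value):
    words = []
    for segment in value.split("/"):
        # split each path segment on underscores into individual words
        words.extend(w.lower() for w in segment.split("_") if w)
    return words

print(path_keywords("Computers/Security/Products_and_Tools/Cryptography/"))
# ['computers', 'security', 'products', 'and', 'tools', 'cryptography']
```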
For aggregation portals targeting narrow niches, it is a simple job
to find relevant RSS channels using a hand-compiled list of relevant
paths on Yahoo and DMOZ.
The presence of these new items should not upset existing RSS clients
(I hope they have been coded to ignore unknown elements).
Perhaps it would be clearer with an explanation of what I am trying
to do with RSS, and where I have been struggling to apply it.
I publish a vertical search portal.
http://www.growinglifestyle.com/h/garden/index.html
Currently it covers 2 topics (more are coming), one of which is
gardening.
I scrape the top gardening web sites for articles (and only the
articles), and assemble them in a categorised hierarchy. So the user
can browse the hierarchy (like Yahoo), or do a full-text search (like
Altavista). But no matter which way they look, they will only get
quality articles on gardening.
What I have just done is add an RSS file at each node (well, a few
thousand nodes anyway) of this hierarchy. For example, there is an
RSS file for "gardening", for "plants", for "bulbs" and for "tulips"
(progressively narrower topics). Each of these RSS files is a weblog,
displaying a time-ordered sequence of articles being added to the
tree. I'm adding about 1000 articles a week at the top level, so as
you go down the tree the RSS files get quieter, until the final nodes
may only get one article added every month or two.
Why have I created so many RSS files? Well, not everybody is
interested in everything. In effect, you customise an RSS feed to
suit your needs. If all you are interested in is "Dahlias", then
that is all you will get. And publishing it in RSS makes re-purposing
the content so much easier.
Actually, I am even thinking of adding an RSS file for every possible
search phrase. In this case, the RSS file would be ordered by rank
rather than time. Would you subscribe to such an RSS channel?
Probably not as it would not change so often, but you might want to
fetch it on demand. For example, a shopping site might want to
display articles about each of their products. They could create a
unique URL containing the keywords and phrases, grab the RSS file
corresponding to that search, and display it using an RSS-reading
content module.
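A sketch of what such an on-demand URL might look like. The
/rss/search endpoint and its q/order parameter names are entirely
invented here, just to illustrate the shape:

```python
from urllib.parse import urlencode

# Hypothetical endpoint, invented for illustration only.
BASE = "http://www.growinglifestyle.com/rss/search"

def search_feed_url(*keywords):
    """Build the URL of a per-search, rank-ordered RSS file."""
    return BASE + "?" + urlencode({"q": " ".join(keywords),
                                   "order": "match"})

print(search_feed_url("tulip", "bulbs"))
# http://www.growinglifestyle.com/rss/search?q=tulip+bulbs&order=match
```

A shopping site could bake such a URL into a page template and let an
RSS-reading content module do the fetching and rendering.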
So what is the problem?
Well how can I let other sites know the RSS channels I offer?
It is not really a good idea to add several thousand narrow channels
to userland, and then have userland hammer my site every hour.
I could create (and am creating) an OCS description, but this does
not describe the categorisation or even the hierarchy. And I run the
risk of having aggregators blindly add every single available RSS
channel, and fetch them all hourly.
I could enter (and am entering) some of the more generally useful
channels into xmltree and userland. But unless people actually wander
all over my site, they will not be aware of the RSS customisation
possibilities available.
Another problem with content syndication by RSS is as follows. I am
adding around 1000 items per week, in around one update event per
week. Ideally the articles would be released in real time, but alas
it is computationally (and mentally) much less burdensome to do my
processing in batches.
RSS has an implied length of circa 15 items. I know this is not
fixed, but an RSS file with 1000 items is definitely considered
unfriendly (My.Netscape requests file sizes below 8kB). The problem
is that 990 of these new items will never make it onto my 10 element
RSS file. Thus 990 items will miss out on the opportunities for
content syndication and repurposing that RSS allows.
I am still thinking about the best way to solve this last problem.
Some possibilities:
a) Trickle-feed the RSS file. Instead of instantly acknowledging the
1000 new articles, the RSS generator could be spoon-fed a steady
dribble of articles (say 6 per hour, roughly 1000/week). Clients
reading the RSS file every hour will get a chance to see all the new
articles.
b) Track the IP address of clients reading the RSS file. Feed each IP
address all the articles added since the last time they read the
file, with some upper limit. So after a few big gulps of new articles
the RSS file will settle down to a list of 10.
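Option (a) might be sketched like this in Python. The per-hour and
window numbers are the ones from the text; the class and its names
are invented for illustration:

```python
from collections import deque

class TrickleFeed:
    """Option (a): release a backlog of new articles into the public
    RSS file a few at a time, so hourly readers eventually see all of
    them. Names and numbers are illustrative only."""

    def __init__(self, per_hour=6, window=10):
        self.backlog = deque()               # articles not yet published
        self.per_hour = per_hour             # dribble rate
        self.visible = deque(maxlen=window)  # what the RSS file shows

    def add_batch(self, items):
        """A weekly batch of ~1000 new articles arrives at once."""
        self.backlog.extend(items)

    def tick(self):
        """Called once an hour: move a dribble of items into view."""
        for _ in range(min(self.per_hour, len(self.backlog))):
            self.visible.appendleft(self.backlog.popleft())
        return list(self.visible)

feed = TrickleFeed(per_hour=6, window=10)
feed.add_batch(["article-%d" % i for i in range(1000)])
print(feed.tick())  # the first six articles, newest first
```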
Neither of these approaches strikes me as being particularly clean.
I hope this stimulates some discussion about the application of RSS
to search engines, instead of just the traditional areas of blogs and
news feeds.
Steve