New version of ext_aliases.xml file is available...
- Subject: New version of ext_aliases.xml file is available...
- From: "Carmen" <chv@xxxxxxxxx.xxxx>
- Date: Mon, 4 Oct 1999 21:07:37 -0700
Hello Everyone,
I have just updated the ext_aliases.xml file to reflect changes
and additions to various syndication lists. The file has grown
from 1517 to 1694 lines, and the number of aliases has grown from
296 to 328.
The file can be found at http://www.vertexdev.com/ext_aliases.xml
More information on the aliases file can be found in the attached
message. Basically, the file exists to make it possible to
recognize aliased syndication URLs that ultimately refer to
the same root data source.
If you are using this file, or if you have plans to use it,
I would really like to hear back from you. The file is a
natural byproduct of my work on Headline Viewer, but I do want
to know that what I do is of use to someone besides my own
users.
Enjoy,
Carmen
Try Headline Viewer at http://www.vertexdev.com/HeadlineViewer
Hello Everyone,
I just released version 0.8.9 of Headline Viewer. One of the
things that I fixed was what I call the "Aliasing Problem."
Basically, I found that if I take the various provider lists
that are built in and that other sites make available for
downloads, I end up with a lot of duplicates even if I
filter out sites with identical URLs to their content.
One reason for the problem is that there are both direct
and indirect routes to the data. Also, some sites are
syndicating the same content under different URLs. Sometimes
the URLs are wholly different and other times they are just
slightly different.
I tried (and failed) to come up with a fully automated
solution to the problem. There is just no way that a program
can "know" which providers are truly aliases for each other.
So, instead, I came up with the concept of the alias list.
Basically, the alias list (which can be found at
http://www.vertexdev.com/ext_aliases.xml) is a simple XML
file that contains a bunch of entries that look like this:
<alias>
<url>http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl</url>
<url>http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl?mode=rss</url>
<url>http://www.bized.ac.uk/roads/ns_channel.rdf</url>
<url>http://theweb.startshere.net/channels/157/RSS91.XML</url>
</alias>
Each URL in a single <alias> is effectively an alias for its
sibling URLs, and I consider the siblings to be duplicates. I
built the alias list by hand (it currently has 296 entries) this
past weekend. I will be updating this file on a regular basis. I
may add a "mod date" or a serial number to the top level <aliases>
node.
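For anyone wanting to read the file from their own program, here is
a minimal sketch in Python. It is not how Headline Viewer does it. It
assumes only what is described above: a top level <aliases> node
holding <alias> entries, each with several <url> children. The
function names and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Small stand-in for ext_aliases.xml. The example.com URLs are hypothetical.
SAMPLE = """<aliases>
  <alias>
    <url>http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl</url>
    <url>http://www.bized.ac.uk/roads/ns_channel.rdf</url>
  </alias>
  <alias>
    <url>http://example.com/a.rss</url>
    <url>http://example.com/b.rss</url>
  </alias>
</aliases>"""


def load_alias_groups(xml_text):
    """Map each URL to the index of the <alias> entry that lists it."""
    root = ET.fromstring(xml_text)
    group_of = {}
    for index, alias in enumerate(root.iter("alias")):
        for url in alias.findall("url"):
            text = (url.text or "").strip()
            if text:
                group_of[text] = index
    return group_of


def are_aliases(group_of, a, b):
    """True if a and b are the same URL or appear in the same <alias> entry."""
    if a == b:
        return True
    group = group_of.get(a)
    return group is not None and group == group_of.get(b)


groups = load_alias_groups(SAMPLE)
```

A dictionary keyed by URL keeps each lookup cheap, which matters
once the list holds a few hundred entries.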
Right now the XML is really simple. It has no namespace specification.
If anyone wants to suggest changes that will make it look more
"official", go right ahead. I very specifically did not encode
the notion that any particular URL is "better" or "canonical".
It's first come, first served in Headline Viewer. Whichever URL
is seen first is the one that the user will see. It would be
fine for an application to choose the URL with the best information
quality (e.g. the Fat Scripting News format) but that is a different
issue.
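The first come, first served rule could be sketched roughly as
follows. This is an illustration in Python, not Headline Viewer's
actual code. It assumes the alias file has already been turned into
a dictionary from URL to alias entry number; the names GROUP_OF and
drop_aliased_duplicates and the example.com URL are hypothetical.

```python
# Hypothetical lookup built from the alias file: URL -> alias entry number.
GROUP_OF = {
    "http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl": 0,
    "http://www.bized.ac.uk/roads/ns_channel.rdf": 0,
}


def drop_aliased_duplicates(urls, group_of):
    """Keep the first URL seen from each alias entry, preserving order."""
    seen_urls = set()
    seen_groups = set()
    kept = []
    for url in urls:
        if url in seen_urls:
            continue  # exact duplicate of a URL already handled
        seen_urls.add(url)
        group = group_of.get(url)
        if group is not None:
            if group in seen_groups:
                continue  # an alias of this provider was seen earlier
            seen_groups.add(group)
        kept.append(url)
    return kept


providers = [
    "http://www.bized.ac.uk/roads/ns_channel.rdf",
    "http://example.com/news.rss",  # hypothetical, not in the alias list
    "http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl",
    "http://example.com/news.rss",
]
kept = drop_aliased_duplicates(providers, GROUP_OF)
```

URLs that do not appear in the alias list pass through untouched, so
an incomplete list only means that some duplicates are missed.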
This is definitely an imperfect solution. I wanted to solve the
problem from the viewpoint of a Headline Viewer user. They don't
want to see duplicates.
Please feel free to use this file for your application, to suggest
changes, and to contribute duplicates that you find. I do ask
that you exercise restraint in downloading the file (I am touchy
about this because some !@$^& is polling my site every 5 minutes
for updates and it skews all of my access statistics).
Enjoy,
Carmen
Try Headline Viewer at http://www.vertexdev.com/HeadlineViewer