
The Aliasing Problem and my Solution to it...



Hello Everyone,

I just released version 0.8.9 of Headline Viewer. One of the
things that I fixed was what I call the "Aliasing Problem."
Basically, I found that if I take the various provider lists 
that are built in and that other sites make available for 
download, I end up with a lot of duplicates even if I 
filter out sites with identical URLs to their content.

One reason for the problem is that there are both direct 
and indirect routes to the data. Also, some sites are
syndicating the same content under different URLs. Sometimes
the URLs are wholly different and other times they are just
slightly different.

I tried (and failed) to come up with a fully automated
solution to the problem. There is just no way that a program
can "know" which providers are truly aliases for each other.

So, instead, I came up with the concept of the alias list.
Basically, the alias list (which can be found at
http://www.vertexdev.com/ext_aliases.xml) is a simple XML
file that contains a bunch of entries that look like this:

  <alias>
    <url>http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl</url>
    <url>http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl?mode=rss</url>
    <url>http://www.bized.ac.uk/roads/ns_channel.rdf</url>
    <url>http://theweb.startshere.net/channels/157/RSS91.XML</url>
  </alias>

Each URL in a single <alias> element is effectively an alias for
its sibling URLs, and I treat the siblings as duplicates. I 
built the alias list by hand (it currently has 296 entries) this
past weekend. I will be updating this file on a regular basis. I
may add a "mod date" or a serial number to the top level <aliases>
node. 
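
For anyone who wants to consume the file, here is a rough Python
sketch of one way to read it. It is not Headline Viewer's code. The
function names and the inline sample are made up for illustration,
and it assumes the file is a top-level <aliases> node holding <alias>
groups shaped like the entry above.

```python
import xml.etree.ElementTree as ET

# Inline sample shaped like the entry quoted above. A real program
# would read the downloaded file instead of this string.
SAMPLE = """<aliases>
  <alias>
    <url>http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl</url>
    <url>http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl?mode=rss</url>
    <url>http://www.bized.ac.uk/roads/ns_channel.rdf</url>
    <url>http://theweb.startshere.net/channels/157/RSS91.XML</url>
  </alias>
</aliases>"""


def load_alias_groups(xml_text):
    """Map each URL to the index of the <alias> group that contains it."""
    root = ET.fromstring(xml_text)
    group_of = {}
    for group_id, alias in enumerate(root.iter("alias")):
        for url in alias.findall("url"):
            if url.text:
                group_of[url.text.strip()] = group_id
    return group_of


def are_aliases(group_of, a, b):
    """True if a and b are the same URL or siblings in one <alias> group."""
    if a == b:
        return True
    group_a = group_of.get(a)
    return group_a is not None and group_a == group_of.get(b)


groups = load_alias_groups(SAMPLE)
print(are_aliases(groups,
                  "http://www.bized.ac.uk/roads/ns_channel.rdf",
                  "http://theweb.startshere.net/channels/157/RSS91.XML"))
```

The lookup table keeps no notion of a preferred URL within a group,
which matches the file itself.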

Right now the XML is really simple. It has no namespace specification.
If anyone wants to suggest changes that will make it look more
"official", go right ahead. I very specifically did not encode
the notion that any particular URL is "better" or "canonical". 
It's first come, first served in Headline Viewer. Whichever URL
is seen first is the one that the user will see. It would be
fine for an application to choose the URL with the best information
quality (e.g. the Fat Scripting News format), but that is a different
issue.
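
The first-seen rule could look something like the Python sketch
below. Again, this is an illustration under assumptions, not the
actual Headline Viewer implementation: dedupe_providers and the
group_of table (URL mapped to an alias group number) are hypothetical
names, and the third URL is a made-up placeholder.

```python
def dedupe_providers(urls, group_of):
    """Keep the first URL seen from each alias group, in input order.

    URLs that appear in no alias group are only dropped when they are
    exact repeats of an earlier URL.
    """
    seen = set()
    kept = []
    for url in urls:
        # Aliased URLs are tracked by group; all others by the URL itself.
        key = ("group", group_of[url]) if url in group_of else ("url", url)
        if key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept


# Hypothetical lookup table: both URLs belong to alias group 0.
group_of = {
    "http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl": 0,
    "http://www.bized.ac.uk/roads/ns_channel.rdf": 0,
}

providers = [
    "http://www.bized.ac.uk/roads/ns_channel.rdf",
    "http://www.bized.ac.uk/roads/cgi-bin/new2rdf.pl",
    "http://example.com/news.rss",  # placeholder, not in any group
]

for url in dedupe_providers(providers, group_of):
    print(url)
```

An application that prefers the richest format would replace the
"first seen wins" step with its own ranking of the URLs in a group.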

This is definitely an imperfect solution. I wanted to solve the
problem from the viewpoint of a Headline Viewer user. They don't
want to see duplicates.

Please feel free to use this file for your application, to suggest
changes, and to contribute duplicates that you find. I do ask
that you exercise restraint in downloading the file (I am touchy
about this because some !@$^& is polling my site every 5 minutes
for updates and it skews all of my access statistics).

Enjoy,

Carmen

Try Headline Viewer at http://www.vertexdev.com/HeadlineViewer