xpath2rss

xpath2rss 0.7
by Mark Nottingham <mnot@pobox.com>


INTRODUCTION
------------
xpath2rss is Yet Another HTML->RSS scraper. This one's different in that
instead of using regular expressions, as most do, it uses XPath.

Why XPath?

The first reason to use XPath is to learn it; XPath is an excellent tool for
working with XML. Currently, it's used in XSLT and not too many other
places.

Secondly, XPath is smarter than regex because XPath is aware of XML's (and
therefore HTML's) syntax; this makes it a more natural fit for scraping HTML
and XML, and should make it more reliable.


REQUIREMENTS
------------
xpath2rss requires:
* Python 2.2 or newer  <http://www.python.org/>
* PyXML 0.7.1 or newer <http://pyxml.sf.net/>

xpath2rss has been tested on Linux, but should work on other platforms.
Installing Python and the required extensions is easy - see the
instructions included in each package.


CONFIGURATION
-------------
To use xpath2rss, you'll first need to create an XML file that contains the
information needed to fetch channels, including the XPath expressions. For
an example, see the included config.xml file.

** PLEASE do not use the example file in production; it points to real sites
** which would be inconvenienced by a large increase in load. It is included
** as a reference only.

The config file is XML, consisting of a root element, 'channels', in the
'http://www.mnot.net/xpath2rss/' namespace. It requires one attribute,
'output_base', which names a local directory that output is written to.
'channels' may have any number of 'channel' children.
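
As an illustration of the layout described above, here is a minimal Python 3
sketch that reads such a file with the standard library. It is not
xpath2rss's own code; the read_config name and the '/tmp/rss' directory are
assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the configuration vocabulary, as given above.
NS = "http://www.mnot.net/xpath2rss/"

# Hypothetical configuration; '/tmp/rss' is only a placeholder directory.
CONFIG = """\
<channels xmlns="http://www.mnot.net/xpath2rss/" output_base="/tmp/rss">
  <channel id="test"/>
  <channel id="other"/>
</channels>
"""

def read_config(text):
    """Return (output_base, list of channel ids) from a config document."""
    root = ET.fromstring(text)
    if root.tag != "{%s}channels" % NS:
        raise ValueError("root element must be 'channels' in the xpath2rss namespace")
    output_base = root.get("output_base")
    if output_base is None:
        raise ValueError("'channels' requires an 'output_base' attribute")
    # Only direct 'channel' children in the same namespace are considered.
    channel_ids = [c.get("id") for c in root.findall("{%s}channel" % NS)]
    return output_base, channel_ids

output_base, channel_ids = read_config(CONFIG)
print(output_base, channel_ids)
```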

'channel' elements (in the same namespace) embody the individual channels
to be scraped. Each may have the following attributes:
  - 'id': token;  a textual identifier for the channel; should only contain
    alphanumeric characters and any of '-' or '_' (REQUIRED)
  - 'reverse': boolean; if '1', the order of the items in the channel
    will be reversed before being written to RSS.
  - 'rm_args': space-separated list of tokens; if present, matching URI
    query arguments will be removed from the rss:link before being
    written to RSS.
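
To make the three attributes concrete, the following Python 3 sketch shows
one way they could be interpreted. It is an assumption-laden illustration,
not the tool's implementation: the function names are invented here, and
'rm_args' is assumed to match query arguments by name.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# 'id' should only contain alphanumeric characters, '-' or '_'.
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")

def is_valid_id(channel_id):
    """True if the whole string is an acceptable channel id."""
    return ID_PATTERN.fullmatch(channel_id) is not None

def strip_args(link, rm_args):
    """Drop the query arguments named in the space-separated rm_args list."""
    names = set(rm_args.split())
    parts = urlsplit(link)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in names]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

def apply_reverse(items, reverse):
    """Reverse the item order when the 'reverse' attribute is '1'."""
    return list(reversed(items)) if reverse == "1" else list(items)

cleaned = strip_args("http://www.example.org/story?id=42&session=abc&ref=rss",
                     "session ref")
print(cleaned)
```

Removing per-visit arguments such as session identifiers keeps the rss:link
stable between runs, so the same story is not seen as a new item each time.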

'channel' elements may contain children in arbitrary namespaces; these
elements will be used as metadata children in the RSS representation. For
example, if the configuration file contains:

  <channel id="test" xmlns="http://www.mnot.net/xpath2rss/"
                     xmlns:rss="http://purl.org/rss/1.0/">
    <rss:link>http://www.example.org/</rss:link>
    <rss:title>This is a test</rss:title>
  </channel>
  
the RSS will contain:

  <rss:channel xmlns:rss="http://purl.org/rss/1.0/">
    <rss:link>http://www.example.org/</rss:link>
    <rss:title>This is a test</rss:title>
  </rss:channel>
  
Note that the namespaces of the elements in the configuration file are
preserved, so that arbitrary metadata can be introduced. Channels MUST
have an 'rss:link' element, and SHOULD have an 'rss:title' element, where
the 'rss' prefix is declared as 'http://purl.org/rss/1.0/'.
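
The copying of metadata children can be sketched in Python 3 as follows.
This is only an approximation of the behaviour described above, written
with the standard library; the channel_metadata function is a name made up
for the example, and it assumes that elements in the xpath2rss namespace
(such as 'item') are configuration rather than metadata.

```python
import copy
import xml.etree.ElementTree as ET

X2R = "http://www.mnot.net/xpath2rss/"
RSS = "http://purl.org/rss/1.0/"

CONFIG = """\
<channel id="test" xmlns="http://www.mnot.net/xpath2rss/"
                   xmlns:rss="http://purl.org/rss/1.0/">
  <rss:link>http://www.example.org/</rss:link>
  <rss:title>This is a test</rss:title>
  <item path="//UL/LI/A"/>
</channel>
"""

def channel_metadata(channel):
    """Build an rss:channel element from a config 'channel' element."""
    rss_channel = ET.Element("{%s}channel" % RSS)
    for child in channel:
        # Assumption: xpath2rss's own elements are not copied as metadata.
        if child.tag.startswith("{%s}" % X2R):
            continue
        # The element keeps its original namespace, whatever it is.
        rss_channel.append(copy.deepcopy(child))
    if rss_channel.find("{%s}link" % RSS) is None:
        raise ValueError("channel MUST have an rss:link element")
    return rss_channel

rss_channel = channel_metadata(ET.fromstring(CONFIG))
print([child.tag for child in rss_channel])
```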

Each 'channel' element MUST have an 'item' child element in the same
namespace. 'item' elements contain the XPath patterns used to scrape the 
channel. They MUST have a 'path' attribute, which contains the XPath used
to isolate items from the channel.

The children of 'item' are used to populate the items. Each MUST have a
single attribute, 'path', which contains the XPath expression used to find
the desired information. For example, if the configuration contains:

  <channel id="test" xmlns="http://www.mnot.net/xpath2rss/"
                     xmlns:rss="http://purl.org/rss/1.0/">
    <rss:link>http://www.example.org</rss:link>
    <item path="//UL/LI/A">
      <rss:title path="descendant::text()" />
      <rss:link path="@HREF" />
    </item>
  </channel>

xpath2rss would find all of the matches for '//UL/LI/A' at
'http://www.example.org' and use them as items; for each item, it would use
the contained text as the rss:title, and the href attribute as the rss:link.
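
The same selection can be tried out in Python 3 with the standard library,
as in the rough sketch below. Several things here are assumptions: the
sample page is invented and already well-formed, ElementTree only supports a
subset of XPath (so the item path is written relative to the root, and
'descendant::text()' and '@HREF' are emulated with itertext() and get()),
and scrape_items is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Invented sample page, already well-formed with uppercase names.
PAGE = """\
<HTML><BODY>
  <UL>
    <LI><A HREF="/one.html">First <B>story</B></A></LI>
    <LI><A HREF="/two.html">Second story</A></LI>
  </UL>
</BODY></HTML>
"""

def scrape_items(page_text):
    """Return one dict per match of //UL/LI/A, with its title and link."""
    root = ET.fromstring(page_text)
    items = []
    # './/UL/LI/A' is the ElementTree spelling of the item path '//UL/LI/A'.
    for anchor in root.findall(".//UL/LI/A"):
        # Stand-in for path="descendant::text()": all text inside the match.
        title = "".join(anchor.itertext()).strip()
        # Stand-in for path="@HREF": the attribute value.
        link = anchor.get("HREF")
        items.append({"title": title, "link": link})
    return items

items = scrape_items(PAGE)
for item in items:
    print(item["title"], item["link"])
```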

This isn't as complex as it sounds; see config.xml.


WRITING XPATH EXPRESSIONS
-------------------------
When authoring the XPath expressions for a channel, these tips should be
useful:
  - HTML elements and attributes are normalized to uppercase; therefore,
    your expressions should always use uppercase for them.

  - XHTML elements and attributes, however, are always lowercase. As a
    result, expressions should use lowercase elements and attributes when
    the target resource is XHTML.

  - Try to find some distinguishing attribute of the portion of the source
    you're interested in, and use that to select your items.

  - Failing that, you may be able to use following-sibling and
    preceding-sibling to find the correct position in a document.
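
The first and third tips can be illustrated with a short Python 3 sketch.
It uses the standard library's ElementTree, which supports attribute
predicates but not the following-sibling or preceding-sibling axes, so the
last tip is not shown. The sample page and its CLASS values are invented for
the example.

```python
import xml.etree.ElementTree as ET

# Invented page with two lists; only the 'news' list holds stories.
PAGE = """\
<HTML><BODY>
  <UL CLASS="nav"><LI><A HREF="/about.html">About</A></LI></UL>
  <UL CLASS="news"><LI><A HREF="/one.html">First story</A></LI></UL>
</BODY></HTML>
"""

root = ET.fromstring(PAGE)

# Without a distinguishing attribute, navigation links are matched too.
all_links = [a.get("HREF") for a in root.findall(".//UL/LI/A")]

# Selecting on the CLASS attribute isolates the list of interest.
news_links = [a.get("HREF") for a in root.findall(".//UL[@CLASS='news']/LI/A")]

# Names are case-sensitive: a lowercase path matches nothing here.
lowercase_matches = root.findall(".//ul/li/a")

print(all_links, news_links, len(lowercase_matches))
```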

To help find the correct XPath expression, try using the included xpathtest
utility.

For more information about XPath, see:
  http://www.w3.org/TR/xpath.html
  http://www.zvon.org/xxl/XPathTutorial/General/examples.html
  http://www.vbxml.com/xsl/XPathRef.asp


USE
---
Once you've created a configuration file, run:

  % ./xpath2rss config.xml
  
There are a number of command-line options available; run
 
  % ./xpath2rss -h
  
for more information.

xpath2rss is designed to be run as a periodic process, using a utility such
as cron (on Unix and Unix-like systems). Generally, it is polite to run it
once an hour at most. It may be beneficial to move resources that change
less frequently into a separate configuration file, which is run less often.


LICENSE
-------
It's free. You can do what you want with it, except say that you wrote it.
Don't blame me if things go horribly wrong. Some people might not like you
scraping their content, so ask them first. Have fun.