RSS Content Serialization and Archival from HTML
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Permalink to this entry: http://www.peerfear.org/rss/permalink/1029646178.shtml
I just finished working on some code that I think is very cool!
Basically it allows a user to submit an arbitrary URL, and it produces RSS with
mod_content containing the article, without any tables or other content that
would obscure the presentation. It strips out active content, large images,
advertisements, tables, etc. and just displays the text and formatting.
The output is RSS 1.0 with mod_content encoded HTML.
For example:
http://reptile.peerfear.org/reptile/servlet/content/http/sportsillustrated.cnn.com/baseball/news/2002/08/16/strike_date_ap/
Will produce an RSS feed for the HTML at
http://sportsillustrated.cnn.com/baseball/news/2002/08/16/strike_date_ap/
This is all done by using the path info provided within the servlet. It is easy
to generate these URLs. Just take the source URL, replace 'http://' with
'http/', and append the result to http://reptile.peerfear.org/reptile/servlet/content/
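
To make the URL mapping concrete, here is a minimal sketch in Java. The class and method names (FeedUrls, toFeedUrl, fromPathInfo) are made up for illustration and are not taken from ContentServlet.java. The reverse mapping assumes the servlet sees a path info string of the form "/http/host/path", which is what HttpServletRequest.getPathInfo() would typically return for the example above.

```java
final class FeedUrls {
    // Servlet prefix quoted in the post.
    static final String SERVLET_BASE =
        "http://reptile.peerfear.org/reptile/servlet/content/";

    /** Source URL to feed URL: swap "http://" for "http/" and append to the base. */
    static String toFeedUrl(String sourceUrl) {
        String scheme = "http://";
        if (!sourceUrl.startsWith(scheme)) {
            // Assumption: only plain http:// sources are handled in this sketch.
            throw new IllegalArgumentException("not an http:// URL: " + sourceUrl);
        }
        return SERVLET_BASE + "http/" + sourceUrl.substring(scheme.length());
    }

    /** Path info (assumed to look like "/http/host/path") back to the source URL. */
    static String fromPathInfo(String pathInfo) {
        String prefix = "/http/";
        if (pathInfo == null || !pathInfo.startsWith(prefix)) {
            throw new IllegalArgumentException("unexpected path info: " + pathInfo);
        }
        return "http://" + pathInfo.substring(prefix.length());
    }
}
```

Note that a query string on the source URL would not survive in the path info, so this sketch only covers URLs without one.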
This works by using a novel algorithm that relies on the following assumptions:
- - Assumes that the content is HTML.
- - Assumes that the data we are looking for will be about one degree (one level)
away from a leaf node.
- - Assumes that the paragraphs we are looking for will be > 15 characters in
length.
- - Assumes that the only nested elements will be 'b', 'i', 'ul', etc.
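
The assumptions above can be sketched roughly as follows. This is not the code from ContentServlet.java, only one reading of the heuristic: an element counts as a paragraph if it holds text directly (so it sits one level above a leaf text node), contains only inline-style elements, and has more than 15 characters of text. The sketch uses the JDK's built-in DOM parser rather than JDOM, so it assumes well-formed XHTML input. The names ParagraphHeuristic, INLINE, and MIN_LENGTH are hypothetical, and the tags beyond 'b', 'i', and 'ul' in the allowed set are guesses at what "etc" covers.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ParagraphHeuristic {
    // 'b', 'i', 'ul' come from the post; the remaining tags are assumptions.
    static final Set<String> INLINE = new HashSet<>(
        Arrays.asList("b", "i", "ul", "li", "em", "strong", "a"));
    static final int MIN_LENGTH = 15;

    /** Returns the text of every element that looks like an article paragraph. */
    static List<String> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    private static void collect(Element el, List<String> out) {
        if (hasDirectText(el) && onlyInlineDescendants(el)) {
            String text = el.getTextContent().trim().replaceAll("\\s+", " ");
            if (text.length() > MIN_LENGTH) {
                out.add(text);
            }
            return; // treated as a paragraph candidate; do not descend further
        }
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                collect((Element) c, out);
            }
        }
    }

    // True if the element has a non-whitespace text node as a direct child.
    private static boolean hasDirectText(Element el) {
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE && !c.getNodeValue().trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // True if every element below el is in the allowed inline set.
    private static boolean onlyInlineDescendants(Element el) {
        NodeList all = el.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            if (!INLINE.contains(all.item(i).getNodeName().toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }
}
```

Layout wrappers such as table cells are skipped over in this sketch because they usually hold block elements rather than text directly. Real-world HTML would first need to be cleaned into well-formed markup before a DOM walk like this could run.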
The goal is to produce RSS content that can be used on devices with small
displays. This can also be used to archive articles presented via RSS and to
support mod_content for channels that aren't RSS 1.0 or don't yet support
mod_content.
The source is available [1] and can be used standalone from the command line if
necessary (you just need Java with JDOM and Jakarta Regexp).
I am pretty sure this works with most HTML. If anyone finds any sites that
aren't supported, I would appreciate knowing about it. I know for sure it
doesn't work with The Register because they are producing really bad HTML.
Anyway, this is going to be used with Bonita and Zoe (a thin RSS aggregator
I am working on).
In the future I am going to add:
- - The ability to extract the title and description.
- - The ability to emit valid XHTML mixed content instead of encoded HTML.
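
For readers unfamiliar with the difference, "encoded HTML" means the markup is entity-escaped so it travels as plain character data inside the RSS, whereas mixed content would embed the XHTML elements directly. A minimal sketch of the encoded form follows. It assumes the content:encoded element of the RSS 1.0 content module (namespace http://purl.org/rss/1.0/modules/content/); the post does not say which mod_content construct the servlet actually emits, and the class name EncodedContent is hypothetical.

```java
final class EncodedContent {
    /** Entity-escapes markup so it can sit inside an XML element as text. */
    static String escape(String html) {
        // '&' must be replaced first so later replacements are not double-escaped.
        return html.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Wraps escaped HTML in a content:encoded element (an assumed output form). */
    static String contentEncoded(String html) {
        return "<content:encoded>" + escape(html) + "</content:encoded>";
    }
}
```

With mixed content, the same paragraph would instead appear as real child elements of the item, which requires the extracted markup to be valid XHTML.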
1. http://www.openprivacy.org/cgi-bin/cvsweb/cvsweb.cgi/reptile/src/java/org/openprivacy/reptile/ContentServlet.java
- --
Kevin A. Burton ( burton@apache.org, burton@openprivacy.org, burton@peerfear.org )
Location - San Francisco, CA, Cell - 415.595.9965
Jabber - burtonator@jabber.org, Web - http://www.peerfear.org/
GPG fingerprint: 4D20 40A0 C734 307E C7B4 DCAA 0303 3AC5 BD9D 7C4D
IRC - openprojects.net #infoanarchy | #p2p-hackers | #reptile
Soylent Green is made from people!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)
Comment: Get my public key at: http://relativity.yi.org/pgpkey.txt
iD8DBQE9Xy4HAwM6xb2dfE0RAkU6AKCfx4/7meT4opKttCsCFH70JP29RACglrUN
O8azSSBjB665nkol582WMMs=
=fAEL
-----END PGP SIGNATURE-----