
Re: [syndication] Re: How to scrape?



"David Smiley" <dsmiley@mitre.org> writes:

> > The way I do it is that I load the text of the page into memory,
> > then use regular expressions to extract the proper
> > information. Then I just spit it out in RSS.
> > 
> > If you'd like some Tcl code to do this, I can send you some.
> 
> Yet another way is to use an XML parser and XPath.  There is a
> parser that comes with the Resin servlet engine
> (http://www.caucho.com) that parses HTML even though HTML isn't
> proper XML, or SGML for that matter.  The servlet engine includes an
> XPath library.

HTML Tidy[1] can clean HTML up into well-formed XML, which lets you
work with any XPath implementation.  Overall, I highly recommend
XPath: once you've loaded the HTML document, it lets you get right to
the element value you want:

  /HTML/BODY/TABLE/TR[3]/TD[1]
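
As a rough illustration of that kind of path lookup, here is a minimal
sketch in Python using only the standard library.  It assumes the page
has already been cleaned into well-formed XML with lowercase element
names and no namespace declaration (real Tidy output may carry an
XHTML namespace, which would need to be handled).  The SAMPLE document
and the cell_text helper are made up for the example, and ElementTree
supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed page (as a cleanup tool might produce).
SAMPLE = """<html><body><table>
<tr><td>Site</td><td>nav</td></tr>
<tr><td>Date</td><td>today</td></tr>
<tr><td>Headline</td><td>details</td></tr>
</table></body></html>"""


def cell_text(document, path):
    """Return the text of the first element matching path, or None."""
    root = ET.fromstring(document)  # root is the <html> element
    node = root.find(path)          # path is relative to the root element
    return None if node is None else node.text


# /HTML/BODY/TABLE/TR[3]/TD[1], written relative to <html> and in
# lowercase because XML element names are case-sensitive.
print(cell_text(SAMPLE, "body/table/tr[3]/td[1]"))  # prints "Headline"
```

The positional predicates are 1-based, so tr[3]/td[1] is the first
cell of the third row.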

Another, entirely different solution is Pyxie[2] and its file/stream
format, PYX.  Using any Pyxie HTML parser, you can convert HTML into a
drop-dead simple, line-oriented format that's really easy to process
with Unix-like filters, such as grep, awk, sed, Perl, and shell.
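
To give a feel for the line-oriented idea, here is a small sketch that
emits PYX-style lines from HTML using Python's standard html.parser.
This is not the Pyxie tool itself, only an approximation under stated
assumptions: "(" marks a start tag, "A" an attribute, "-" character
data, and ")" an end tag, with backslashes and newlines escaped.  The
names PyxEmitter, escape, and html_to_pyx are invented for the example.

```python
from html.parser import HTMLParser


def escape(text):
    """Keep each PYX record on one line by escaping backslash and newline."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")


class PyxEmitter(HTMLParser):
    """Collects one PYX-style line per parse event."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("(" + tag)
        for name, value in attrs:
            # Attributes without a value arrive as None; emit them as empty.
            self.lines.append("A%s %s" % (name, escape(value or "")))

    def handle_endtag(self, tag):
        self.lines.append(")" + tag)

    def handle_data(self, data):
        self.lines.append("-" + escape(data))


def html_to_pyx(html):
    """Return the PYX-style lines for an HTML string."""
    parser = PyxEmitter()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.lines


for line in html_to_pyx('<a href="x.html">News</a>'):
    print(line)
```

With output in this shape, something like grep '^-' would pull out
just the text lines, and grep '^Ahref' just the link targets.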

  -- Ken

[1] <http://www.w3.org/People/Raggett/tidy/>
[2] <http://Pyxie.org/>