Re: [syndication] Re: How to scrape?
"David Smiley" <dsmiley@mitre.org> writes:
> > The way I do it is that I load the text of the page into memory,
> > then use regular expressions to extract the proper
> > information. Then I just spit it out in RSS.
> >
> > If you'd like some Tcl code to do this, I can send you some.
>
> Yet another way is to use an XML parser and XPath. There is a
> parser that comes with the Resin servlet engine
> (http://www.caucho.com) that parses HTML even though HTML isn't
> proper XML, or SGML for that matter. The servlet engine includes an
> XPath library.
HTML Tidy[1] can clean HTML up into well-formed XML, which lets you
work with any XPath implementation. Overall, I highly suggest XPath:
once you've loaded the HTML document, it lets you get right to the
element value you want:
/HTML/BODY/TABLE/TR[3]/TD[1]
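As a rough sketch of that lookup, here is a version using Python's
standard xml.etree.ElementTree, which supports only a subset of XPath
and wants the path written relative to the root element. The sample
page, the cell_text name, and the upper-case element names are made up
for illustration, and the input is assumed to be already well-formed
(for instance after a pass through Tidy):

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed page; real HTML would need cleaning first.
PAGE = """<HTML><BODY><TABLE>
<TR><TD>Headline</TD><TD>Date</TD></TR>
<TR><TD>First story</TD><TD>Mon</TD></TR>
<TR><TD>Second story</TD><TD>Tue</TD></TR>
</TABLE></BODY></HTML>"""


def cell_text(markup, path="./BODY/TABLE/TR[3]/TD[1]"):
    """Return the text of the first element matching path, or None."""
    root = ET.fromstring(markup)  # root is the HTML element itself
    node = root.find(path)        # positions in [n] predicates are 1-based
    if node is None:
        return None
    return (node.text or "").strip()


print(cell_text(PAGE))
```

A full XPath engine would accept the absolute form shown above
directly; ElementTree is used here only because it ships with Python.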
Another, entirely different solution is Pyxie[2] and its file/stream
format, PYX. Using any Pyxie HTML parser, you can convert HTML into a
drop-dead simple, line-oriented format that's really easy to process
with Unix-like filters, such as grep, awk, sed, Perl, and shell.
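To give a feel for the line-oriented idea, here is a minimal sketch
that emits PYX-style lines using Python's standard html.parser as a
stand-in; it is not one of the Pyxie parsers. It assumes the usual PYX
line prefixes ("(" start tag, ")" end tag, "A" attribute, "-" character
data) with backslash escapes for newlines and tabs. PyxEmitter, escape,
to_pyx, and the sample markup are names invented for this example:

```python
from html.parser import HTMLParser


def escape(text):
    """Keep each PYX record on one line by escaping backslash, newline, tab."""
    return text.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t")


class PyxEmitter(HTMLParser):
    """Collects one PYX-style line per parse event (stand-in, not Pyxie)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("(" + tag)
        for name, value in attrs:
            # Attribute lines follow the start-tag line they belong to.
            self.lines.append("A%s %s" % (name, escape(value or "")))

    def handle_endtag(self, tag):
        self.lines.append(")" + tag)

    def handle_data(self, data):
        self.lines.append("-" + escape(data))


def to_pyx(markup):
    parser = PyxEmitter()
    parser.feed(markup)
    parser.close()
    return parser.lines


# Hypothetical input; the URL is a placeholder.
SAMPLE = '<p>See <a href="http://example.org/feed">the feed</a></p>'
lines = to_pyx(SAMPLE)
print("\n".join(lines))

# The grep-like step: keep only the href attribute lines.
links = [line[len("Ahref "):] for line in lines if line.startswith("Ahref ")]
print(links)
```

Written to a file, the same output could be filtered with something
like grep '^Ahref ', which is the appeal of the format.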
-- Ken
[1] <http://www.w3.org/People/Raggett/tidy/>
[2] <http://Pyxie.org/>