Re: How to scrape?
> The way I do it is that I load the text of the page into memory, then use
> regular expressions to extract the proper information. Then I just spit it
> out in RSS.
>
> If you'd like some Tcl code to do this, I can send you some.
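For anyone following along, the regex approach quoted above could look
roughly like the sketch below. It is in Python rather than Tcl, and the
sample page, the class="story" markup, the pattern, and the function names
are all made up for illustration. A real page would need its own pattern.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical page text; in practice it would be fetched over HTTP first.
PAGE = """
<html><body>
<a class="story" href="http://example.com/a">First &amp; foremost</a>
<a class="story" href="http://example.com/b">Second story</a>
</body></html>
"""

# Assumed markup: every headline is <a class="story" href="...">title</a>.
STORY_RE = re.compile(r'<a class="story" href="([^"]+)">([^<]+)</a>')


def extract_items(page_text):
    """Return (title, link) pairs matched by the regular expression."""
    items = []
    for link, title in STORY_RE.findall(page_text):
        # Decode HTML entities so the title is plain text.
        items.append((html.unescape(title).strip(), html.unescape(link)))
    return items


def to_rss(items, channel_title="Scraped feed", channel_link="http://example.com/"):
    """Render the items as a minimal RSS 2.0 document (a string)."""
    parts = [
        '<?xml version="1.0"?>',
        '<rss version="2.0">',
        "<channel>",
        f"<title>{escape(channel_title)}</title>",
        f"<link>{escape(channel_link)}</link>",
        "<description>Generated by a regex scraper</description>",
    ]
    for title, link in items:
        # Re-escape for XML so characters like & stay well-formed.
        parts.append(
            f"<item><title>{escape(title)}</title><link>{escape(link)}</link></item>"
        )
    parts += ["</channel>", "</rss>"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(to_rss(extract_items(PAGE)))
```

The weak point is the pattern: it breaks as soon as the site reorders
attributes or changes its markup, which is the main argument for using a
parser instead.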
Yet another way is to use an XML parser and XPath. There is a parser
that comes with the Resin servlet engine (http://www.caucho.com) that
parses HTML even though HTML isn't proper XML, or SGML for that
matter. The servlet engine includes an XPath library.
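
To give an idea of the parser-plus-XPath approach, here is a rough sketch.
It does not use Resin's parser or XPath library. As a stand-in it uses
Python's standard html.parser to build a tree from sloppy HTML, and the
limited XPath subset in xml.etree.ElementTree to query it. The sample page,
the id "stories", and the class and function names are assumptions made for
the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML (partial list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuildingParser(HTMLParser):
    """Feed tolerant HTML parser events into an ElementTree builder."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "checked") arrive as None.
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag, ignore it
        # Close any unclosed elements down to the matching one.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # skip text outside the root element
            self.builder.data(data)

    def finish(self):
        """Flush the parser, close leftover elements, return the root."""
        self.close()
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


# Hypothetical page: unquoted attribute, unclosed <li> and <p>, bare <br>.
PAGE = """<html><body>
<ul id=stories>
<li><a href="http://example.com/a">First story</a>
<li><a href="http://example.com/b">Second story</a>
</ul>
<p>Unclosed paragraph<br>
</body></html>"""


def story_links(page_text):
    """Return (title, href) for every link under the 'stories' list."""
    parser = TreeBuildingParser()
    parser.feed(page_text)
    root = parser.finish()
    # ElementTree supports only a subset of XPath; this query stays within it.
    anchors = root.findall(".//ul[@id='stories']//a")
    return [((a.text or "").strip(), a.get("href")) for a in anchors]


if __name__ == "__main__":
    for title, href in story_links(PAGE):
        print(title, href)
```

The tree builder here is much cruder than a real HTML parser (it nests the
second <li> inside the first, for instance), but the descendant query still
finds both links, which is the point: the XPath expression describes what
you want instead of how the markup happens to be spelled.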
-- David Smiley