
Re: [syndication] Re: How to scrape?

To: syndication@yahoogroups.com
Subject: Re: [syndication] Re: How to scrape?
From: Ken MacLeod <ken@bitsko.slc.ut.us>
Date: 21 Mar 2001 09:27:44 -0600
In-reply-to: "David Smiley"'s message of "Wed, 21 Mar 2001 13:14:24 -0000"
References: <99a9fg+ad9p@eGroups.com>

"David Smiley" <dsmiley@mitre.org> writes:

> > The way I do it is that I load the text of the page into memory,
> > then use regular expressions to extract the proper
> > information. Then I just spit it out in RSS.
> > 
> > If you'd like some Tcl code to do this, I can send you some.
> 
> Yet another way is to use an XML parser and XPath.  There is a
> parser that comes with the Resin servlet engine
> (http://www.caucho.com) that parses HTML even though HTML isn't
> proper XML, or SGML for that matter.  The servlet engine includes an
> XPath library.

HTML Tidy[1] cleans HTML up into well-formed markup, so it will also
let you work with any XPath implementation.  Overall, I highly
recommend XPath: once you've loaded the HTML document, it lets you
get right to the element value you want:

  /HTML/BODY/TABLE/TR[3]/TD[1]
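
A minimal sketch of that lookup in Python, assuming the page has
already been cleaned up into well-formed markup (the kind of output a
tool such as HTML Tidy aims to produce).  The sample page and the
function name are made up for illustration.  The standard library's
ElementTree supports only a subset of XPath and evaluates paths
relative to the root element, so the leading /HTML step is dropped
here:

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be well-formed already.
PAGE = """\
<HTML>
  <BODY>
    <TABLE>
      <TR><TD>Headline</TD><TD>Date</TD></TR>
      <TR><TD>First story</TD><TD>19 Mar</TD></TR>
      <TR><TD>Second story</TD><TD>20 Mar</TD></TR>
    </TABLE>
  </BODY>
</HTML>
"""


def first_cell_of_third_row(markup):
    """Return the text of the first TD in the third TR, or None if absent."""
    root = ET.fromstring(markup)  # root is the HTML element itself
    # Positions in XPath predicates are 1-based.
    cell = root.find("BODY/TABLE/TR[3]/TD[1]")
    return None if cell is None else cell.text


print(first_cell_of_third_row(PAGE))
```

A positional path like this breaks as soon as the site changes its
table layout, so it is worth checking for a None result before
writing the value into a feed.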

An entirely different solution is Pyxie[2] and its file/stream
format, PYX.  Using any Pyxie HTML parser, you can convert HTML into a
drop-dead simple, line-oriented format that's really easy to process
with Unix-like filters, such as grep, awk, sed, Perl, and shell.
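
To give a feel for the line-oriented idea, here is a rough sketch that
emits PYX-style lines from HTML: "(" for a start tag, "A" for an
attribute, "-" for character data, and ")" for an end tag.  It uses
Python's standard html.parser rather than Pyxie's own parser, and the
PyxWriter class, html_to_pyx function, and sample page are
hypothetical names for illustration only.  Dropping whitespace-only
text is a simplification:

```python
from html.parser import HTMLParser


class PyxWriter(HTMLParser):
    """Collect PYX-style lines, one event per line (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("(" + tag)
        for name, value in attrs:
            # Attributes follow the start-tag line: "Aname value".
            self.lines.append("A%s %s" % (name, value or ""))

    def handle_endtag(self, tag):
        self.lines.append(")" + tag)

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text (a simplification)
            escaped = data.replace("\\", "\\\\").replace("\n", "\\n")
            self.lines.append("-" + escaped)


def html_to_pyx(markup):
    """Return the list of PYX-style lines for an HTML string."""
    writer = PyxWriter()
    writer.feed(markup)
    writer.close()
    return writer.lines


# Hypothetical page; note the unclosed <li> tags, as in real-world HTML.
PAGE = ('<ul><li><a href="/a.html">First story</a>'
        '<li><a href="/b.html">Second story</a></ul>')

pyx = html_to_pyx(PAGE)
print("\n".join(pyx))

# The equivalent of running grep '^Ahref ' over the PYX stream:
links = [line.split(" ", 1)[1] for line in pyx if line.startswith("Ahref ")]
print(links)
```

Because every event sits on its own line, pulling out the links is a
simple prefix match, which is exactly what makes the format friendly
to grep, awk, and sed.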

  -- Ken

[1] <http://www.w3.org/People/Raggett/tidy/>
[2] <http://Pyxie.org/>
