
Re: [syndication] Re: Translate non-structured documents into Xml RSS format

To: syndication@egroups.com
Subject: Re: [syndication] Re: Translate non-structured documents into Xml RSS format
From: Mark Nottingham <mnot@mnot.net>
Date: Tue, 26 Sep 2000 09:33:02 -0700
In-reply-to: <14800.50488.780295.290786@wynand.flutterby.com>; from danlyke@flutterby.com on Tue, Sep 26, 2000 at 08:48:08AM -0700
References: <8qq038+sp77@eGroups.com> <Pine.SOL.4.21.0009260934330.18255-100000@ic-unix.ic.utoronto.ca> <14800.50488.780295.290786@wynand.flutterby.com>
User-agent: Mutt/1.2i

Interesting; I've been thinking about writing one from this angle for a
while now.

I think part of the advantage of regexes might be that they can be written
to be quite forgiving, and people smart enough to write them generally make
them so. Perhaps a document structure analyser could be given knowledge of
which tags convey structure and which merely convey presentation.

This could further be refined by 'teaching' it about several common document
structure patterns to look for.
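
A rough sketch of what I mean, in Python using the standard html.parser
module. The STRUCTURAL tag set and the names StructureOutline and outline
are my own assumptions for illustration, not an existing tool; anything not
in the set is treated as presentation and skipped, though its text is kept.

```python
from html.parser import HTMLParser

# Assumed list of tags that convey structure; everything else
# (b, i, font, center, ...) is treated as presentation only.
STRUCTURAL = {
    "h1", "h2", "h3", "h4", "h5", "h6", "p", "li",
    "td", "th", "dt", "dd", "blockquote",
}


class StructureOutline(HTMLParser):
    """Collect [tag, text] pairs for structural tags only."""

    def __init__(self):
        super().__init__()
        self.outline = []      # [tag, text] entries in document order
        self._current = None   # entry currently receiving text

    def handle_starttag(self, tag, attrs):
        # A new structural tag starts a new entry, even if the previous
        # one was never closed (forgiving towards sloppy markup).
        if tag in STRUCTURAL:
            self._current = [tag, ""]
            self.outline.append(self._current)

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current[0]:
            self._current = None

    def handle_data(self, data):
        # Text inside presentational tags still lands in the enclosing
        # structural entry; text outside any structural tag is dropped.
        if self._current is not None:
            self._current[1] += data


def outline(html):
    """Return (tag, text) pairs with whitespace collapsed."""
    parser = StructureOutline()
    parser.feed(html)
    parser.close()
    return [(tag, " ".join(text.split()))
            for tag, text in parser.outline if text.strip()]


sample = '<h1><font color="red">News</font></h1><p>Some <b>bold</b> text<p>Second'
result = outline(sample)
```

Changing the font or bolding in the sample would leave the result
untouched, which is the sort of forgiveness the regex writers get by hand.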

Just some random thoughts...

On Tue, Sep 26, 2000 at 08:48:08AM -0700, Dan Lyke wrote:
> Ian Graham writes:
> > Many scrapers are more sophisticated than that, and actually try to
> > parse the HTML structure, looking for patterns in the hierarchical
> > page structure.
> 
> I've written a couple of scrapers; my experience is that ones that
> parse the HTML structure are both harder to write and more fragile
> than ones that just apply regexps. Most of mine were for mining book
> data from the online stores, and once I found the title it was fairly
> easy to make things that looked for ISBNs and dollar amounts and
> authors and such, and the difficulty was finding the right title.
> 
> People do tweak HTML and appearance; amazingly, they tend to tweak the
> language of the page and the structure of the language less.
> 
> Dan
> 
> 
> 
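
For comparison, the regexp style Dan describes might look roughly like the
Python sketch below. The patterns, the sample markup and the name
scrape_book_fields are assumptions of mine, not his code; the ISBN pattern
only covers 10-digit ISBNs written after the word "ISBN".

```python
import re

# Strip tags first so that markup tweaks do not disturb the text patterns.
TAG_RE = re.compile(r"<[^>]+>")
# "ISBN", optional punctuation, then ten digits (last may be X),
# with optional hyphens or spaces between them.
ISBN_RE = re.compile(r"\bISBN\b[:\s#]*((?:\d[- ]?){9}[\dXx])", re.I)
# Dollar amounts such as $39.96 or $1,299.00.
PRICE_RE = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d\d)?)")


def scrape_book_fields(html):
    """Pull an ISBN and any dollar amounts out of a page fragment."""
    text = TAG_RE.sub(" ", html)
    match = ISBN_RE.search(text)
    isbn = re.sub(r"[- ]", "", match.group(1)).upper() if match else None
    return {"isbn": isbn, "prices": PRICE_RE.findall(text)}


# Made-up fragment in the style of a bookstore listing.
sample = (
    "<tr><td><b>Example Title</b></td>"
    "<td><font size=2>ISBN: 0-596-00027-8</font></td>"
    "<td>List: <i>$49.95</i> Our price: <b>$39.96</b></td></tr>"
)
fields = scrape_book_fields(sample)
```

Neither pattern cares whether the price sits in a table cell or a bold tag,
which fits Dan's point that the wording changes less than the markup.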

-- 
Mark Nottingham
http://www.mnot.net/
