
Re: Translate non-structured documents into Xml RSS format



--- In syndication@egroups.com, Dan Lyke <danlyke@f...> wrote:
> I've written a couple of scrapers, my experience is that ones that
> parse the HTML structure are both harder to write and more fragile
> than ones that just apply regexps. Most of mine were for mining book
> data from the online stores, and once I found the title it was fairly
> easy to make things that looked for ISBNs and dollar amounts and
> authors and such, and the difficulty was finding the right title.

At Growing Lifestyle (http://www.growinglifestyle.com/) we scrape 
around 100 home and garden sites for articles, and convert this into 
thousands of customised RSS feeds (basically permutations and 
combinations of source and topic in a large ontology).  We do the 
same thing at our Australian Taxation site 
(http://www.gststartup.com/) and our new eBusiness site 
(http://www.growingresults.com/ - barely opened, so nothing much 
there yet).

We do the conversion from source to RSS in a number of steps.

Step 1: Find suitable articles

This is a combination of spidering and scraping.

On some sites, the new articles are listed on a single page.  In 
other cases, we must traverse a series of pages to find new 
articles.  On particularly tricky sites, the new articles are listed 
on new pages, so first we must find this new page (typically a URL 
that is a function of the date) before we can scrape for links.
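
For the date-based case, the lookup can be as simple as walking 
backwards from today until an index page answers.  A rough sketch in 
Python (the site and URL pattern are made up for the example - the 
real pattern would come from the per-site description):

    from datetime import date, timedelta
    import urllib.request, urllib.error

    def find_latest_index(base="http://www.example.com/articles"):
        # Walk back from today until a date-based index page responds.
        for days_back in range(14):
            d = date.today() - timedelta(days=days_back)
            url = "%s/%04d%02d%02d/index.html" % (base, d.year, d.month, d.day)
            try:
                with urllib.request.urlopen(url) as resp:
                    if resp.getcode() == 200:
                        return url
            except urllib.error.URLError:
                continue
        return None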

Our in-house tools take an XML description specifying the spidering 
operations, where on the page to find the links (eg before or after 
other features), what suitable links look like (eg which directories 
contain articles), and what links to avoid (also supplemented with a 
database of known "bad" links).  We can also apply functions to the 
link, say to convert it to the "print friendly" version.  Perl-style 
regexps are available at all stages if required, but mostly "a.?b" 
and "a.*b" are as complicated as the patterns need to get.
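
To illustrate the kind of rules such a description encodes, here is a 
rough Python sketch (field names and patterns are invented for the 
example - the real tools read the XML description):

    import re

    site_rules = {
        "accept":  re.compile(r"/articles/.*\.html$"),      # directories that contain articles
        "avoid":   re.compile(r"(login|search|subscribe)"),  # links to skip
        "rewrite": (re.compile(r"\.html$"), "_print.html"),  # map to "print friendly" version
    }

    def select_links(hrefs, rules, known_bad):
        # Keep article-like links, drop known bad ones, rewrite the rest.
        for href in hrefs:
            if href in known_bad or rules["avoid"].search(href):
                continue
            if rules["accept"].search(href):
                pattern, repl = rules["rewrite"]
                yield pattern.sub(repl, href)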

We can also go where other robots fear to tread (eg by filling out 
forms, using databases, iterating etc).  And we can get links from 
sources other than web pages - RSS files, email newsletters etc.

We record the "link text" used for each link.  This is often, but not 
always, a good title.  See step 3 below.

We also get notifications if the quantity of articles found varies 
outside of normal limits, or certain features are not found.  In 
practice this rarely happens, and when it does the usual cause is a 
page redesign.
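
The check itself is trivial - something along these lines (the 
thresholds are invented for the example):

    import logging

    def check_harvest(site, links_found, expected_min=5, expected_max=200):
        # Alert the operator when the harvest size looks abnormal.
        if not (expected_min <= links_found <= expected_max):
            logging.warning("%s: found %d links, outside normal limits",
                            site, links_found)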

Unfortunately at this stage we do not have enough to produce a good 
RSS file - we usually lack a good description.

Step 2: Fetching

The links detected in step 1 are fetched, and some basic integrity 
tests are performed ("Page Not Found" pages served without a 404 
status, malformed HTML etc).
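
In Python, the fetch and the simplest of those checks might look like 
this (the "soft 404" phrases are only examples):

    import urllib.request, urllib.error

    SOFT_404_PHRASES = ("page not found", "article not available")

    def fetch(url):
        # Return (html, error); error is a short reason string on failure.
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except urllib.error.URLError as e:
            return None, "fetch failed: %s" % e
        body = html.lower()
        if any(p in body for p in SOFT_404_PHRASES):
            return None, '"Page Not Found" served without a 404 status'
        if "</html>" not in body:
            return None, "possibly truncated or malformed HTML"
        return html, None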

Step 3: Title determination

Creating a good title is actually very difficult for most sites, and 
something that we consider to be a real competitive advantage of our 
system.

Again, we use in-house tools with an XML description, but in practice 
most sites can use a standard XML description.

Basically the description gives a series of places the title might 
occur.  It might be in the TITLE tag (that would be nice), or it 
might be in the H2 tag.  It might also be immediately after the 3rd 
usage of pixel.gif.  On some sites, the title is accompanied by 
standard text such as the site name.

We also use some much more sophisticated methods for finding titles 
that I will not describe in detail.  Suffice it to say that titles 
are usually next to content, above content, and in a prominent style.
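
A stripped-down sketch of the candidate-gathering side, in Python 
(the regexps are deliberately naive - the real description is 
per-site):

    import re

    def title_candidates(html):
        cands = []
        # The TITLE tag (that would be nice)
        m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        if m:
            cands.append(("TITLE tag", m.group(1).strip()))
        # H1/H2 headings
        for m in re.finditer(r"<h[12][^>]*>(.*?)</h[12]>", html, re.I | re.S):
            cands.append(("heading", re.sub(r"<[^>]+>", "", m.group(1)).strip()))
        # eg the text immediately after the 3rd usage of pixel.gif
        parts = html.split("pixel.gif")
        if len(parts) > 3:
            after = re.sub(r"<[^>]+>", " ", parts[3])[:120].strip()
            if after:
                cands.append(("after 3rd pixel.gif", after))
        return cands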

The result is a series of title candidates.  The candidates are 
assigned probabilities based on:

- previous titles from that site (eg if 99% of the titles previously 
found were from the TITLE tag, then the TITLE tag is a pretty good 
guess this time).

- similarity to the link text found in step 1 (perhaps multiple 
different link texts from different sources).  A sketch combining 
these two signals follows.
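
Illustrative only - the priors and the similarity measure here are 
stand-ins for what the real system learns per site:

    from difflib import SequenceMatcher

    def score_candidates(candidates, link_texts, source_prior):
        # candidates: (source, text) pairs from title_candidates()
        # source_prior: eg {"TITLE tag": 0.99} learned from previous titles
        scored = []
        for source, text in candidates:
            prior = source_prior.get(source, 0.1)
            sim = max((SequenceMatcher(None, text.lower(), lt.lower()).ratio()
                       for lt in link_texts), default=0.0)
            scored.append((prior * (0.5 + 0.5 * sim), text, source))
        return sorted(scored, reverse=True)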

Although our titles at this stage are very good (much better than 
most search engines), we still choose to pass the titles by a human 
editor.  It is surprising how easy it is to spot bad titles in a big 
list of titles.  We only publish around 1000 new links per week - 
typically less than 1 hour to check and correct.  The only real 
effort is in training for a new site.

Step 4: Generate a good summary

Once again we use in-house tools and an XML config file, with most 
sites able to use a standard config file.

Although it would be nice to use the meta-description or the first 
200 characters of text, this is almost always unsatisfactory in our 
opinion.  Just look at the page descriptions on any robotic search 
engine (Altavista, Google etc).

We first try to ignore the navigation features of the page.  
Navigation sections have a high link density, and have high degrees 
of commonality with other pages.

Then we look for the title.  Usually the title is immediately above 
the content.

Content also has characteristics that distinguish it from the 
navigational fluff on the page.  Content has "sentences".  Most of 
the words are not links.  It has "paragraphs".  It is big.  It is 
usually in a different table cell from other parts of the page.

Some sites even provide helpful comments in the HTML that indicate 
where the fluff ends and the content begins.
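
A toy version of the scoring, in Python (split the page into blocks - 
say table cells - then rank them; the weights are illustrative):

    import re

    def content_score(block_html):
        # Higher scores mean "more like content, less like navigation".
        text = re.sub(r"<[^>]+>", " ", block_html)
        words = text.split()
        if not words:
            return 0.0
        anchors = re.findall(r"<a\s[^>]*>(.*?)</a>", block_html, re.I | re.S)
        link_words = sum(len(re.sub(r"<[^>]+>", " ", a).split()) for a in anchors)
        link_density = min(1.0, link_words / float(len(words)))
        sentences = len(re.findall(r"[.!?]\s", text))
        paragraphs = block_html.lower().count("<p")
        # Content is big, has sentences and paragraphs, and is not link-heavy.
        return (1.0 - link_density) * (len(words) + 10 * sentences + 20 * paragraphs)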

Once again we arrive at a series of candidates for the summary, with 
probabilities.

After training, the results are pretty good, but this aspect is not 
yet good enough to fully automate - it's much better than most search 
engines, but perhaps only 90% of pages get a good automatic 
description.

We choose to manually check all summaries.  Since we have a series of 
likely candidates, it is fairly fast.  With sufficient coffee, 
several hundred descriptions per hour is possible from an operator, 
since it is mostly multiple choice with (a) being the most likely 
answer.

Step 5: Classification

Using a range of standard text classification algorithms we attempt 
to classify the link in our subject ontology.  We are currently 
having most success with a modified naive Bayesian classifier, but a 
number of others have been tried.

The full text of the article is used, but with weighting:
- content has a higher weight than navigation (see step 4)
- the summary has a higher weight than the content
- the title has a higher weight than the summary
- the link text is also used (a small sketch of this weighting 
follows)
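
One simple way to realise the weighting with a standard bag-of-words 
classifier is to repeat tokens; the weights below are invented for 
the example:

    def weighted_tokens(title, summary, content, link_text,
                        w_title=8, w_summary=4, w_content=1, w_link=4):
        # Repeating tokens is a crude but effective way to weight fields
        # before handing them to a naive Bayesian classifier.
        tokens = []
        for text, weight in ((title, w_title), (summary, w_summary),
                             (content, w_content), (link_text, w_link)):
            tokens.extend(text.lower().split() * weight)
        return tokens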

We supplement this classification with:
- knowledge of the ontology hierarchy
- knowledge of previous classifications from the same link source
- knowledge of previous classifications from that site and directory

And we periodically supplement the automatic classification with 
manual classification and reclassification.  The more that gets 
classified, the better the future classifications become.

In practice, some subjects work very well, and others are not 
satisfactorily classified by automatic means.

Step 6: Publish

This is the easy step.  We pull the data from our database, and 
generate RSS files.  Permutations of source site, subject and some 
other criteria make for thousands of RSS files.  See the bottom of 
almost every page for the associated RSS feed.
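
Generating each file is straightforward - roughly this, with made-up 
field names (one call per permutation of source and subject):

    from xml.sax.saxutils import escape

    def write_rss(path, channel_title, channel_link, items):
        # items: (title, link, description) tuples pulled from the database.
        with open(path, "w") as f:
            f.write('<?xml version="1.0"?>\n<rss version="0.91">\n<channel>\n')
            f.write("<title>%s</title>\n<link>%s</link>\n"
                    % (escape(channel_title), escape(channel_link)))
            for title, link, description in items:
                f.write("<item><title>%s</title><link>%s</link>"
                        "<description>%s</description></item>\n"
                        % (escape(title), escape(link), escape(description)))
            f.write("</channel>\n</rss>\n")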

Unlike most news sites, we do not attempt to provide real-time 
feeds.  We currently update roughly weekly, which gives us a chance 
to process links in batches, which is a much more efficient use of 
the operator's time.  New gardening links tend to have lasting value, 
unlike news articles that become almost worthless if they are delayed 
by 24 hours.  Most of the value is in the summarisation and 
classification, rather than the "newness" of the link.


I hope you found this description of interest.  I think it is "state 
of the art" in scraping technology, but it is really just the 
beginning.  There is so much scope for improvement in our harnessing 
of meta-data about the web.

Steve