[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Translate non-structured documents into Xml RSS format
--- In syndication@egroups.com, Dan Lyke <danlyke@f...> wrote:
> I've written a couple of scrapers, my experience is that ones that
> parse the HTML structure are both harder to write and more fragile
> than ones that just apply regexps. Most of mine were for mining book
> data from the online stores, and once I found the title it was fairly
> easy to make things that looked for ISBNs and dollar amounts and
> authors and such, and the difficulty was finding the right title.
At Growing Lifestyle (http://www.growinglifestyle.com/) we scrape
around 100 home and garden sites for articles, and convert this into
thousands of customised RSS feeds (basically permutations and
combinations of source and topic in a large ontology). We do the
same thing at our Australian Taxation site
(http://www.gststartup.com/) and our new eBusiness site
(http://www.growingresults.com/ - barely opened, so nothing much
there yet).
We do the conversion from source to RSS in a number of steps.
Step 1: Find suitable articles
This is a combination of spidering and scraping.
On some sites, the new articles are listed on a single page. In
other cases, we must traverse a series of pages to find new
articles. On particularly tricky sites, the new articles are listed
on new pages, so first we must find this new page (typically a URL
that is a function of the date) before we can scrape for links.
Our in-house developed tools take an XML description, specifying the
spidering operations, where on the page to find the links (eg before
or after other features), what suitable links look like (eg which
directories contain articles), and what links to avoid (also
supplemented with a database of known "bad" links). We can also apply
functions to the link, say to convert it to the "print friendly"
version. Perl-style regexps are available at all stages if required,
but mostly "a.?b" and "a.*b" are as complicated as the patterns need
to get.
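A rough Python sketch of this kind of regexp-driven link extraction (the `/articles/` directory, function names and "bad link" handling are invented for illustration; in our real tool this behaviour comes from the XML description):

```python
import re

# Matches an anchor tag, capturing the href and the link text.
HREF_RE = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)

def extract_links(html, good_dir="/articles/", bad_links=()):
    """Return (url, link_text) pairs for candidate article links."""
    found = []
    for url, text in HREF_RE.findall(html):
        # Keep links under the "good" directory, drop known-bad ones.
        if good_dir in url and url not in bad_links:
            # Record the link text too - it is often a usable title.
            found.append((url, re.sub(r"<[^>]+>", "", text).strip()))
    return found
```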
We can also go where other robots fear to tread (eg by filling out
forms, using databases, iterating etc). And we can get links from
sources other than web pages - RSS files, email newsletters etc.
We record the "link text" used for each link. This is often, but not
always, a good title. See (3) below.
We also get notifications if the quantity of articles found varies
outside of normal limits, or if certain expected features are not
found. In practice this rarely occurs; when it does, a page redesign
is usually the cause.
Unfortunately at this stage we do not have enough to produce a good
RSS file - we usually lack a good description.
Step 2: Fetching
The links detected in step 1 are fetched, and some basic integrity
tests are performed ("Page Not Found" pages served without a 404
status, malformed HTML etc).
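A sketch of what such integrity tests might look like (the phrases and checks here are illustrative, not our actual rules):

```python
def basic_integrity_check(status, body):
    """Flag fetched pages that claim success but are really broken.

    Illustrative checks only: a "soft 404" (error text served with
    HTTP status 200) and crudely malformed HTML.
    """
    problems = []
    lowered = body.lower()
    if status != 200:
        problems.append("bad_status")
    if status == 200 and "page not found" in lowered:
        problems.append("soft_404")  # error page without status 404
    if lowered.count("<html") != lowered.count("</html>"):
        problems.append("malformed_html")
    return problems
```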
Step 3: Title determination
Creating a good title is actually very difficult for most sites, and
something that we consider to be a real competitive advantage of our
system.
Again, we use in-house tools with an XML description, but in practice
most sites can use a standard XML description.
Basically the description gives a series of places the title might
occur. It might be in the TITLE tag (that would be nice), or it
might be in the H2 tag. It might also be immediately after the 3rd
usage of pixel.gif. On some sites, the title is accompanied with
standard text such as the site name.
We also use some much more sophisticated methods for finding titles
that I will not describe in detail. Suffice it to say that titles
are usually next to content, above content, and in a prominent style.
The result is a series of title candidates. The candidates are
assigned probabilities based on:
- previous titles from that site (eg if 99% of the titles previously
found were from the TITLE tag, then this is a pretty good guess this
time).
- similarity to the link text found in item 1 (perhaps multiple
different link texts from different sources).
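The candidate-scoring idea can be sketched like this (the 50/50 weighting, the default prior and the use of `difflib` similarity are invented for illustration):

```python
from difflib import SequenceMatcher

def score_titles(candidates, link_text, site_history):
    """Rank title candidates by probability (illustrative weights).

    candidates:   list of (source, text), e.g. ("TITLE", "Growing Roses")
    link_text:    anchor text recorded in step 1
    site_history: {source: fraction of past titles found at that source}
    """
    scored = []
    for source, text in candidates:
        # How often this spot held the title on this site before.
        prior = site_history.get(source, 0.1)
        # Similarity to the link text from step 1.
        sim = SequenceMatcher(None, text.lower(), link_text.lower()).ratio()
        scored.append((0.5 * prior + 0.5 * sim, text))
    return sorted(scored, reverse=True)
```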
Although our titles at this stage are very good (much better than
most search engines), we still choose to pass the titles by a human
editor. It is surprising how easy it is to spot bad titles in a big
list of titles. We only publish around 1000 new links per week -
typically less than 1 hour to check and correct. The only real
effort is in training for a new site.
Step 4: Generate a good summary
Once again we use in-house tools and an XML config file, with most
sites able to use a standard config file.
Although it would be nice to use the meta-description or the first
200 characters of text, this is almost always unsatisfactory in our
opinion. Just look at the page descriptions on any robotic search
engine (AltaVista, Google etc).
We first try to ignore the navigation features of the page.
Navigation sections have a high link density, and have high degrees
of commonality with other pages.
Then we look for the title. Usually the title is immediately above
the content.
Content also has characteristics that distinguish it from the
navigational fluff on the page. Content has "sentences". Most of
the words are not links. It has "paragraphs". It is big. It is
usually in a different table cell from other parts of the page.
Some sites even provide helpful comments in the HTML that indicate
where the fluff ends and the content begins.
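The link-density heuristic can be sketched as follows (the thresholds are invented for illustration):

```python
import re

def link_density(block_html):
    """Fraction of a block's visible text that sits inside <a> tags.

    Navigation blocks have high link density; real content has
    sentences and paragraphs where most words are not links.
    """
    linked = sum(len(re.sub(r"<[^>]+>", "", m))
                 for m in re.findall(r"<a\b[^>]*>(.*?)</a>",
                                     block_html, re.I | re.S))
    text = re.sub(r"<[^>]+>", "", block_html)
    return linked / len(text) if text else 0.0

def looks_like_content(block_html, max_density=0.3, min_length=200):
    """Crude content test: big, and mostly unlinked text."""
    text = re.sub(r"<[^>]+>", "", block_html)
    return len(text) >= min_length and link_density(block_html) <= max_density
```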
Once again we arrive at a series of candidates for the summary, with
probabilities.
After training, it is pretty good. But we don't yet have this aspect
good enough to fully automate - it's much better than most search
engines, but perhaps only 90% of pages get a good automatic
description.
We choose to manually check all summaries. Since we have a series of
likely candidates, it is fairly fast. With sufficient coffee, an
operator can check several hundred descriptions per hour, since it
is mostly multiple choice with (a) being the most likely answer.
Step 5: Classification
Using a range of standard text classification algorithms we attempt
to classify the link in our subject ontology. We are currently
having most success with a modified naive Bayesian classifier, but a
number of others have been tried.
The full text of the article is used, but with weighting:
- content has a higher weight than navigation (see step 4)
- the summary has a higher weight than the content
- the title has a higher weight than the summary
- the link text is also used
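A toy version of field-weighted naive Bayes (the weight values, field names and smoothing are illustrative only):

```python
import math
from collections import Counter, defaultdict

class WeightedNB:
    """Naive Bayes where words from some fields count more than others."""

    # Title outweighs summary, summary outweighs content (per the text).
    WEIGHTS = {"title": 4.0, "summary": 2.0, "link_text": 2.0, "content": 1.0}

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()

    def _words(self, fields):
        for field, text in fields.items():
            w = self.WEIGHTS.get(field, 1.0)
            for word in text.lower().split():
                yield word, w

    def train(self, fields, label):
        self.class_counts[label] += 1
        for word, w in self._words(fields):
            self.word_counts[label][word] += w

    def classify(self, fields):
        best, best_lp = None, float("-inf")
        total = sum(self.class_counts.values())
        for label in self.class_counts:
            lp = math.log(self.class_counts[label] / total)  # prior
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(counts) + 1   # smoothing
            for word, w in self._words(fields):
                lp += w * math.log((counts[word] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```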
We supplement this classification with:
- knowledge of the ontology hierarchy
- knowledge of previous classifications from the same link source
- knowledge of previous classifications from that site and directory
And we periodically supplement the automatic classification with
manual classification and reclassification. The more that gets
classified, the better the future classifications become.
In practice, some subjects work very well, and others are not
satisfactorily classified by automatic means.
Step 6: Publish
This is the easy step. We pull the data from our database, and
generate RSS files. Permutations of source site, subject and some
other criteria make for thousands of RSS files. See the bottom of
almost every page for the associated RSS feed.
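Feed generation really is the easy step; a minimal sketch (structure and helper names are illustrative):

```python
from xml.sax.saxutils import escape

def make_rss(title, link, items):
    """Render a minimal RSS 2.0-style feed from database rows.

    items: list of (title, url, description) tuples; one such file
    would be generated per source/subject permutation.
    """
    parts = ['<?xml version="1.0"?>',
             '<rss version="2.0"><channel>',
             "<title>%s</title>" % escape(title),
             "<link>%s</link>" % escape(link)]
    for t, u, d in items:
        parts.append("<item><title>%s</title><link>%s</link>"
                     "<description>%s</description></item>"
                     % (escape(t), escape(u), escape(d)))
    parts.append("</channel></rss>")
    return "\n".join(parts)
```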
Unlike most news sites, we do not attempt to provide real-time
feeds. We currently update roughly weekly, which gives us a chance
to process links in batches, which is a much more efficient use of
the operator's time. New gardening links tend to have lasting value,
unlike news articles that become almost worthless if they are delayed
by 24 hours. Most of the value is in the summarisation and
classification, rather than the "newness" of the link.
I hope you found this description of interest. I think it is "state
of the art" in scraping technology, but it is really just the
beginning. There is so much scope for improvement in our harnessing
of meta-data about the web.
Steve