* pavka14 (pavka14@yahoo.com) [040110 08:25]:
> Well, yes, this is the best idea and the first that came to mind, but
> we have an application that has been putting news texts into the
> database for four years now; there are a couple of million rows, and
> many of them are not unique because the initial design did not
> foresee this. So we are basically stuck, and will need a week of
> programming to change the DB design, and are looking for some
> workaround at this point, even though we realize it's a patch and
> should not be done this way.
I don't know how much work you're willing to do on this, but here's one
way to handle it. This took us an afternoon to implement and it worked
well for us up through 1.8M records (and should work well beyond that,
as we coupled this with a search system which uses similar techniques to
provide file-based searches):
Jump outside the database and use a filesystem-based check for hashes.
I ran across a very similar situation to yours 6 months ago, where the
database was poorly designed and loads just kept increasing due to
constant queries (fwiw, this was a MySQL database, which frankly doesn't
scale anywhere near as well as many other database engines (even
PostgreSQL) under real loads). A stopgap was needed until that system
could be redesigned, which has since come to pass. We needed to know
whether a data record without a unique key (and hence no unique index
on the database table) had been seen recently, without adding to the
database load.
What we did was to create an md5 hash of the record (thereby giving us
a unique key for it), and store the hash in a file along with the
current timestamp. When we receive a new record we compute its hash,
find the corresponding file, and check whether the hash has been seen
in the past few days (based on the timestamp). If so, we drop the
record as a duplicate; otherwise we do the database insert.
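In outline the check looks something like this (a minimal sketch with
hypothetical names, not the attached code; seen_recently() is fleshed
out further below, and insert_record() stands in for whatever your
existing insert path is):

<?php
// Hypothetical top-level flow: drop records whose hash has been seen
// within the window, otherwise do the normal database insert.
$hash = md5($record_text);              // unique key derived from the record
if (seen_recently($hash, 7 * 86400)) {  // "past few days" taken as 7 days
    // duplicate -- silently drop it
} else {
    insert_record($record_text);        // your existing insert routine
}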
The files are organized into directories, based upon the first few
characters of the md5 hashes they store -- the topmost directory
corresponding to the first character, the next lower directory
corresponding to the second character, and so on.
Within the lowest directory there is one file for each possible value
of the next hash character. For example, using a hash depth of 2 we
would have a first-level directory listing like:
./0
./1
./2
[...]
./e
./f
The second level under 'f' would be:
./f/0
./f/1
./f/2
[...]
./f/e
./f/f
And in './f/f' would be the files:
./f/f/0
./f/f/1
./f/f/2
[...]
./f/f/f
The hash "d41d8cd98f00b204e9800998ecf8427e" would be found (if present)
in the file:
./d/4/1
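To make the path mapping concrete, here is a small helper in that
spirit (the function name and defaults are mine, not taken from the
attached files):

<?php
// Hypothetical helper: one directory level per leading hash character,
// plus one more character for the file name itself.
function hash_file_path($hash, $base = '.', $depth = 2)
{
    $path = $base;
    for ($i = 0; $i < $depth; $i++) {
        $path .= '/' . substr($hash, $i, 1);       // ./d, ./d/4, ...
    }
    return $path . '/' . substr($hash, $depth, 1); // ./d/4/1
}

// hash_file_path('d41d8cd98f00b204e9800998ecf8427e') returns './d/4/1'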
This file might contain the following keys and (epoch) timestamps:
d410f760bc615bf27701adc6256f4944:1063633755
d4185828157438e15c9e5aa2b1cf180d:1063642700
d417c239fb9cde594d7be92a82e5e1d6:1063649031
d41912e388e2216d47b744b56b6e20d8:1063650253
d41d1b2c9b0150c3423a028e0c7b8b5a:1063655499
d4158382fa6855b46ef13e4e063ec96d:1063726185
d41a0bf73e740593fcca856d2137b230:1063746610
d417b35d26151b58525b182bcd869ea0:1063829064
d41f53b8b4ad7c20a86079e5736d073d:1063899033
d41b4354a404e8e058c812be1c201433:1063922149
[...]
What we actually did was to serialize an associative array of md5s and
timestamps to the file. Loading and unserializing the file gives us a
data structure we can manipulate like an array; serializing the array
and writing it back updates the hash file. We prune the array of old
records before writing it back. The final piece of the system is an
advisory lock on the hash files, to ensure that they don't get
corrupted by concurrent writers.
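Here is a minimal sketch (in current PHP) of that read / prune / write
cycle. The function name, the seven-day default window, and the layout
parameters are my own assumptions, not taken from the attached files:

<?php
function seen_recently($hash, $max_age = 604800, $base = '.', $depth = 2)
{
    $file = hash_file_path($hash, $base, $depth);  // helper sketched earlier
    $dir  = dirname($file);
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);                   // create ./d/4 etc. on demand
    }

    // Advisory lock so concurrent writers don't corrupt the file.
    $fp = fopen($file, 'c+');
    flock($fp, LOCK_EX);

    // Load the serialized md5 => timestamp array (empty file => empty array).
    $raw     = stream_get_contents($fp);
    $entries = $raw ? unserialize($raw) : array();
    if (!is_array($entries)) {
        $entries = array();                        // treat a corrupt file as empty
    }

    // Prune entries older than the window before writing back.
    $now = time();
    foreach ($entries as $key => $ts) {
        if ($now - $ts > $max_age) {
            unset($entries[$key]);
        }
    }

    $seen = isset($entries[$hash]);                // seen within the window?
    $entries[$hash] = $now;                        // record/refresh this hash

    // Rewrite the file and release the lock.
    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, serialize($entries));
    flock($fp, LOCK_UN);
    fclose($fp);

    return $seen;
}

Keep in mind that flock() is advisory: it only protects against other
processes that also take the lock, so anything else touching the hash
files needs to play by the same rules.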
The two relevant source code files are attached to this message (let's
see how well Yahoo deals with that). They are written in PHP and
released under the GPL. Clearly the algorithm is trivial to port to
another language.
In our experiments we've found that PHP on a relatively modern system
(PIII-666 + 1GB RAM) can handle effectively instantaneous operations on
hash files of up to roughly 64KB. Making the directory tree deeper will
reduce the size of the individual files, but setting it too deep will
eventually push file sizes below the filesystem's minimum block size,
wasting blocks/inodes. Try leaving the depth at "2" for starters; if
your files get too large, try a depth of "3", and so on.
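As a rough back-of-the-envelope check (my numbers, not measurements
from above): a depth of 2 means 16^2 = 256 leaf directories with 16
files each, or 16^3 = 4096 hash files in total. Even if nothing were
ever pruned, spreading 1.8M entries of roughly 40-60 bytes apiece
(32-character hash, timestamp, serialization overhead) over 4096 files
comes to about 440 entries, or somewhere around 20-30KB, per file --
comfortably under the 64KB figure above. A depth of 3 multiplies the
file count by 16 and divides the per-file size accordingly.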
Rick
--
http://www.rickbradley.com MUPRN: 818
| a different
random email haiku | direction away from the
| center post/wire.
Attachment:
binNeHqyFWfyu.bin
Description: application/ygp-stripped
Attachment:
binE2sMLXQMAQ.bin
Description: application/ygp-stripped