On 2/9/06, Orion Henry <lab@xxxxxxxxxxxxxx> wrote:
>
> Hello All,
>
> I've inherited a postgresql database that I would like to refactor. It
> was originally designed for Postgres 7.0 on a PIII 500MHz and some
> design decisions were made that don't make sense any more. Here's the
> problem:
>
> 1) The database is very large, the largest table has 40 million tuples.
>
> 2) The database needs to import 10's of thousands of tuples each night
> quickly. The current method is VERY slow.
>
> 3) I can't import new records with a COPY or drop my indexes b/c some of
> them are new records (INSERTS) and some are altered records (UPDATES)
> and the only way I can think of to identify these records is to perform
> a select for each record.

[snip]

> 3) The current code that bulk loads data into the database is a loop
> that looks like this:
>
>     $result = exe("INSERT INTO $table ($name_str) SELECT $val_str
>                    WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
>     if ($result == 0)
>     {
>         $result = exe("UPDATE $table SET $non_keys WHERE $keys");
>     }
>
> Is there a faster way to bulk load data when it's not known ahead of
> time if it's a new record or an updated record?

I experimented with something like this and was able to successively
decrease the amount of time an import needed. The final solution, which
took my import down from approximately 24 hours to about 30 minutes, was
to use a C#/Java hashtable or a Python dictionary. For example, the unique
data in one particular table was "User_Agent", so I made it the key in my
hashtable. I actually added a method to the hashtable so that when I added
a new record to it, it would do the insert into the db. The downside to
this is that it used *GOBS* of RAM.

Using Python, I was able to dramatically decrease the RAM usage by
switching to a GDBM-based (on-disk) dictionary instead of the standard
dictionary. It only increased the time by about 50%, so the total
processing time was about 45 min vs. the previous 30 min. (Rough Python
sketches of both approaches are at the end of this message.)

I only had about 35 million records, and my technique was getting to the
point where it was unwieldy, so with your 40 million and counting records
you would probably want to start with the GDBM technique unless you have
a ton of available RAM.

You might interpret this as a knock against PostgreSQL since I pulled the
data out of the db, but it's not; you'd be hard pressed to find anything
as fast as an in-memory hashtable or an on-disk GDBM file. However, its
usefulness is very limited, and for anything more complex than key=>value
lookups, moving to PostgreSQL is likely a big win.

--
Matthew Nuzum
www.bearfruit.org
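
A rough sketch of the in-memory dictionary approach described above, in
Python. This is not the original code: the table and column names
(log_entries, user_agent, hits), the psycopg2 driver, the connection
string, and the sample batch are all illustrative assumptions, and the
UPDATE branch is a guess at how the altered records would be handled.

    # Sketch only -- table, column, and variable names are assumptions.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")   # assumed connection string
    cur = conn.cursor()

    class UpsertDict(dict):
        """Dict keyed on the table's unique column (user_agent here).
        Adding an unseen key INSERTs the row; a known key UPDATEs it,
        so no per-row SELECT is needed to tell the two cases apart."""
        def add(self, user_agent, hits):
            if user_agent in self:
                cur.execute("UPDATE log_entries SET hits = %s WHERE user_agent = %s",
                            (hits, user_agent))
            else:
                cur.execute("INSERT INTO log_entries (user_agent, hits) VALUES (%s, %s)",
                            (user_agent, hits))
            self[user_agent] = hits

    seen = UpsertDict()

    # Seed the dict once from the existing table; after this, every
    # new-vs-updated decision is a pure in-memory lookup (this is
    # where the *GOBS* of RAM go).
    cur.execute("SELECT user_agent FROM log_entries")
    for (ua,) in cur:
        seen[ua] = None

    # Stand-in for the nightly batch of (user_agent, hits) pairs.
    incoming_rows = [("Mozilla/4.0", 10), ("Googlebot/2.1", 3)]
    for ua, hits in incoming_rows:
        seen.add(ua, hits)
    conn.commit()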
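
And the same loop with the standard dictionary swapped for a GDBM-backed
file, which keeps the keys on disk instead of in RAM. The old Python 2
gdbm module is exposed as dbm.gnu in current Python, and the generic
dbm.open() will use it when the GDBM library is installed; the file name
and the b"1" presence marker are assumptions, and the database pieces are
the same placeholders as in the previous sketch.

    # Same caveats as above; only the dictionary changes.
    import dbm       # resolves to dbm.gnu (GDBM) when it is available
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()
    incoming_rows = [("Mozilla/4.0", 10), ("Googlebot/2.1", 3)]

    with dbm.open("user_agents.gdbm", "c") as seen:   # "c": create if missing
        # Seed the on-disk table once; lookups now hit disk, not RAM.
        cur.execute("SELECT user_agent FROM log_entries")
        for (ua,) in cur:
            seen[ua.encode("utf-8")] = b"1"

        for ua, hits in incoming_rows:
            key = ua.encode("utf-8")
            if key in seen:
                cur.execute("UPDATE log_entries SET hits = %s WHERE user_agent = %s",
                            (hits, ua))
            else:
                cur.execute("INSERT INTO log_entries (user_agent, hits) VALUES (%s, %s)",
                            (ua, hits))
                seen[key] = b"1"
    conn.commit()

The extra disk hits on every lookup are presumably where the roughly 50%
longer run time comes from, but the working set no longer has to fit in
memory.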