On 2/9/06, Orion Henry <lab@xxxxxxxxxxxxxx> wrote:
>
> Hello All,
>
> I've inherited a postgresql database that I would like to refactor. It
> was originally designed for Postgres 7.0 on a PIII 500MHz and some
> design decisions were made that don't make sense any more. Here's the
> problem:
>
> 1) The database is very large, the largest table has 40 million tuples.
>
> 2) The database needs to import 10's of thousands of tuples each night
> quickly. The current method is VERY slow.
>
> 3) I can't import new records with a COPY or drop my indexes b/c some of
> them are new records (INSERTS) and some are altered records (UPDATES)
> and the only way I can think of to identify these records is to perform
> a select for each record.

[snip]

> 3) The current code that bulk loads data into the database is a loop
> that looks like this:
>
>     $result = exe("INSERT INTO $table ($name_str) SELECT $val_str
>                    WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
>     if ($result == 0)
>     {
>         $result = exe("UPDATE $table SET $non_keys WHERE $keys");
>     }
>
> Is there a faster way to bulk load data when it's not known ahead of
> time if it's a new record or an updated record?

I experimented with something like this and was able to successively
decrease the amount of time an import needed. The final solution, which
took my import down from approximately 24 hours to about 30 minutes, was
to use a C#/Java hashtable or a Python dictionary. For example, the unique
data in one particular table was "User_Agent", so I made it the key in my
hashtable. I actually added a method to the hashtable so that when I added
a new record to it, it would do the insert into the db. The downside to
this is that it used *GOBS* of RAM.

Using Python, I was able to dramatically decrease the RAM usage by
switching to a GDBM-based (on-disk) dictionary instead of the standard
dictionary. It only increased the time by about 50%, so the total
processing time was about 45 min vs. the previous 30 min. (Rough Python
sketches of both approaches are at the end of this message.)

I only had about 35 million records, and my technique was getting to the
point where it was unwieldy, so with your 40 million and counting records
you would probably want to start with the GDBM technique unless you have
a ton of available RAM.

You might interpret this as a knock against PostgreSQL since I pulled the
data out of the db, but it's not; you'd be hard pressed to find anything
as fast as an in-memory hashtable or an on-disk GDBM file. However, its
usefulness is very limited, and for anything more complex than key=>value
lookups, moving to PostgreSQL is likely a big win.

--
Matthew Nuzum
www.bearfruit.org
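
A rough sketch of the in-memory dictionary approach described above, in
Python. This is not the original code: the table and column names
(log_entries, user_agent, hits), the psycopg2 driver, the connection
string, and the sample batch are all illustrative assumptions, and the
UPDATE branch is a guess at how the altered records would be handled.

    # Sketch only -- table, column, and variable names are assumptions.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")   # assumed connection string
    cur = conn.cursor()

    class UpsertDict(dict):
        """Dict keyed on the table's unique column (user_agent here).
        Adding an unseen key INSERTs the row; a known key UPDATEs it,
        so no per-row SELECT is needed to tell the two cases apart."""
        def add(self, user_agent, hits):
            if user_agent in self:
                cur.execute("UPDATE log_entries SET hits = %s WHERE user_agent = %s",
                            (hits, user_agent))
            else:
                cur.execute("INSERT INTO log_entries (user_agent, hits) VALUES (%s, %s)",
                            (user_agent, hits))
            self[user_agent] = hits

    seen = UpsertDict()

    # Seed the dict once from the existing table; after this, every
    # new-vs-updated decision is a pure in-memory lookup (this is
    # where the *GOBS* of RAM go).
    cur.execute("SELECT user_agent FROM log_entries")
    for (ua,) in cur:
        seen[ua] = None

    # Stand-in for the nightly batch of (user_agent, hits) pairs.
    incoming_rows = [("Mozilla/4.0", 10), ("Googlebot/2.1", 3)]
    for ua, hits in incoming_rows:
        seen.add(ua, hits)
    conn.commit()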
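
And the same loop with the standard dictionary swapped for a GDBM-backed
file, which keeps the keys on disk instead of in RAM. The old Python 2
gdbm module is exposed as dbm.gnu in current Python, and the generic
dbm.open() will use it when the GDBM library is installed; the file name
and the b"1" presence marker are assumptions, and the database pieces are
the same placeholders as in the previous sketch.

    # Same caveats as above; only the dictionary changes.
    import dbm       # resolves to dbm.gnu (GDBM) when it is available
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()
    incoming_rows = [("Mozilla/4.0", 10), ("Googlebot/2.1", 3)]

    with dbm.open("user_agents.gdbm", "c") as seen:   # "c": create if missing
        # Seed the on-disk table once; lookups now hit disk, not RAM.
        cur.execute("SELECT user_agent FROM log_entries")
        for (ua,) in cur:
            seen[ua.encode("utf-8")] = b"1"

        for ua, hits in incoming_rows:
            key = ua.encode("utf-8")
            if key in seen:
                cur.execute("UPDATE log_entries SET hits = %s WHERE user_agent = %s",
                            (hits, ua))
            else:
                cur.execute("INSERT INTO log_entries (user_agent, hits) VALUES (%s, %s)",
                            (ua, hits))
                seen[key] = b"1"
    conn.commit()

The extra disk hits on every lookup are presumably where the roughly 50%
longer run time comes from, but the working set no longer has to fit in
memory.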