On Thu, Feb 09, 2006 at 11:07:06AM -0800, Orion Henry wrote:
> Hello All,
>
> I've inherited a postgresql database that I would like to refactor. It
> was originally designed for Postgres 7.0 on a PIII 500MHz and some
> design decisions were made that don't make sense any more. Here's the
> problem:
>
> 1) The database is very large; the largest table has 40 million tuples.
>
> 2) The database needs to import tens of thousands of tuples each night
> quickly. The current method is VERY slow.
>
> 3) I can't import new records with a COPY or drop my indexes b/c some of
> them are new records (INSERTs) and some are altered records (UPDATEs),
> and the only way I can think of to identify these records is to perform
> a select for each record.
>
> Here is how the database is currently laid out, and you'll see why I
> have a problem with it:
>
> 1) The data is easily partitionable by client ID. In an attempt to keep
> the indexes small and the inserts fast, one table was made per client
> ID. Thus the primary table in the database (the one with 40 million
> tuples) is really 133 tables, each ending with a three-digit suffix.
> The largest of these client tables has 8 million of the 40 million
> tuples. The system started with around a half dozen clients and is now
> a huge pain to manage with so many tables. I was hoping new hardware
> and new postgres features would allow for this data to be merged safely
> into a single table.

If all the clients are equally active then partitioning by client
probably makes little sense. If some clients are much more active than
others then leaving this as-is could be a pretty big win. If the
partitioning is done with either a view and rules or inherited tables
and rules, it shouldn't be too hard to manage.

> 2) The imports are not done inside of transactions. I'm assuming the
> system designers excluded this for a reason. Will I run into problems
> performing tens of thousands of inserts and updates inside a single
> transaction?

Never attribute to thoughtful design that which can be fully explained
by ignorance. :) I'd bet they just didn't know any better.

> 3) The current code that bulk loads data into the database is a loop
> that looks like this:
>
>     $result = exe("INSERT INTO $table ($name_str)
>                    SELECT $val_str
>                    WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
>     if ($result == 0)
>     {
>         $result = exe("UPDATE $table SET $non_keys WHERE $keys");
>     }
>
> Is there a faster way to bulk load data when it's not known ahead of
> time if it's a new record or an updated record?

Uuuugly. :) Instead, load everything into a temp table using COPY, then
do UPDATE real_table ... FROM temp_table t WHERE real_table.key = t.key,
followed by INSERT ... SELECT ... WHERE NOT EXISTS. Note that this
approach has a race condition, so you can only use it if you know
nothing else will be inserting into the real table at the same time.
You might want to look at the stats-proc code at
http://cvs.distributed.net; it does exactly this type of thing.
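Roughly something like the following. This is only a sketch: the staging
table name and file path are made up, the SET list is abbreviated, and
I'm guessing from the schema you posted below that (order_number,
order_line_number, action_number) is the unique key, so adjust to taste.

    -- Sketch only: assumes the merged table is called actioninfo, that
    -- (order_number, order_line_number, action_number) is the key, and
    -- that nothing else writes to actioninfo while the load runs.
    BEGIN;

    -- Staging table with the same layout as the real table.
    CREATE TEMP TABLE actioninfo_load AS
        SELECT * FROM actioninfo LIMIT 0;

    -- Bulk load the nightly file (or use \copy from psql if you are
    -- not loading as a superuser).
    COPY actioninfo_load FROM '/path/to/nightly_load.dat';

    -- First update the rows that already exist...
    UPDATE actioninfo
       SET transaction_date = l.transaction_date,
           quantity         = l.quantity,
           extension        = l.extension
           -- ...and so on for the rest of the non-key columns
      FROM actioninfo_load l
     WHERE actioninfo.order_number      = l.order_number
       AND actioninfo.order_line_number = l.order_line_number
       AND actioninfo.action_number     = l.action_number;

    -- ...then insert the ones that don't exist yet.
    INSERT INTO actioninfo
    SELECT l.*
      FROM actioninfo_load l
     WHERE NOT EXISTS (SELECT 1
                         FROM actioninfo a
                        WHERE a.order_number      = l.order_number
                          AND a.order_line_number = l.order_line_number
                          AND a.action_number     = l.action_number);

    COMMIT;

Doing it this way turns tens of thousands of individual statements into
two set-based ones, which is where the big win comes from.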
> What I would LIKE to do, but am afraid I will hit a serious performance
> wall (or am missing an obvious / better way to do it):
>
> 1) Merge all 133 client tables into a single new table, add a client_id
> column, and do the data partitioning on the indexes, not the tables, as
> seen here:
>
>     CREATE INDEX actioninfo_order_number_XXX_idx
>         ON actioninfo ( order_number ) WHERE client_id = XXX;
>     CREATE INDEX actioninfo_trans_date_XXX_idx
>         ON actioninfo ( transaction_date ) WHERE client_id = XXX;
>
> (Aside question: if I were to find a way to use COPY and I were
> loading data on a single client_id, would dropping just the indexes for
> that client_id accelerate the load?)

Hrm, I believe it would...

> 2) Find some way to make the bulk loads faster or more efficient (help!)

Don't do things row-by-row. If you can't ensure that there will be only
one process inserting (to eliminate the race condition I mentioned
above), reply back and I'll point you at code that should still be much
faster than what you're doing now.

> 3) Wrap each load into a transaction ( tens of thousands of records per
> load )

Getting rid of row-by-row processing will be your biggest win. If you do
have to go row-by-row, at least wrap it in a transaction. As long as the
transaction doesn't take *too* long it won't be an issue.

> Is this a good plan? Is there a better way? Am I walking into a trap?
> Should I leave well enough alone and not try and fix something that's
> not broken?
>
> FWIW here's the hardware and the profile of the current uber table:
>
>        Column       |  Type   | Modifiers
>  -------------------+---------+-----------
>   order_number      | integer | not null
>   order_line_number | integer | not null
>   action_number     | integer | not null
>   transaction_date  | date    |
>   code              | text    |
>   trans_group_code  | text    |
>   quantity          | integer |
>   extension         | money   |
>   sales_tax         | money   |
>   shipping          | money   |
>   discount          | money   |
>
> Dual Opteron 246, 4-disk SCSI RAID 5, 4GB of RAM

Remember that the write performance of RAID 5 normally stinks.

> # du -sh /var/lib/postgres/data/
> 16G     /var/lib/postgres/data/
>
> ( the current database is PG 7.4 - I intend to upgrade it to 8.1 if and
> when I do this refactoring )

Going to 8.1 would help in a large number of ways even if you don't
refactor. The stats-proc code I mentioned runs 2x faster under 8.1 than
it does under 7.4.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@xxxxxxxxxxxxx
Pervasive Software    http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf     cell: 512-569-9461