Orion Henry <lab@xxxxxxxxxxxxxx> writes:

> What I would LIKE to do but am afraid I will hit a serious performance wall
> (or am missing an obvious / better way to do it)
>
> 1) Merge all 133 client tables into a single new table, add a client_id
>    column, do the data partitioning on the indexes not the tables as seen
>    here:
>
>      CREATE INDEX actioninfo_order_number_XXX_idx
>          ON actioninfo ( order_number ) WHERE client_id = XXX;
>      CREATE INDEX actioninfo_trans_date_XXX_idx
>          ON actioninfo ( transaction_date ) WHERE client_id = XXX;

The advantages of the partitioned scheme are: a) you can drop a client
quickly in a single operation, b) the indexes are only half as wide since
they don't include client_id, and c) you can do a sequential scan of an
entire client without using the index at all.

Unless any of these is overwhelming I would say go ahead and merge them.
If you frequently scan all the records of a single client, or frequently
drop entire clients, then the current scheme may be worth keeping.

> (Aside question: if I were to find a way to use COPY and I were loading
> data on a single client_id, would dropping just the indexes for that
> client_id accelerate the load?)

Dropping indexes would accelerate the load, but unless you're loading a
large number of records relative to the current size I'm not sure it would
be a win, since you would then have to rebuild the index for the entire
segment.

> 2) Find some way to make the bulk loads faster or more efficient (help!)

If your existing data isn't changing while you're doing the load (and if it
is, then your existing load process has a race condition, btw) then you
could do it in a couple of big queries:

  COPY ${table}_new FROM '...';

  CREATE TABLE ${table}_exists AS
    SELECT * FROM ${table}_new
     WHERE EXISTS (SELECT 1 FROM ${table}
                    WHERE ${table}_new.key = ${table}.key);

  CREATE TABLE ${table}_insert AS
    SELECT * FROM ${table}_new
     WHERE NOT EXISTS (SELECT 1 FROM ${table}
                        WHERE ${table}_new.key = ${table}.key);

  UPDATE ${table} SET ...
    FROM ${table}_exists
   WHERE ${table}_exists.key = ${table}.key;

  INSERT INTO ${table} SELECT * FROM ${table}_insert;

Actually you could skip the whole ${table}_insert step there and just do the
insert directly from ${table}_new with the same NOT EXISTS condition, I
guess. There are also other approaches you could use, like adding a new
column to ${table}_new instead of creating new tables, etc.

> 3) Wrap each load into a transaction ( tens of thousands of records per load )

Yes, Postgres is faster if you do more operations in a single transaction;
every COMMIT means waiting for an fsync. The only disadvantage of batching
them into one large transaction is that if it lasts a *long* time it can
interfere with your vacuum strategy: any vacuum that runs while the
transaction is still open won't be able to reclaim anything.

You might consider running VACUUM FULL or CLUSTER on the table when you're
done with the loading process. It will lock the table while it runs, though.

-- greg
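
For concreteness, a minimal sketch of the merged-table scheme from (1): one
table for all clients plus per-client partial indexes. Everything other than
order_number, transaction_date and the index names quoted above is a made-up
placeholder, and 42 stands in for the XXX client id:

  -- Single merged table; all columns besides client_id, order_number and
  -- transaction_date are placeholders for the real schema.
  CREATE TABLE actioninfo (
      client_id        integer NOT NULL,
      order_number     integer,
      transaction_date date
      -- ... the rest of the columns ...
  );

  -- One pair of partial indexes per client, as in the quoted post
  -- (42 standing in for XXX).
  CREATE INDEX actioninfo_order_number_42_idx
      ON actioninfo ( order_number ) WHERE client_id = 42;
  CREATE INDEX actioninfo_trans_date_42_idx
      ON actioninfo ( transaction_date ) WHERE client_id = 42;

  -- The flip side of advantage (a): dropping a client is no longer a single
  -- DROP TABLE but a DELETE plus dropping that client's partial indexes.
  DELETE FROM actioninfo WHERE client_id = 42;
  DROP INDEX actioninfo_order_number_42_idx;
  DROP INDEX actioninfo_trans_date_42_idx;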
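
And a minimal sketch of (2) and (3) put together: the staging-table load
above wrapped in a single transaction, with a VACUUM FULL once the load is
done. The table name actioninfo, the key columns (client_id, order_number),
the columns in the SET list and the file path are stand-ins for whatever the
real schema and load file are, not the poster's actual setup:

  BEGIN;

  -- Load the incoming batch into a staging table with one COPY.
  CREATE TEMP TABLE actioninfo_new (LIKE actioninfo);
  COPY actioninfo_new FROM '/path/to/batch.dat';

  -- Update the rows that already exist...
  UPDATE actioninfo
     SET transaction_date = actioninfo_new.transaction_date
    FROM actioninfo_new
   WHERE actioninfo.client_id    = actioninfo_new.client_id
     AND actioninfo.order_number = actioninfo_new.order_number;

  -- ...and insert the ones that don't.
  INSERT INTO actioninfo
  SELECT n.*
    FROM actioninfo_new n
   WHERE NOT EXISTS (SELECT 1 FROM actioninfo a
                      WHERE a.client_id    = n.client_id
                        AND a.order_number = n.order_number);

  COMMIT;

  -- Reclaim space after the load; VACUUM FULL can't run inside the
  -- transaction and will lock the table while it works.
  VACUUM FULL actioninfo;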