Hello All,
I've inherited a PostgreSQL database that I would like to refactor. It
was originally designed for Postgres 7.0 on a PIII 500MHz, and some
design decisions were made that no longer make sense. Here's the
problem:
1) The database is very large, the largest table has 40 million tuples.
2) The database needs to import tens of thousands of tuples each night,
and it needs to do it quickly. The current method is VERY slow.
3) I can't load the new data with COPY or drop my indexes during the
load, because some of the incoming records are new (INSERTs) and some
are changes to existing records (UPDATEs), and the only way I can think
of to tell them apart is to do a SELECT for each record.
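To make (3) concrete, the per-record check I have in mind looks roughly
like this (the client table name, the literal values, and my guess at
the natural key are all just for illustration):

    SELECT 1
      FROM actioninfo_042                -- one of the 133 per-client tables
     WHERE order_number      = 1234567
       AND order_line_number = 1
       AND action_number     = 1;
    -- 0 rows back means INSERT the incoming record, 1 row means UPDATE it.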
Here is how the database is currently laid out, and you'll see why I
have a problem with it:
1) The data is easily partitionable by client ID. In an attempt to keep
the indexes small and the inserts fast, one table was made per client
ID. Thus the primary table in the database (the one with 40 million
tuples) is really 133 tables, each ending in a three-digit suffix.
The largest of these client tables has 8 million of the 40 million
tuples. The system started with around half a dozen clients and is now
a huge pain to manage with so many tables. I was hoping new hardware
and new Postgres features would allow this data to be merged safely
into a single table.
2) The imports are not done inside transactions. I'm assuming the
system designers left them out for a reason. Will I run into problems
performing tens of thousands of inserts and updates inside a single
transaction?
3) The current code that bulk loads data into the database is a loop
that looks like this:
$result = exe("INSERT INTO $table ($name_str) SELECT
$val_str WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
if ($result == 0)
{
$result = exe("UPDATE $table SET $non_keys WHERE
$keys");
}
Is there a faster way to bulk load data when it's not known ahead of
time whether each record is new or an update to an existing one?
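The best alternative I've come up with so far, and what I'd like a
sanity check on, is to COPY each night's file into a staging table and
then split the work into one set-based UPDATE and one set-based INSERT
instead of a round trip per record. A rough sketch against one of the
current client tables (the staging table, the file path, the key
columns, and the columns being updated are all just illustrative):

    BEGIN;

    -- Empty staging table with the same layout as the target table.
    CREATE TEMP TABLE staging AS
        SELECT * FROM actioninfo_042 LIMIT 0;

    -- Bulk load the night's file (or \copy it from the client side).
    COPY staging FROM '/tmp/nightly_042.dat';

    -- Records that already exist become one set-based UPDATE ...
    UPDATE actioninfo_042
       SET transaction_date = s.transaction_date,
           quantity         = s.quantity,
           extension        = s.extension
      FROM staging s
     WHERE actioninfo_042.order_number      = s.order_number
       AND actioninfo_042.order_line_number = s.order_line_number
       AND actioninfo_042.action_number     = s.action_number;

    -- ... and the rest become one set-based INSERT.
    INSERT INTO actioninfo_042
        SELECT s.*
          FROM staging s
         WHERE NOT EXISTS ( SELECT 1
                              FROM actioninfo_042 a
                             WHERE a.order_number      = s.order_number
                               AND a.order_line_number = s.order_line_number
                               AND a.action_number     = s.action_number );

    COMMIT;

This would also turn each nightly load into the single big transaction
I was asking about in point 2 above.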
Here is what I would LIKE to do, but I am afraid I will hit a serious
performance wall (or that I am missing an obvious / better way to do it):
1) Merge all 133 client tables into a single new table, add a client_id
column, and do the partitioning on the indexes rather than the tables,
as seen here (a fuller sketch of what I'm picturing follows this list):
    CREATE INDEX actioninfo_order_number_XXX_idx
        ON actioninfo ( order_number ) WHERE client_id = XXX;
    CREATE INDEX actioninfo_trans_date_XXX_idx
        ON actioninfo ( transaction_date ) WHERE client_id = XXX;
(Aside question: if I were to find a way to use COPY and I were
loading data for a single client_id, would dropping just that
client_id's partial indexes speed up the load?)
2) Find some way to make the bulk loads faster or more efficient (help!)
3) Wrap each load in a transaction (tens of thousands of records per
load).
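For what it's worth, here is roughly what I'm picturing for (1), using
the column list from the profile below; the client_id type, the example
client number, and the drop/recreate bit from my aside question are all
just illustrative:

    CREATE TABLE actioninfo (
        client_id          integer NOT NULL,  -- replaces the three-digit table suffix
        order_number       integer NOT NULL,
        order_line_number  integer NOT NULL,
        action_number      integer NOT NULL,
        transaction_date   date,
        code               text,
        trans_group_code   text,
        quantity           integer,
        extension          money,
        sales_tax          money,
        shipping           money,
        discount           money
    );

    -- One pair of partial indexes per client, e.g. for client 42:
    CREATE INDEX actioninfo_order_number_042_idx
        ON actioninfo ( order_number ) WHERE client_id = 42;
    CREATE INDEX actioninfo_trans_date_042_idx
        ON actioninfo ( transaction_date ) WHERE client_id = 42;

    -- For the aside question, the idea would be to drop only that
    -- client's partial indexes around a COPY of its data and recreate
    -- them afterwards:
    --   DROP INDEX actioninfo_order_number_042_idx;
    --   DROP INDEX actioninfo_trans_date_042_idx;

The hope is that each partial index stays about the size it is today
while I end up with only one table to manage.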
Is this a good plan? Is there a better way? Am I walking into a trap?
Should I leave well enough alone and not try and fix something that's
not broken?
FWIW, here's the profile of the current uber table and the hardware it
runs on:
      Column       |  Type   | Modifiers
-------------------+---------+-----------
 order_number      | integer | not null
 order_line_number | integer | not null
 action_number     | integer | not null
 transaction_date  | date    |
 code              | text    |
 trans_group_code  | text    |
 quantity          | integer |
 extension         | money   |
 sales_tax         | money   |
 shipping          | money   |
 discount          | money   |
Dual Opteron 246, 4 disk SCSI RAID5, 4GB of RAM
# du -sh /var/lib/postgres/data/
16G /var/lib/postgres/data/
( the current database is PG 7.4 - I intend to upgrade it to 8.1 if and
when I do this refactoring )
( the current OS is Debian Unstable but I intend to be running RHEL 4.0
if and when I do this refactoring )