was originally designed for Postgres 7.0 on a PIII 500MHz and some
Argh.
1) The database is very large, the largest table has 40 million tuples.
Are these simple types (like a few ints, text...) ?
How much space does it use on disk ? Can it fit in RAM ?
2) The database needs to import 10's of thousands of tuples each night
quickly. The current method is VERY slow.
You bet, COMMIT'ing after each insert or update is about the worst that
can be done. It works fine on MySQL/MyISAM (which doesn't know about
commit...) so I'd guess the system designer had previous experience with
MySQL.
My advice would be :
- get a decent machine with some RAM (I guess you already knew this)...
Now, the update.
I would tend to do this :
- Generate a text file with your update data, using whatever tool you like
best (perl, php, python, java...)
- CREATE TEMPORARY TABLE blah ...
- COPY blah FROM your update file.
COPY is super fast. I think temporary tables don't write to the xlog, so
they are also very fast. This should not take more than a few seconds for
a few tens of thousands of simple rows on modern hardware. It actually
takes a fraction of a second on my PC for about 9K rows with 5 INTEGERs each.
You can also add constraints on your temporary table to sanitize your
data, in order to be reasonably sure that the following updates will work.
The data you feed to COPY should be correct, or it will roll back. It is
your script's job to escape everything.
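For example, with a made-up main table orders(id, qty, price) and a nightly
dump file /tmp/update.txt (both names invented for illustration), the load
step could look roughly like this :

    -- temp table mirroring the columns present in the nightly file
    CREATE TEMPORARY TABLE blah (
        id     INTEGER NOT NULL,
        qty    INTEGER NOT NULL CHECK (qty >= 0),   -- sanity-check constraint
        price  NUMERIC NOT NULL
    );

    -- bulk-load the text file in one shot (server-side path ;
    -- use \copy from psql if the file lives on the client)
    COPY blah FROM '/tmp/update.txt';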
Now you've got your data in the database. You have several options :
- You are confident that the UPDATE will work without being rolled back
by some constraint violation. Therefore, you issue a big joined UPDATE to
update all the rows in your main table which are also in your temp table.
Then you issue an INSERT INTO ... SELECT ... to insert the ones which were
not already in the big table (both statements are sketched after this list).
Joined updates can be slow if your RAM is too small and the system has to
thrash the disk, seeking around for every tuple.
You can cheat and CLUSTER your main table (say, once a week), so it is
all in index order. Then you arrange your update data so it is in the same
order (for instance, you SELECT INTO another temp table, with an ORDER BY
corresponding to the CLUSTER on the main table). Having both in the same
order will help reduce random disk accesses.
- If you don't like this method, then you might want to use the same
strategy as before (i.e. a zillion queries), but write it in PL/pgSQL instead.
PL/pgSQL is a lot faster, because everything is already parsed and planned
beforehand. So you could do the following (also sketched after this list) :
- for each row in the temporary update table :
    - UPDATE the corresponding row in the main table
    - IF FOUND, then cool, it was updated, nothing more to do.
      You don't need to SELECT in order to know if the row is there.
      UPDATE does it for you, without the race condition.
    - IF NOT FOUND, then insert.
      This has a race condition.
      You know your application, so you'll know if it matters or not.
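To make both options concrete, here are rough sketches reusing the invented
orders and blah tables from the COPY example above (untested, adapt the
columns to your schema). First the big joined UPDATE plus the
INSERT INTO ... SELECT :

    -- update the rows that already exist in the main table
    UPDATE orders
       SET qty   = blah.qty,
           price = blah.price
      FROM blah
     WHERE orders.id = blah.id;

    -- insert the rows that were not already there
    INSERT INTO orders (id, qty, price)
    SELECT b.id, b.qty, b.price
      FROM blah b
     WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.id = b.id);

And the per-row variant as a PL/pgSQL function (the function name is made
up ; everything inside is parsed once and the plans are reused for every
row, and the INSERT has the race condition mentioned above) :

    CREATE OR REPLACE FUNCTION merge_blah() RETURNS void AS '
    DECLARE
        r RECORD;
    BEGIN
        FOR r IN SELECT * FROM blah LOOP
            -- try the UPDATE first; FOUND tells us whether a row was hit
            UPDATE orders SET qty = r.qty, price = r.price WHERE id = r.id;
            IF NOT FOUND THEN
                -- no existing row : insert it
                INSERT INTO orders (id, qty, price)
                VALUES (r.id, r.qty, r.price);
            END IF;
        END LOOP;
        RETURN;
    END;
    ' LANGUAGE plpgsql;

A single SELECT merge_blah(); then runs the whole loop server-side.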
What do you think ?
3) I can't import new records with a COPY or drop my indexes b/c some of
them are new records (INSERTS) and some are altered records (UPDATES)
and the only way I can think of to identify these records is to perform
a select for each record.
Yes and no ; if you must do this, then use PL/pgSQL, it's a lot faster. And
skip the SELECT.
Also, use the latest version. It really rocks.
Like many said on the list, put pg_xlog on its own physical disk, with
ext2fs.
3) Wrap each load into a transaction ( tens of thousands of records per
load )
That's the idea. The first strategy here (the big update) uses one
transaction anyway. For the other one, it's your choice. You can either do
it all in one transaction, or in batches of 1000 rows... but one row at a
time would be horrendously slow.
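As a minimal sketch of the one-big-transaction variant (reusing the invented
blah table and merge_blah() function from above, and meant to be run as a
fresh script each night, e.g. with psql -f) :

    BEGIN;

    -- build and fill the temporary table inside the transaction
    CREATE TEMPORARY TABLE blah (
        id     INTEGER NOT NULL,
        qty    INTEGER NOT NULL,
        price  NUMERIC NOT NULL
    );
    COPY blah FROM '/tmp/update.txt';

    -- the whole merge commits or rolls back as one unit
    SELECT merge_blah();

    COMMIT;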
Regards,
P.F.C