Re: Batch update query performance

Heikki Linnakangas <hlinnakangas@xxxxxxxxxx> · Mon, 07 Apr 2014 15:53:33 +0300

On 04/07/2014 03:06 PM, Albe Laurenz wrote:
Hans Drexler wrote:
Postgres needs close to 50
minutes to process the same query on the same data. Sometimes, Postgres
needs more than 2 hours.

The application performs an update query on every row
of the table. The exact SQL of this query is:

update t67cdi_nl_cmp_descr set is_grc_002='Y'

[...]

We tried removing all indexes. That reduces the runtime to ~3 minutes.
When we start to put indexes back, the run time of the query increases
again with each index added.

Do I read that right that the duration of the update is reduced from
50 or 120 minutes to 3 when you drop all the indexes?

If that's true, you might be able to drop and re-create the indexes as 
part of the same transaction, and come out ahead. DROP/CREATE INDEX is 
transactional in PostgreSQL, so you can do:

BEGIN;
DROP INDEX index1;
...
DROP INDEX index15;
UPDATE t67cdi_nl_cmp_descr SET is_grc_002='Y'
CREATE INDEX index1 ...;
...
CREATE INDEX index15 ...;
COMMIT;

This will take an AccessExclusiveLock on the table, though, so the table 
will be inaccessible to concurrent queries while it's running.

Actually, since you are effectively rewriting the table anyway, you 
could create a new table with same structure, insert all rows from the 
old table, with is_grc_002 set to 'Y', drop the old table, and rename 
the new table into its place.

Do all the rows really need to be updated? If some of the rows already 
have is_grc_002='Y', you can avoid rewriting those rows by adding a 
WHERE-clause: WHERE NOT is_grc_002='Y' OR is_grc_002 IS NULL.

You could also play tricks with partitioning. Don't store the is_grc_002 
row in the table at all. Instead, create two tables, one for the rows 
that implicitly have is_grc_002='Y' and another for all the other rows. 
Then create a view on the union of the two tables, which adds the 
is_grc_002 column. Instead of doing a full-table update, you can just 
alter the view to display is_grc_002='Y' for both tables (and add a new 
table to hold new rows with is_grc_002<>'Y').

Hypothesis
we have tried many things to solve this problem ourselves, but to no
avail so far. Our hypothesis is that
the Postgres creates new records for all rows and then needs to update
all 15 indexes to make them point to the new rows. There does not seem
to be a way to avoid that.

Question:
- Is our hypothesis correct?
- Can the forum please advise us on possible ways to make the query
faster?

Your hypothesis may be correct.

Yeah, sounds about right. A full-table UPDATE like that is pretty much 
the worst-case scenario for PostgreSQL's MVCC system, unfortunately.

- Heikki

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance