Re: Optimizing bulk update performance

Gavin Flower <GavinFlower@xxxxxxxxxxxxxxxxx> · Sat, 27 Apr 2013 12:27:31 +1200

    On 27/04/13 12:14, Yang Zhang wrote:

      It currently takes up to 24h for us to run a large set of UPDATE
statements on a database, which are of the form:

    UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
    id = constid

(We're just overwriting fields of objects identified by ID.)

The tables have handfuls of indices each and no foreign key constraints.
No COMMIT is made till the end.

It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
baseline we should reasonably target.

Short of producing a custom program that somehow reconstructs a dataset
for Postgresql to re-import, is there anything we can do to bring the
bulk UPDATE performance closer to that of the import?  (This is an area
that we believe log-structured merge trees handle well, but we're
wondering if there's anything we can do within Postgresql.)

Some ideas:

- dropping all non-ID indices and rebuilding afterward?
- increasing checkpoint_segments, but does this actually help sustained
  long-term throughput?
- using the techniques mentioned here?  (Load new data as table, then
  "merge in" old data where ID is not found in new data)
  <http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9668@xxxxxxxxxxxxxx>

Basically there's a bunch of things to try and we're not sure what the
most effective are or if we're overlooking other things.  We'll be
spending the next few days experimenting, but we thought we'd ask here
as well.

Thanks.

    People will need to know your version of Postgres &
                  which Operating System etc. plus
                    details of CPU RAM, and Disks... AS well as what
                    changes you have made to
                    postgresql.conf...

      I would be inclined to DROP all indexes and reCREATE them later.

      Updating a row might lead to new
          row being added in a new disk page, so I
            suspect that updates will hit every index associated with
            the table with the (possible exception of partial indexes).

            Running too many updates in one transaction,
              may mean that Postgres may need to use
                disk work files.

                Depending on RAM etc, it may pay to
                  increase some variables tat affect how
                  Postgres
                                uses RAM, some of these
                                  are per session.

                                    Cheers,

                                      Gavin