On Sat, Apr 27, 2013 at 12:24 AM, Yang Zhang <yanghatespam@xxxxxxxxx> wrote: > On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote: >> Yang Zhang <yanghatespam@xxxxxxxxx> writes: >>> It currently takes up to 24h for us to run a large set of UPDATE >>> statements on a database, which are of the form: >> >>> UPDATE table SET field1 = constant1, field2 = constant2, ... WHERE >>> id = constid >> >>> (We're just overwriting fields of objects identified by ID.) >> >> Forgive the obvious question, but you do have an index on "id", right? >> Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)? > > Totally valid question. That is the primary key with its own index. > Yes, we verified that explain says it just use a simple index scan. > Each individual query runs reasonably quickly (we can run several > dozen such statements per second). > >> >>> The tables have handfuls of indices each and no foreign key constraints. >> >> How much is a "handful"? > > The table with the largest volume of updates (our bottleneck) has four indexes: > > "account_pkey" PRIMARY KEY, btree (id) > "account_createddate" btree (createddate) > "account_id_prefix" btree (id text_pattern_ops) > "account_recordtypeid" btree (recordtypeid) > >> >>> It takes 2h to import a `pg_dump` of the entire DB. This seems like a >>> baseline we should reasonably target. >> >> Well, maybe. You didn't say what percentage of the DB you're updating. > > It can be 10-50% of rows changed - a large portion. > >> >> But the thing that comes to mind here is that you're probably incurring >> a network round trip for each row, and maybe a query-planning round as >> well, so you really can't expect that this is going to be anywhere near >> as efficient as a bulk load operation. You could presumably get rid of >> the planner overhead by using a prepared statement. Cutting the network >> overhead is going to require a bit more ingenuity --- could you move >> some logic into a stored procedure, perhaps, so that one command from >> the client is sufficient to update multiple rows? > > You're right, we're only sequentially issuing (unprepared) UPDATEs. > > If we ship many UPDATE statements per call to our DB API's execution > function (we're using Python's psycopg2 if that matters, but I think > that just binds libpq), would that avoid the network round trip per > statement? > > If not, what if we use anonymous procedures (DO) to run multiple > UPDATE statements? > > Finally, we could use the technique highlighted in my third bullet and > use COPY (or at least multiple-value INSERT), then merging the new > data with the old. Would that be the most direct route to maximum > performance? > > In any case, I assume deleting and rebuilding indexes is important > here, yes? But what about raising checkpoint_segments - does this > actually help sustained throughput? (I ask because I'm wondering if raising checkpoint_segments simply postpones inevitable work, or if collecting a larger amount of changes really does dramatically improve throughput somehow.) > >> >> regards, tom lane > > > -- > Yang Zhang > http://yz.mit.edu/ -- Yang Zhang http://yz.mit.edu/ -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general