On Sat, Apr 27, 2013 at 2:54 AM, Yang Zhang <yanghatespam@xxxxxxxxx> wrote:
> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa.simic@xxxxxxxxx> wrote:
>> Hi,
>>
>> If the dataset for the update is large...
>>
>> Maybe the best approach would be:
>>
>> From the client machine, instead of sending UPDATE statements with the
>> data - export the data to a file ready for the COPY command.
>> Transfer the file to the server where pg is running.
>> Make a pl/pgsql function which:
>>
>> Creates a temp table
>> Copies into the temp table from the file
>>
>> Updates the original table with the values in the temp table:
>>
>> UPDATE foo
>> SET col1 = bar.col1
>> FROM bar
>> WHERE foo.id = bar.id
>>
>> You don't need to do delete/insert - if you have just update commands...
>>
>> From the client, when the file is transferred, call your import
>> function on the server.
>>
>> Optionally you can run VACUUM ANALYZE after the bulk operation...
>
> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
> scattered around) to mark rows as dead (per MVCC)?  I guess it comes
> down to: will PG be smart enough to mark dead rows in largely
> sequential scans (rather than, say, jumping around in whatever order
> rows from foo are yielded by the above join)?

(This then raises the question - how might I see this seemingly
substantial performance implication, one way or the other, in, say,
EXPLAIN output or something like that?)

>
> In other words, when considering the alternative of:
>
> CREATE TABLE newfoo AS
> SELECT * FROM bar
> UNION
> SELECT * FROM foo
> WHERE id NOT IN (SELECT id FROM bar);
>
> Wouldn't this alternative be faster?
>
>> Kind regards,
>>
>> Misa
>>
>> On Saturday, April 27, 2013, Yang Zhang wrote:
>>>
>>> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>>> > Yang Zhang <yanghatespam@xxxxxxxxx> writes:
>>> >> It currently takes up to 24h for us to run a large set of UPDATE
>>> >> statements on a database, which are of the form:
>>> >
>>> >>     UPDATE table SET field1 = constant1, field2 = constant2, ...
>>> >>     WHERE id = constid
>>> >
>>> >> (We're just overwriting fields of objects identified by ID.)
>>> >
>>> > Forgive the obvious question, but you do have an index on "id", right?
>>> > Have you checked it's being used (i.e. EXPLAIN ANALYZE on one of these)?
>>>
>>> Totally valid question.  That is the primary key, with its own index.
>>> Yes, we verified that EXPLAIN shows it just uses a simple index scan.
>>> Each individual query runs reasonably quickly (we can run several
>>> dozen such statements per second).
>>>
>>> >> The tables have handfuls of indices each and no foreign key
>>> >> constraints.
>>> >
>>> > How much is a "handful"?
>>>
>>> The table with the largest volume of updates (our bottleneck) has four
>>> indexes:
>>>
>>>     "account_pkey" PRIMARY KEY, btree (id)
>>>     "account_createddate" btree (createddate)
>>>     "account_id_prefix" btree (id text_pattern_ops)
>>>     "account_recordtypeid" btree (recordtypeid)
>>>
>>> >> It takes 2h to import a `pg_dump` of the entire DB.  This seems like
>>> >> a baseline we should reasonably target.
>>> >
>>> > Well, maybe.  You didn't say what percentage of the DB you're updating.
>>>
>>> It can be 10-50% of rows changed - a large portion.
>>>
>>> > But the thing that comes to mind here is that you're probably incurring
>>> > a network round trip for each row, and maybe a query-planning round as
>>> > well, so you really can't expect that this is going to be anywhere near
>>> > as efficient as a bulk load operation.  You could presumably get rid of
>>> > the planner overhead by using a prepared statement.  Cutting the network
>>> > overhead is going to require a bit more ingenuity --- could you move
>>> > some logic into a stored procedure, perhaps, so that one command from
>>> > the client is sufficient to update multiple rows?
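(As a rough illustration of the prepared-statement idea - this reuses the
generic field1/field2 form from above against the account table, with
made-up id values, so treat the column names and values as placeholders:)

-- parsed once up front; repeated EXECUTEs avoid re-parsing/re-planning
-- each statement
PREPARE upd(text, text, text) AS
    UPDATE account SET field1 = $2, field2 = $3 WHERE id = $1;

-- then one lightweight EXECUTE per row
EXECUTE upd('001A01', 'value-a', 'value-b');
EXECUTE upd('001A02', 'value-c', 'value-d');

DEALLOCATE upd;

(That still costs one network round trip per EXECUTE, though, which is
where the COPY-into-a-staging-table approach discussed in this thread
comes in.)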
>>> You're right, we're only sequentially issuing (unprepared) UPDATEs.
>>>
>>> If we ship many UPDATE statements per call to our DB API's execution
>>> function (we're using Python's psycopg2, if that matters, but I think
>>> that just binds libpq), would that avoid the network round trip per
>>> statement?
>>>
>>> If not, what if we use anonymous procedures (DO) to run multiple
>>> UPDATE statements?
>>>
>>> Finally, we could use the technique highlighted in my third bullet:
>>> use COPY (or at least a multiple-value INSERT), then merge the new
>>> data with the old.  Would that be the most direct route to maximum
>>> performance?
>>>
>>> In any case, I assume deleting and rebuilding indexes is important
>>> here, yes?  But what about raising checkpoint_segments - does this
>>> actually help sustained throughput?
>>>
>>> > regards, tom lane

--
Yang Zhang
http://yz.mit.edu/

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
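(For reference, a rough end-to-end sketch of the COPY-into-a-staging-table
merge discussed above, reusing the foo/bar/col1 names from earlier in the
thread; the file path, CSV format, and second column are made up for
illustration:)

-- Run on the server once the exported data file has been transferred there.
-- (Server-side COPY ... FROM 'file' normally requires superuser; psql's
-- \copy is the client-side alternative.)
BEGIN;

CREATE TEMP TABLE bar (LIKE foo) ON COMMIT DROP;

COPY bar FROM '/tmp/foo_updates.csv' WITH (FORMAT csv);

UPDATE foo
SET    col1 = bar.col1,
       col2 = bar.col2
FROM   bar
WHERE  foo.id = bar.id;

COMMIT;

-- optionally, as suggested above:
VACUUM ANALYZE foo;

(Misa's suggested pl/pgsql import function would essentially wrap these
steps so the client only has to issue a single call once the file is in
place.)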