On Sat, Apr 27, 2013 at 3:06 AM, Misa Simic <misa.simic@xxxxxxxxx> wrote:
> I dont know - u can test :)

I probably will, but I have a large backlog of such experiments to run
by now, and it always takes care to get benchmarks right (avoiding disk
caches and so on).  Certainly I think it's helpful - or at least not
harmful - to ask here first in case someone already knows.  That's what
brought me to this list. :)

>
> In whole solution it is just one command different - so easy to test and
> compare...
>
> To me it doesnt sound as faster... Sounds as more operation needed what
> should be done...
>
> And produce more problems...i.e what with table foo? What if another table
> refference foo etc...

Yep, I guess more specifically I was just thinking of dumping to a temp
table:

CREATE TEMP TABLE tmp AS SELECT * FROM foo;
TRUNCATE foo;
INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);

The remaining question is whether the bulk UPDATE can modify many rows
efficiently (ideally ordering the writes so that page access is largely
sequential) - if so, I imagine it would be faster than the above.
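
For comparison, the COPY-into-a-temp-table merge Misa describes below
would look roughly like this end to end (just a sketch: the staging
table, the col1/col2 columns and their types, and the file path are
placeholders, not our real schema):

CREATE TEMP TABLE pending_updates (id text PRIMARY KEY, col1 text, col2 text);

-- One bulk load instead of one client round trip per row.  COPY FROM a
-- file path reads the file on the server; from the client, psql's \copy
-- or COPY FROM STDIN does the equivalent.
COPY pending_updates FROM '/path/to/updates.csv' WITH (FORMAT csv);
ANALYZE pending_updates;

-- Apply every change in one statement, joining on the key.
UPDATE foo
SET col1 = u.col1,
    col2 = u.col2
FROM pending_updates u
WHERE foo.id = u.id;

That single UPDATE still has to touch (and leave dead tuples on) every
page holding an affected row - the MVCC concern raised below - but
unlike the TRUNCATE variant it never rewrites the unchanged rows and
keeps foo visible to other sessions the whole time.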
>
> On Saturday, April 27, 2013, Yang Zhang wrote:
>>
>> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa.simic@xxxxxxxxx> wrote:
>> > Hi,
>> >
>> > If dataset for update is large...
>> >
>> > Maybe best would be:
>> >
>> > From client machine, instead of sending update statements with data -
>> > export data to file ready for copy command
>> > Transfer file to the server where pg is running
>> > Make pgsql function which
>> >
>> > Create temp table
>> > Copy to temp from the file
>> >
>> > Update original table with values in temp
>> >
>> > UPDATE foo
>> > SET foo.col1 = bar.col1
>> > FROM bar
>> > WHERE foo.id = bar.id
>> >
>> > You dont need to do delete/insert - if you have just update comands....
>> >
>> > From client when file is transfered - call your import function on the
>> > server
>> >
>> > Optionaly you can run vacuum analyze after bulk operation...
>>
>> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
>> scattered around) to mark rows as dead (per MVCC)?  I guess it comes
>> down to: will PG be smart enough to mark dead rows in largely
>> sequential scans (rather than, say, jumping around in whatever order
>> rows from foo are yielded by the above join)?
>>
>> In other words, when considering the alternative of:
>>
>> CREATE TABLE newfoo AS
>> SELECT * FROM bar
>> UNION
>> SELECT * FROM foo
>> WHERE id NOT IN (SELECT id FROM bar);
>>
>> Wouldn't this alternative be faster?
>>
>> > Kind regards,
>> >
>> > Misa
>> >
>> > On Saturday, April 27, 2013, Yang Zhang wrote:
>> >>
>> >> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> >> > Yang Zhang <yanghatespam@xxxxxxxxx> writes:
>> >> >> It currently takes up to 24h for us to run a large set of UPDATE
>> >> >> statements on a database, which are of the form:
>> >> >>
>> >> >>     UPDATE table SET field1 = constant1, field2 = constant2, ...
>> >> >>     WHERE id = constid
>> >> >>
>> >> >> (We're just overwriting fields of objects identified by ID.)
>> >> >
>> >> > Forgive the obvious question, but you do have an index on "id",
>> >> > right?  Have you checked it's being used (ie EXPLAIN ANALYZE on one
>> >> > of these)?
>> >>
>> >> Totally valid question.  That is the primary key with its own index.
>> >> Yes, we verified that explain says it just uses a simple index scan.
>> >> Each individual query runs reasonably quickly (we can run several
>> >> dozen such statements per second).
>> >>
>> >> >> The tables have handfuls of indices each and no foreign key
>> >> >> constraints.
>> >> >
>> >> > How much is a "handful"?
>> >>
>> >> The table with the largest volume of updates (our bottleneck) has four
>> >> indexes:
>> >>
>> >>     "account_pkey" PRIMARY KEY, btree (id)
>> >>     "account_createddate" btree (createddate)
>> >>     "account_id_prefix" btree (id text_pattern_ops)
>> >>     "account_recordtypeid" btree (recordtypeid)
>> >>
>> >> >> It takes 2h to import a `pg_dump` of the entire DB.  This seems
>> >> >> like a baseline we should reasonably target.
>> >> >
>> >> > Well, maybe.  You didn't say what percentage of the DB you're
>> >> > updating.
>> >>
>> >> It can be 10-50% of rows changed - a large portion.
>> >>
>> >> > But the thing that comes to mind here is that you're probably
>> >> > incurring a network round trip for each row, and maybe a
>> >> > query-planning round as well, so you really can't expect that this
>> >> > is going to be anywhere near as efficient as a bulk load operation.
>> >> > You could presumably get rid of the planner overhead by using a
>> >> > prepared statement.  Cutting the network overhead is going to
>> >> > require a bit more ingenuity --- could you move some logic into a
>> >> > stored procedure, perhaps, so that one command from the client is
>> >> > sufficient to update multiple rows?
>> >>
>> >> You're right, we're only sequentially issuing (unprepared) UPDATEs.
>> >>
>> >> If we ship many UPDATE statements per call to our DB API's execution
>> >> function (we're using Python's psycopg2 if that matters, but I think
>> >> that just binds libpq), would that avoid the network round trip per
>> >> statement?
>> >>
>> >> If not, what if we use anonymous procedures (DO) to run multiple
>> >> UPDATE statements?
>> >>
>> >> Finally, we could use the technique highlighted in my third bullet and
>> >> use COPY (or at least multiple-value INSERT), then merge the new
>> >> data with the old.  Would that be the most direct route to maximum
>> >> performance?
>> >>
>> >> In any case, I assume deleting and rebuilding indexes is important
>> >> here, yes?  But what about raising checkpoint_segments - does this
>> >> actually help sustained throughput?
>> >>
>> >> >                         regards, tom lane
>> >>
>> >> --
>> >> Yang Zhang
>> >> http://yz.mit.edu/
>> >>
>> >> --
>> >> Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
>> >> To make changes to your subscription:
>> >> http://www.postgresql.org/mailpref/pgsql-general

--
Yang Zhang
http://yz.mit.edu/

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
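
One way to get many rows into a single command - Tom's "one command
from the client" suggestion - without writing a stored procedure or
staging a file is to join the UPDATE against an inline VALUES list, so
each statement (and each network round trip) carries a whole batch.  A
sketch with placeholder table and column names and untyped literals;
real code would cast the VALUES columns to the target column types:

-- Each row of the VALUES list supplies the key plus the new values.
UPDATE foo
SET col1 = v.col1,
    col2 = v.col2
FROM (VALUES
        ('id-1', 'new value a', 'new value b'),
        ('id-2', 'new value c', 'new value d')
     ) AS v(id, col1, col2)
WHERE foo.id = v.id;

A client library such as psycopg2 only has to build and send one such
statement per batch of rows, which amortizes the per-row round-trip and
planning cost across the whole batch.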