About the best approach with large datasets: there is rarely an "always true" best principle...
There are usually a few ways to do it, and only testing confirms which one is best - it depends on many things like hardware, OS, etc... Sometimes it even depends on the dataset being updated...
"
CREATE TEMP TABLE tmp AS
SELECT * FROM foo;
TRUNCATE foo;
INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);
"
The above doesn't make sense to me...
I would do:
CREATE TEMP TABLE tmp AS
SELECT * FROM foo where 1=2;
COPY tmp FROM 'pathtofile';
UPDATE foo
SET col1 = tmp.col1,
.
.
.
    col15 = tmp.col15
FROM tmp
WHERE foo.id = tmp.id;
That is in case I know I need just updates... If my dataset has a mix of rows to update and potentially new rows,
then instead of the UPDATE command I would do:
INSERT INTO foo
SELECT * FROM tmp;
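Note that if id is the primary key on foo, inserting all of tmp as-is would hit duplicate-key errors for the rows that already exist. A minimal sketch for the mixed case (assuming id is the key) is to keep the UPDATE above and then insert only the genuinely new rows:

-- After the UPDATE ... FROM tmp above, add only the rows whose id
-- is not yet present in foo.
INSERT INTO foo
SELECT * FROM tmp
WHERE tmp.id NOT IN (SELECT id FROM foo);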
On Saturday, April 27, 2013, Yang Zhang wrote:
On Sat, Apr 27, 2013 at 3:06 AM, Misa Simic <misa.simic@xxxxxxxxx> wrote:
> I don't know - you can test :)
I probably will, but I do have a huge stack of such experiments to run
by now, and it's always tricky and takes care to get benchmarks right
(avoiding disk caches, etc.). Certainly I think it would be helpful (or at
least hopefully not harmful) to ask here to see if anyone might just
know. That's what brought me to this list. :)
>
> In the whole solution only one command is different - so it is easy to test and
> compare...
>
> To me it doesn't sound faster... It sounds like more operations are needed than
> should be done...
>
> And it produces more problems... i.e., what about table foo? What if another table
> references foo, etc...
Yep, I guess more specifically I was just thinking of dumping to a temp table:
CREATE TEMP TABLE tmp AS
SELECT * FROM foo;
TRUNCATE foo;
INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);
The question I have remaining is whether the bulk UPDATE will be able
to update many rows efficiently (smartly order them to do largely
sequential scans) - if so, I imagine it would be faster than the
above.
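For what it's worth, one way to answer that empirically is to run the bulk UPDATE under EXPLAIN (ANALYZE, BUFFERS) in a transaction and roll it back - a minimal sketch, reusing the table/column names from the examples in this thread:

BEGIN;

-- The update really executes, but ROLLBACK discards it; the plan and
-- buffer counts show how many pages it touched and whether the join
-- was index-driven or sequential.
EXPLAIN (ANALYZE, BUFFERS)
UPDATE foo
SET col1 = bar.col1
FROM bar
WHERE foo.id = bar.id;

ROLLBACK;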
>
> On Saturday, April 27, 2013, Yang Zhang wrote:
>>
>> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa.simic@xxxxxxxxx> wrote:
>> > Hi,
>> >
>> > If the dataset for update is large...
>> >
>> > Maybe best would be:
>> >
>> > From the client machine, instead of sending update statements with data,
>> > export the data to a file ready for the COPY command
>> > Transfer the file to the server where pg is running
>> > Make a pgsql function which:
>> >
>> > Create temp table
>> > Copy to temp from the file
>> >
>> > Update original table with values in temp
>> >
>> > UPDATE foo
>> > SET col1 = bar.col1
>> > FROM bar
>> > WHERE foo.id = bar.id
>> >
>> > You don't need to do delete/insert - if you have just update commands...
>> >
>> > From the client, when the file is transferred, call your import function on
>> > the server
>> >
>> > Optionally you can run vacuum analyze after the bulk operation...
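For concreteness, a minimal sketch of what such a server-side import function could look like - the function name, column names, and file path here are placeholders, and COPY from a server-side file requires superuser privileges:

-- Sketch only: shape of the import function described in the quoted steps.
CREATE OR REPLACE FUNCTION import_foo_updates(path text)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    -- Temp table with the same columns as foo, dropped at commit.
    CREATE TEMP TABLE tmp (LIKE foo INCLUDING DEFAULTS) ON COMMIT DROP;

    -- COPY is built dynamically because the file path is a parameter.
    EXECUTE format('COPY tmp FROM %L', path);

    -- Apply the updates in one statement.
    UPDATE foo
    SET col1 = tmp.col1   -- ... repeat for the other columns ...
    FROM tmp
    WHERE foo.id = tmp.id;
END;
$$;

-- From the client, once the file has been transferred to the server:
--   SELECT import_foo_updates('pathtofile');
-- VACUUM ANALYZE cannot run inside a function, so run it afterwards:
--   VACUUM ANALYZE foo;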
>>
>> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
>> scattered around) to mark rows as dead (per MVCC)? I guess it comes
>> down to: will PG be smart enough to mark dead rows in largely
>> sequential scans (rather than, say, jumping around in whatever order
>> rows from foo are yielded by the above join)?
>>
>> In other words, when considering the alternative of:
>>
>> CREATE TABLE newfoo AS
>> SELECT * FROM bar
>> UNION
>> SELECT * FROM foo
>> WHERE id NOT IN (SELECT id FROM bar);
>>
>> Wouldn't this alternative be faster?
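Worth noting for the comparison: with the rebuild-into-a-new-table route, the new table still has to be swapped in and its primary key, other indexes, and any foreign keys pointing at foo recreated (the concern raised earlier in the thread). A rough sketch of those extra steps, with illustrative names:

BEGIN;

CREATE TABLE newfoo AS
SELECT * FROM bar
UNION
SELECT * FROM foo
WHERE id NOT IN (SELECT id FROM bar);

-- Swap the tables, then rebuild what the old table had.
ALTER TABLE foo RENAME TO foo_old;
ALTER TABLE newfoo RENAME TO foo;

ALTER TABLE foo ADD PRIMARY KEY (id);
-- CREATE INDEX ... (any other indexes the old foo had);
-- plus re-point any foreign keys in other tables that referenced foo.

COMMIT;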
>>
>> >
>> > Kind regards,
>> >
>> > Misa
>> >
>> >
>> >
>> > On Saturday, April 27, 2013, Yang Zhang wrote:
>> >>
>> >> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> >> > Yang Zhang <yanghatespam@xxxxxxxxx> writes:
>> >> >> It currently takes up to 24h for us to run a large set of UPDATE
>> >> >> statements on a database, which are of the form:
>> >> >
>> >> >> UPDATE table SET field1 = constant1, field2 = constant2, ...
>> >> >> WHERE
>> >> >> id = constid
>> >> >
>> >> >> (We're just overwriting fields of objects identified by ID.)
>> >> >
>> >> > Forgive the obvious question, but you do have an index on "id",
>> >> > right?
>> >> > Have you checked it's being used (ie EXPLAIN ANALYZE on one of
>> >> > these)?
>> >>
>> >> Totally valid question. That is the primary key with its own index.
>> >> Yes, we verified that explain says it just uses a simple index scan.
>> >> Each individual query runs reasonably quickly (we can run several
>> >> dozen such statements per second).
>> >>
>> >> >
>> >> >> The tables have handfuls of indices each and no foreign key
>> >> >> constraints.
>> >> >
>> >> > How much is a "handful"?
>> >>
>> >> The table with the largest volume of updates (our bottleneck) has four
>> >> indexes:
>> >>
>> >> "account_pkey" PRIMARY KEY, btree (id)
>> >> "account_createddate" btree (createddate)
>> >> "account_id_prefix" btree (id text_pattern_ops)
>> >> "account_recordtypeid" btree (recordtypeid)
>> >>
>> >> >
>> >> >> It takes 2h to import a `pg_dump` of the entire DB. This seems like
>> >> >> a
>