
Re: Optimizing bulk update performance


Well,

About the best approach with large datasets - there is rarely an "always true" best principle...

There are always a few ways to do it - only a test confirms which one is best - it depends on many things like hardware, OS, etc... Sometimes it even depends on the dataset being updated...
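As a minimal sketch of how such a test could look (my addition - assuming psql and a disposable clone of the table; foo_copy and tmp are hypothetical names):

\timing on
BEGIN;
-- candidate approach: bulk UPDATE joined against staged rows in tmp
UPDATE foo_copy SET col1 = tmp.col1 FROM tmp WHERE foo_copy.id = tmp.id;
ROLLBACK;

Note that a rolled-back UPDATE still leaves dead tuples behind, so re-clone or VACUUM foo_copy between runs to keep the comparison fair.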

"
CREATE TEMP TABLE tmp AS
SELECT * FROM foo;

TRUNCATE foo;

INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);

"

The above doesn't make sense to me...

I would do:

CREATE TEMP TABLE tmp AS
SELECT * FROM foo WHERE 1=2;

COPY tmp FROM 'pathtofile';

UPDATE foo
SET col1 = tmp.col1,
    ...
    col15 = tmp.col15
FROM tmp
WHERE foo.id = tmp.id;
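One hedged addition, not part of the recipe above but usually worth it on large loads: ANALYZE the temp table (autovacuum never touches temp tables, so it has no statistics otherwise), and optionally index the join key:

ANALYZE tmp;               -- give the planner row counts for the temp table
CREATE INDEX ON tmp (id);  -- optional; for a near-full-table update a seq scan + hash join may win anyway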

That is for the case where I know I need just an update... If my dataset is a mix of updates and potentially new rows, then

instead of the UPDATE command I would do:

DELETE FROM foo WHERE EXISTS (SELECT 1 FROM tmp WHERE tmp.id = foo.id);
INSERT INTO foo
SELECT * FROM tmp;
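A minimal sketch of the same thing made atomic (my addition): wrapping both statements in one transaction means concurrent readers never see foo with the matching rows deleted but not yet re-inserted:

BEGIN;
DELETE FROM foo WHERE EXISTS (SELECT 1 FROM tmp WHERE tmp.id = foo.id);
INSERT INTO foo SELECT * FROM tmp;
COMMIT;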




On Saturday, April 27, 2013, Yang Zhang wrote:
On Sat, Apr 27, 2013 at 3:06 AM, Misa Simic <misa.simic@xxxxxxxxx> wrote:
> I don't know - you can test :)

I probably will, but I do have a huge stack of such experiments to run
by now, and it's always tricky / takes care to get benchmarks right,
avoid disk caches, etc.  Certainly I think it would be helpful (or at
least hopefully not harmful) to ask here to see if anyone might just
know.  That's what brought me to this list.  :)
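(A sketch of one way to watch cache effects without clearing anything, assuming the bulk UPDATE is the statement under test: EXPLAIN's BUFFERS option reports how many pages were hits in shared buffers versus reads from outside them.)

BEGIN;
-- note: EXPLAIN ANALYZE actually executes the UPDATE, hence the ROLLBACK
EXPLAIN (ANALYZE, BUFFERS)
UPDATE foo SET col1 = tmp.col1 FROM tmp WHERE foo.id = tmp.id;
ROLLBACK;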

>
> In the whole solution only one command is different - so it's easy to test
> and compare...
>
> To me it doesn't sound faster... It sounds like more operations than
> necessary...
>
> And it produces more problems... i.e. what about table foo? What if another
> table references foo, etc...

Yep, I guess more specifically I was just thinking of dumping to a temp table:

CREATE TEMP TABLE tmp AS
SELECT * FROM foo;

TRUNCATE foo;

INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);

The question I have remaining is whether the bulk UPDATE will be able
to update many rows efficiently (smartly order them to do largely
sequential scans) - if so, I imagine it would be faster than the
above.
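(A side note on the INSERT above, my observation rather than anything from the thread: NOT IN over a subquery returns no rows at all if bar.id can ever be NULL, and PostgreSQL usually plans NOT EXISTS as a more efficient anti-join. A hedged variant:)

INSERT INTO foo
SELECT * FROM bar
UNION ALL  -- the branches are disjoint on id, so UNION ALL skips UNION's dedup sort
SELECT t.*
FROM tmp t
WHERE NOT EXISTS (SELECT 1 FROM bar b WHERE b.id = t.id);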

>
> On Saturday, April 27, 2013, Yang Zhang wrote:
>>
>> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa.simic@xxxxxxxxx> wrote:
>> > Hi,
>> >
>> > If the dataset for update is large...
>> >
>> > Maybe the best approach would be:
>> >
>> > From the client machine, instead of sending UPDATE statements with data -
>> > export the data to a file ready for the COPY command
>> > Transfer the file to the server where pg is running
>> > Make a pl/pgsql function which:
>> >
>> > Creates a temp table
>> > Copies to temp from the file
>> >
>> > Updates the original table with the values in temp
>> >
>> > UPDATE foo
>> > SET col1 = bar.col1
>> > FROM bar
>> > WHERE foo.id = bar.id
>> >
>> > You don't need to do delete/insert - if you have just update commands...
>> >
>> > From the client, when the file is transferred - call your import function
>> > on the server
>> >
>> > Optionally you can run VACUUM ANALYZE after the bulk operation...
>>
>> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
>> scattered around) to mark rows as dead (per MVCC)?  I guess it comes
>> down to: will PG be smart enough to mark dead rows in largely
>> sequential scans (rather than, say, jumping around in whatever order
>> rows from foo are yielded by the above join)?
>>
>> In other words, when considering the alternative of:
>>
>> CREATE TABLE newfoo AS
>> SELECT * FROM bar
>> UNION
>> SELECT * FROM foo
>> WHERE id NOT IN (SELECT id FROM bar);
>>
>> Wouldn't this alternative be faster?
>>
>> >
>> > Kind regards,
>> >
>> > Misa
>> >
>> >
>> >
>> > On Saturday, April 27, 2013, Yang Zhang wrote:
>> >>
>> >> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> >> > Yang Zhang <yanghatespam@xxxxxxxxx> writes:
>> >> >> It currently takes up to 24h for us to run a large set of UPDATE
>> >> >> statements on a database, which are of the form:
>> >> >
>> >> >>     UPDATE table SET field1 = constant1, field2 = constant2, ...
>> >> >> WHERE
>> >> >>     id = constid
>> >> >
>> >> >> (We're just overwriting fields of objects identified by ID.)
>> >> >
>> >> > Forgive the obvious question, but you do have an index on "id",
>> >> > right?
>> >> > Have you checked it's being used (ie EXPLAIN ANALYZE on one of
>> >> > these)?
>> >>
>> >> Totally valid question.  That is the primary key with its own index.
>> Yes, we verified that EXPLAIN shows it just uses a simple index scan.
>> >> Each individual query runs reasonably quickly (we can run several
>> >> dozen such statements per second).
>> >>
>> >> >
>> >> >> The tables have handfuls of indices each and no foreign key
>> >> >> constraints.
>> >> >
>> >> > How much is a "handful"?
>> >>
>> >> The table with the largest volume of updates (our bottleneck) has four
>> >> indexes:
>> >>
>> >>     "account_pkey" PRIMARY KEY, btree (id)
>> >>     "account_createddate" btree (createddate)
>> >>     "account_id_prefix" btree (id text_pattern_ops)
>> >>     "account_recordtypeid" btree (recordtypeid)
>> >>
>> >> >
>> >> >> It takes 2h to import a `pg_dump` of the entire DB.  This seems like
>> >> >> a
>
