About the best approach with large datasets: there is rarely an "always true" best principle...
There are usually a few ways to do it, and only testing confirms which one is best - it depends on many things like hardware, OS, etc... Sometimes it even depends on the dataset being updated...
"
CREATE TEMP TABLE tmp AS
SELECT * FROM foo;
TRUNCATE foo;
INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);
"
The above doesn't make sense to me...
I would do:
CREATE TEMP TABLE tmp AS
SELECT * FROM foo where 1=2;
COPY tmp FROM 'pathtofile';
UPDATE foo
SET col1 = tmp.col1,
.
.
.
    col15 = tmp.col15
FROM tmp
WHERE foo.id = tmp.id;
That is in case I know I need just updates... If my dataset has a mix of rows to update and potentially new rows,
then instead of the UPDATE command I would do:
INSERT INTO foo
SELECT * FROM tmp;
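Note that if id is the primary key on foo, inserting all of tmp as-is would hit duplicate-key errors for the rows that already exist. A minimal sketch for the mixed case (assuming id is the key) is to keep the UPDATE above and then insert only the genuinely new rows:

-- After the UPDATE ... FROM tmp above, add only the rows whose id
-- is not yet present in foo.
INSERT INTO foo
SELECT * FROM tmp
WHERE tmp.id NOT IN (SELECT id FROM foo);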
On Saturday, April 27, 2013, Yang Zhang wrote:
On Sat, Apr 27, 2013 at 3:06 AM, Misa Simic <misa.simic@xxxxxxxxx> wrote:
> I don't know - you can test :)
I probably will, but I do have a huge stack of such experiments to run
by now, and it's always tricky and takes care to get benchmarks right
(avoiding disk caches, etc.). Certainly I think it would be helpful (or at
least hopefully not harmful) to ask here to see if anyone might just
know. That's what brought me to this list. :)
>
> In the whole solution only one command is different - so it is easy to test and
> compare...
>
> To me it doesn't sound faster... It sounds like more operations are needed than
> should be done...
>
> And it produces more problems... i.e., what about table foo? What if another table
> references foo, etc...
Yep, I guess more specifically I was just thinking of dumping to a temp table:
CREATE TEMP TABLE tmp AS
SELECT * FROM foo;
TRUNCATE foo;
INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);
The question I have remaining is whether the bulk UPDATE will be able
to update many rows efficiently (smartly order them to do largely
sequential scans) - if so, I imagine it would be faster than the
above.
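For what it's worth, one way to answer that empirically is to run the bulk UPDATE under EXPLAIN (ANALYZE, BUFFERS) in a transaction and roll it back - a minimal sketch, reusing the table/column names from the examples in this thread:

BEGIN;

-- The update really executes, but ROLLBACK discards it; the plan and
-- buffer counts show how many pages it touched and whether the join
-- was index-driven or sequential.
EXPLAIN (ANALYZE, BUFFERS)
UPDATE foo
SET col1 = bar.col1
FROM bar
WHERE foo.id = bar.id;

ROLLBACK;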
>
> On Saturday, April 27, 2013, Yang Zhang wrote:
>>
>> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa.simic@xxxxxxxxx> wrote:
>> > Hi,
>> >
>> > If the dataset for update is large...
>> >
>> > Maybe best would be:
>> >
>> > From the client machine, instead of sending update statements with data,
>> > export the data to a file ready for the COPY command
>> > Transfer the file to the server where pg is running
>> > Make a pgsql function which:
>> >
>> > Create temp table
>> > Copy to temp from the file
>> >
>> > Update original table with values in temp
>> >
>> > UPDATE foo
>> > SET col1 = bar.col1
>> > FROM bar
>> > WHERE foo.id = bar.id
>> >
>> > You don't need to do delete/insert - if you have just update commands...
>> >
>> > From the client, when the file is transferred, call your import function on
>> > the server
>> >
>> > Optionally you can run vacuum analyze after the bulk operation...
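For concreteness, a minimal sketch of what such a server-side import function could look like - the function name, column names, and file path here are placeholders, and COPY from a server-side file requires superuser privileges:

-- Sketch only: shape of the import function described in the quoted steps.
CREATE OR REPLACE FUNCTION import_foo_updates(path text)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    -- Temp table with the same columns as foo, dropped at commit.
    CREATE TEMP TABLE tmp (LIKE foo INCLUDING DEFAULTS) ON COMMIT DROP;

    -- COPY is built dynamically because the file path is a parameter.
    EXECUTE format('COPY tmp FROM %L', path);

    -- Apply the updates in one statement.
    UPDATE foo
    SET col1 = tmp.col1   -- ... repeat for the other columns ...
    FROM tmp
    WHERE foo.id = tmp.id;
END;
$$;

-- From the client, once the file has been transferred to the server:
--   SELECT import_foo_updates('pathtofile');
-- VACUUM ANALYZE cannot run inside a function, so run it afterwards:
--   VACUUM ANALYZE foo;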
>>
>> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
>> scattered around) to mark rows as dead (per MVCC)? I guess it comes
>> down to: will PG be smart enough to mark dead rows in largely
>> sequential scans (rather than, say, jumping around in whatever order
>> rows from foo are yielded by the above join)?
>>
>> In other words, when considering the alternative of:
>>
>> CREATE TABLE newfoo AS
>> SELECT * FROM bar
>> UNION
>> SELECT * FROM foo
>> WHERE id NOT IN (SELECT id FROM bar);
>>
>> Wouldn't this alternative be faster?
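Worth noting for the comparison: with the rebuild-into-a-new-table route, the new table still has to be swapped in and its primary key, other indexes, and any foreign keys pointing at foo recreated (the concern raised earlier in the thread). A rough sketch of those extra steps, with illustrative names:

BEGIN;

CREATE TABLE newfoo AS
SELECT * FROM bar
UNION
SELECT * FROM foo
WHERE id NOT IN (SELECT id FROM bar);

-- Swap the tables, then rebuild what the old table had.
ALTER TABLE foo RENAME TO foo_old;
ALTER TABLE newfoo RENAME TO foo;

ALTER TABLE foo ADD PRIMARY KEY (id);
-- CREATE INDEX ... (any other indexes the old foo had);
-- plus re-point any foreign keys in other tables that referenced foo.

COMMIT;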
>>
>> >
>> > Kind regards,
>> >
>> > Misa
>> >
>> >
>> >
>> > On Saturday, April 27, 2013, Yang Zhang wrote:
>> >>
>> >> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> >> > Yang Zhang <yanghatespam@xxxxxxxxx> writes:
>> >> >> It currently takes up to 24h for us to run a large set of UPDATE
>> >> >> statements on a database, which are of the form:
>> >> >
>> >> >> UPDATE table SET field1 = constant1, field2 = constant2, ...
>> >> >> WHERE
>> >> >> id = constid
>> >> >
>> >> >> (We're just overwriting fields of objects identified by ID.)
>> >> >
>> >> > Forgive the obvious question, but you do have an index on "id",
>> >> > right?
>> >> > Have you checked it's being used (ie EXPLAIN ANALYZE on one of
>> >> > these)?
>> >>
>> >> Totally valid question. That is the primary key with its own index.
>> >> Yes, we verified that explain says it just uses a simple index scan.
>> >> Each individual query runs reasonably quickly (we can run several
>> >> dozen such statements per second).
>> >>
>> >> >
>> >> >> The tables have handfuls of indices each and no foreign key
>> >> >> constraints.
>> >> >
>> >> > How much is a "handful"?
>> >>
>> >> The table with the largest volume of updates (our bottleneck) has four
>> >> indexes:
>> >>
>> >> "account_pkey" PRIMARY KEY, btree (id)
>> >> "account_createddate" btree (createddate)
>> >> "account_id_prefix" btree (id text_pattern_ops)
>> >> "account_recordtypeid" btree (recordtypeid)
>> >>
>> >> >
>> >> >> It takes 2h to import a `pg_dump` of the entire DB. This seems like
>> >> >> a
>