Tim Cross <theophilusx@xxxxxxxxx> writes:
> Jason Ralph <jralph@xxxxxxxxxxxxxxxxxxxxx> writes:
>> I am planning an update on a table with 20 million records, and I have
>> been researching the best practices.  I will remove all indexes and
>> foreign keys prior to the update, but I am not sure whether I should
>> use a transaction or not.
>> My thought process is that a transaction would be easier to recover
>> from if something fails; however, it would take more time to write to
>> the WAL log in a transaction.
>>
>> Would it make sense to make a backup of the table and then execute the
>> update without a transaction?  How would you guys do it?

> Just because you have 20M records, it doesn't mean that you have to do
> anything 'special'.

Indeed.  That's really not a lot of data, on any modern machine.

Just for experiment, I tried updating a table of 20M rows on my
development machine (no powerhouse by any serious standard, plus it's
running a debug build which is likely 30% or more slower than
production):

regression=# \timing
Timing is on.
regression=# create table foo as select x, 'what' as w from generate_series(1,20000000) x;
SELECT 20000000
Time: 23573.818 ms (00:23.574)
regression=# update foo set w = 'other';
UPDATE 20000000
Time: 60878.418 ms (01:00.878)

OK, that's cheating, because I had no indexes, so ...

regression=# create index on foo(w);
CREATE INDEX
Time: 11774.094 ms (00:11.774)
regression=# update foo set w = x::text;
UPDATE 20000000
Time: 186425.606 ms (03:06.426)

That may still be cheating, because I had no foreign-key constraints or
other possibly-expensive features.  And you might have lots of indexes,
not just one.  Still, this doesn't seem to be territory where you need
to spend days researching the best alternative.  I'm not even convinced
that it's worth your time to remove and rebuild indexes.

But the real reason I'm commenting is that the original question seems
to embed a whole lot of misunderstanding about how Postgres works.
There is no such thing as updating "outside a transaction".  You can
use an explicit transaction (ie BEGIN + COMMIT) if you'd like to
preserve the option to roll back after the UPDATE command proper, but
that will make absolutely no difference to the performance of the
UPDATE.  Likewise, there is no such thing as not writing WAL (unless
you're using an UNLOGGED table, which you should only ever do for
throwaway data).  And also, there's unlikely to be any performance
benefit from carving up the update into smaller pieces.  If you're
seriously short of disk space you might need to split up the update
(because just doing it in one go will about double the table's
size-on-disk) and vacuum between the stages to reclaim disk space.
But that will take more time, not less ... and again, on any modern
machine, such steps shouldn't be necessary for only 20M rows.

			regards, tom lane
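
[Editorial note: for concreteness, the explicit-transaction form mentioned
above is just a sketch like the following, reusing the throwaway table
"foo" from the timing example; it changes nothing about WAL or speed, it
only keeps the option of rolling back open until you commit:]

    BEGIN;
    UPDATE foo SET w = 'other';
    -- inspect the result (row count, spot-check some rows), then either
    COMMIT;
    -- or, if something looks wrong, instead of COMMIT:
    -- ROLLBACK;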
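
[Editorial note: and if disk space really were the constraint, the
"split up and vacuum between the stages" approach could look roughly like
this sketch, again against the hypothetical "foo" table, with arbitrary
batch boundaries on its x column; note each UPDATE must commit before the
VACUUM, since VACUUM cannot run inside a transaction block:]

    -- first slice
    UPDATE foo SET w = x::text WHERE x BETWEEN 1 AND 5000000;
    VACUUM foo;   -- make the dead row versions reusable before the next slice
    -- second slice
    UPDATE foo SET w = x::text WHERE x BETWEEN 5000001 AND 10000000;
    VACUUM foo;
    -- ... and so on for the remaining slices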