Jason Ralph <jralph@xxxxxxxxxxxxxxxxxxxxx> writes:
> Hello List,
> PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
>
> I am planning an update on a table with 20 million records and have been researching best practices. I will remove all indexes and foreign keys prior to the update; however, I am not sure if I should use a transaction or not.
> My thought process is that a transaction would be easier to recover from if something fails; however, it would take more time to write to the WAL log in a transaction.
>
> Would it make sense to make a backup of the table and then execute the update without a transaction? How would you guys do it?

This is really the sort of thing you need to test in a development environment. There are just too many unknowns to provide a definitive answer. You need to run this with minimal 'adjustments' in a dev scenario to get a baseline and then decide what, if anything, you need to do.

Just because you have 20M records doesn't mean you have to do anything 'special'. A lot will depend on how many of the 20M need to be updated, the size of the records and fields being updated, how the updated records are spread through the table, what other processes are updating or querying the table during the operation, which indexes are actually used and which are unnecessary or just update overhead, etc. Trying to optimise this operation without knowing where the bottlenecks are is extremely difficult.

Unless you know that every one of the 20M records will be updated, I wouldn't rush into removing all indexes. I would definitely look at breaking up the update into smaller transactions (see the sketch at the end of this message for one way to batch it).

As to the question of whether to use a transaction or not, it really depends on your data and how easily you can recover from a partially completed update. Assume the update is going to be interrupted before completion. If you can recover from that situation easily, then not using a full transaction with commit/rollback may be acceptable. Otherwise, I would avoid it. At any rate, baseline first, as you may be surprised.

I was recently working on an application using PG 9.6 which had numerous web clients and a backend process that ran multiple times a day and updated millions of rows per run in a table with over 100M records. The full update would take around 20-30 minutes but consisted of multiple transactions. The data was part of a weather information system used for predictive modelling: rainfall, max/min temperature, humidity and vapour pressure for 9am and 3pm, and solar radiation, for grid points 5km apart covering all of Australia. The table had records for every day of the last 4+ years and, as yet, does not use partitioning. That is a lot of records, but each record is quite small. While at some point we will need to look at additional optimisations like partitioning, performance is currently within acceptable limits. The only optimisation we have done is basic db tuning, which brought a significant improvement, but none of it was specific to this large table or the update process.

The key point is that a large number of records doesn't necessarily mean an update will be slow (or, more specifically, too slow for acceptable performance, whatever that is). Assuming you will need to take lots of special action may be premature - you may need to do none, or only a couple of things. Get a baseline first and you will know how big an issue you have; you will also be able to determine whether anything you try has any benefit.
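For the batching approach mentioned above, a minimal sketch in PL/pgSQL might look something like the following. The table and column names (big_table, id, some_col) and the batch size are placeholders rather than anything from the original post, and it assumes an integer primary key whose range you can step through. It also relies on PostgreSQL 11's ability to COMMIT inside a DO block, which only works when the block is not itself run inside an outer transaction.

    -- Hypothetical names (big_table, id, some_col); adjust to your schema.
    DO $$
    DECLARE
        batch_size constant bigint := 50000;
        cur_id     bigint;
        max_id     bigint;
    BEGIN
        SELECT min(id), max(id) INTO cur_id, max_id FROM big_table;
        WHILE cur_id <= max_id LOOP
            UPDATE big_table
               SET some_col = 'new value'      -- your real SET/WHERE logic here
             WHERE id >= cur_id
               AND id <  cur_id + batch_size;
            COMMIT;  -- each batch is its own transaction (PostgreSQL 11+)
            cur_id := cur_id + batch_size;
        END LOOP;
    END $$;

If a run like this is interrupted, the batches already committed stay applied, so it only makes sense when re-running the remainder (or re-running the whole thing idempotently) is acceptable, which comes back to the recovery question above. On that point, one cheap fallback, assuming you have the disk space, is to snapshot the table before you start, e.g.

    CREATE TABLE big_table_backup AS TABLE big_table;  -- copies data only, not indexes or constraints

but again, measure on a dev copy first before deciding any of this is necessary.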
--
Tim Cross