Jason Ralph <jralph@xxxxxxxxxxxxxxxxxxxxx> writes:
> Hello List,
> PostgreSQL 11.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23), 64-bit
>
> I am planning an update on a table with 20 million records and have been researching best practices. I will remove all indexes and foreign keys prior to the update; however, I am not sure if I should use a transaction or not.
> My thought process is that a transaction would be easier to recover from if something fails; however, it would take more time to write to the WAL log in a transaction.
>
> Would it make sense to make a backup of the table and then execute the update without a transaction? How would you guys do it?

This is really the sort of thing you need to test in a development environment. There are just too many unknowns to provide a definitive answer. You need to run this with minimal 'adjustments' in a dev scenario to get a baseline and then decide what, if anything, you need to do.

Just because you have 20M records doesn't mean you have to do anything 'special'. A lot will depend on how many of the 20M need to be updated, the size of the records and fields being updated, how the updated records are spread through the table, what other processes are updating or querying the table during the operation, which indexes are actually used and which are unnecessary or just update overhead, etc. Trying to optimise this operation without knowing where the bottlenecks are is extremely difficult.

Unless you know that every one of the 20M records will be updated, I wouldn't rush into removing all indexes. I would definitely look at breaking up the update into smaller transactions (see the sketch at the end of this message for one way to batch it).

As to the question of whether to use a transaction or not, it really depends on your data and how easily you can recover from a partially completed update. Assume the update is going to be interrupted before completion. If you can recover from that situation easily, then not using a full transaction with commit/rollback may be acceptable. Otherwise, I would avoid it. At any rate, baseline first, as you may be surprised.

I was recently working on an application using PG 9.6 which had numerous web clients and a backend process that ran multiple times a day and updated millions of rows per run in a table with over 100M records. The full update would take around 20-30 minutes but consisted of multiple transactions. The data was part of a weather information system used for predictive modelling: rainfall, max/min temperature, humidity and vapour pressure for 9am and 3pm, and solar radiation, for grid points 5km apart covering all of Australia. The table had records for every day of the last 4+ years and, as yet, does not use partitioning. That is a lot of records, but each record is quite small. While at some point we will need to look at additional optimisations like partitioning, performance is currently within acceptable limits. The only optimisation we have done is basic db tuning, which brought a significant improvement, but none of it was specific to this large table or the update process.

The key point is that a large number of records doesn't necessarily mean an update will be slow (or, more specifically, too slow for acceptable performance, whatever that is). Assuming you will need to take lots of special action may be premature - you may need to do none, or only a couple of things. Get a baseline first and you will know how big an issue you have; you will also be able to determine whether anything you try has any benefit.
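For the batching approach mentioned above, a minimal sketch in PL/pgSQL might look something like the following. The table and column names (big_table, id, some_col) and the batch size are placeholders rather than anything from the original post, and it assumes an integer primary key whose range you can step through. It also relies on PostgreSQL 11's ability to COMMIT inside a DO block, which only works when the block is not itself run inside an outer transaction.

    -- Hypothetical names (big_table, id, some_col); adjust to your schema.
    DO $$
    DECLARE
        batch_size constant bigint := 50000;
        cur_id     bigint;
        max_id     bigint;
    BEGIN
        SELECT min(id), max(id) INTO cur_id, max_id FROM big_table;
        WHILE cur_id <= max_id LOOP
            UPDATE big_table
               SET some_col = 'new value'      -- your real SET/WHERE logic here
             WHERE id >= cur_id
               AND id <  cur_id + batch_size;
            COMMIT;  -- each batch is its own transaction (PostgreSQL 11+)
            cur_id := cur_id + batch_size;
        END LOOP;
    END $$;

If a run like this is interrupted, the batches already committed stay applied, so it only makes sense when re-running the remainder (or re-running the whole thing idempotently) is acceptable, which comes back to the recovery question above. On that point, one cheap fallback, assuming you have the disk space, is to snapshot the table before you start, e.g.

    CREATE TABLE big_table_backup AS TABLE big_table;  -- copies data only, not indexes or constraints

but again, measure on a dev copy first before deciding any of this is necessary.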
--
Tim Cross