> On Dec 31, 2017, at 4:31 PM, David Rowley <david.rowley@xxxxxxxxxxxxxxx> wrote:
>
> On 1 January 2018 at 12:06, Rob Sargent <robjsargent@xxxxxxxxx> wrote:
>> I must update 3M of 100M records, with tuple-specific modifications. I can generate the necessary SQL, but I’m wondering if files of simple update statements affecting a single row are more effective than files of a function call doing the same update given the necessary values, including where-clause restrictions? The query plan set by the first should be decent for the remainder.
>>
>> Alternatively, would a bulk load into a table of replacement values and join info be the fastest way?
>
> It's probably also worth thinking about this table's usage pattern. If
> this table is an otherwise static table, then you may wish to think
> about the little bit of bloat that doing the UPDATEs in a single
> transaction would cause.
>

Sorry, I didn’t address the question about the table's usage. Currently we’re in a data-loading phase and this table is almost complete. Thereafter (post VACUUM ANALYZE) it will be 99.99% read-only. The remainder will be updated in much the same fashion as described earlier (set events_x = events_x + increment). This table will be analyzed a couple of ways, mainly determining significance thresholds across various slices, each of which examines roughly one tenth of the records.

>> Either way I can break the updates into roughly 393 transactions (7500 rows affected per tx) or 8646 transactions (350 rows per tx) if less is more in this world.
>
> If you were to perform the UPDATEs in batches it would allow you to
> run a VACUUM between the UPDATEs. However, it might not be so
> important as 3 million rows in 100 million is just 3%, so assuming all
> your rows are the same size, then even doing this as a single
> transaction would only cause 3% churn on the table. Possibly some of
> the UPDATEs would reuse existing free space within the table, but if
> they don't then it would only mean an extra 3% bloat.
>
> As for which is faster: it's most likely going to depend on the query
> plan for the UPDATE statements. If you need to perform 3 million seq
> scans on the table, by doing 3 million individual statements, that's
> likely not going to perform well. 3 million statements is likely not a
> good option in any case, as it means parsing and planning 3 million
> UPDATE statements. Even your 393 statements might not be very good if
> each of those UPDATEs must perform a seq scan on the 100 million row
> table, but if each of those 393 statements can make use of an index to
> easily get those 7500 rows, then that might be a better option than
> doing the single UPDATE join method you mentioned.
>
> It does sound like something you could take offline and benchmark if
> performance is that critical. It's not really possible for us to tell
> which is faster without seeing the schema, UPDATE statements and query
> plans chosen.
>
> --
> David Rowley                   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
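
For concreteness, here is a minimal sketch of the bulk-load-and-join approach discussed above. All table and column names (events, events_fix, id, events_x, increment) are assumptions for illustration only; the real schema isn't shown in this thread:

    BEGIN;

    -- Stage the ~3M per-row values (e.g. loaded with \copy from psql,
    -- or server-side COPY if the file lives on the database host).
    CREATE TEMP TABLE events_fix (
        id        bigint PRIMARY KEY,
        increment integer NOT NULL
    );
    \copy events_fix FROM 'events_fix.csv' WITH (FORMAT csv)
    ANALYZE events_fix;

    -- One join UPDATE: parsed and planned once, touches each target row
    -- once (~3% of the table, so roughly 3% churn as noted above).
    UPDATE events AS e
    SET    events_x = e.events_x + f.increment
    FROM   events_fix AS f
    WHERE  e.id = f.id;

    COMMIT;

    -- Outside the transaction block: refresh stats and reclaim dead tuples.
    VACUUM ANALYZE events;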
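
And if batching turns out to be preferable (for instance, to allow a VACUUM between batches), a hedged sketch of the per-batch variant. It assumes the same invented staging table as above, extended with a batch_no column assigned at load time, and that events(id) is indexed (e.g. as the primary key) so each batch is index-driven rather than a seq scan:

    -- Run once per batch (batch_no = 1 .. 393 for the ~7500-rows-per-batch
    -- split), committing between batches so VACUUM can run in between.
    BEGIN;
    UPDATE events AS e
    SET    events_x = e.events_x + f.increment
    FROM   events_fix AS f
    WHERE  e.id = f.id
    AND    f.batch_no = 1;
    COMMIT;

    VACUUM events;   -- optional, between batches

    -- Sanity-check the plan before running the real thing; it should show an
    -- index or bitmap scan on events, not a sequential scan repeated per batch.
    EXPLAIN
    UPDATE events AS e
    SET    events_x = e.events_x + f.increment
    FROM   events_fix AS f
    WHERE  e.id = f.id
    AND    f.batch_no = 1;

Which of the two forms comes out ahead is, as David says, something to benchmark offline against the real schema and the query plans it produces.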