Re: is single row update improved with function

> On Dec 31, 2017, at 4:31 PM, David Rowley <david.rowley@xxxxxxxxxxxxxxx> wrote:
> 
> On 1 January 2018 at 12:06, Rob Sargent <robjsargent@xxxxxxxxx> wrote:
>> I must update 3M of 100M records, with tuple-specific modifications.  I can generate the necessary SQL, but I’m wondering whether files of simple UPDATE statements, each affecting a single row, would be more effective than files of calls to a function doing the same update given the necessary values, including the WHERE clause restrictions.  The query plan set by the first call should be decent for the remainder.
>> 
>> Alternatively, would a bulk load into a table of replacement values and join info be the fastest way?
> 
> It's probably also worth thinking about this table's usage pattern. If
> this table is an otherwise static table, then you may wish to think
> about the little bit of bloat that doing the UPDATEs in a single
> transaction would cause.
> 
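For concreteness, the two per-row variants I was weighing would look roughly like this (table, column, and function names are placeholders, not the real schema):

    -- a file full of plain single-row statements
    UPDATE events
       SET events_x = 42
     WHERE person_id = 123 AND marker_id = 456;

    -- versus the same update wrapped in a function, called once per row;
    -- plpgsql caches the plan for the UPDATE after the first call
    CREATE OR REPLACE FUNCTION apply_event_update(p_person bigint,
                                                  p_marker bigint,
                                                  p_x      integer)
    RETURNS void LANGUAGE plpgsql AS $$
    BEGIN
        UPDATE events
           SET events_x = p_x
         WHERE person_id = p_person AND marker_id = p_marker;
    END;
    $$;

    SELECT apply_event_update(123, 456, 42);
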
Sorry, I didn’t address the question about the table's usage.  Currently we’re in a data-loading phase and this table is almost complete.  Thereafter (post vacuum analyze) it will be 99.99% read-only.  The remainder will be updated in much the same fashion as described earlier (set events_x = events_x + increment).

This table will be analyzed a couple of ways, mainly to determine significance thresholds across various slices, each of which examines roughly one tenth of the records.

>> Either way I can break the updates into roughly 393 transactions (7500 rows affected per tx) or 8646 transactions (350 rows per tx), if less is more in this world.
> 
> If you were to perform the UPDATEs in batches it would allow you to
> run a VACUUM between the UPDATEs. However, it might not be so
> important as 3 million rows in 100 million is just 3%, so assuming all
> your rows are the same size, then even doing this as a single
> transaction would only cause 3% churn on the table. Possibly some of
> the UPDATEs would reuse existing free space within the table, but if
> they don't then it would only mean an extra 3% bloat.
> 
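If I go the batch route, a minimal sketch of one batch, assuming the replacement values get bulk-loaded into a staging table with a pre-assigned batch number (all names illustrative), would be:

    BEGIN;
    UPDATE events e
       SET events_x = s.new_events_x
      FROM staging s
     WHERE s.batch_no = 1                 -- 1..393, ~7500 rows each
       AND e.person_id = s.person_id
       AND e.marker_id = s.marker_id;
    COMMIT;

    VACUUM events;   -- between batches, outside the transaction, so freed space can be reused
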
> As for which is faster. It's most likely going to depend on the query
> plan for the UPDATE statements. If you need to perform 3 million seq
> scans on the table, by doing 3 million individual statements, that's
> likely not going to perform well. 3 million statements is likely not a
> good option in any case as it means parsing and planning 3 million
> UPDATE statements. Even your 393 statements might not be very good if
> each of those UPDATEs must perform a seq scan on the 100 million row
> table, but if each of those 393 statements can make use of an index to
> easily get those 7500 rows, then that might be a better option than
> doing the single UPDATE join method you mentioned.
> 
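The single-statement join method would be roughly this, with the replacement values bulk-loaded via COPY first (again, the names are made up):

    CREATE TABLE staging (person_id bigint,
                          marker_id bigint,
                          new_events_x integer);

    \copy staging FROM 'replacements.csv' WITH (FORMAT csv)

    ANALYZE staging;

    UPDATE events e
       SET events_x = s.new_events_x
      FROM staging s
     WHERE e.person_id = s.person_id
       AND e.marker_id = s.marker_id;

Since only ~3% of the rows change, I'd expect the planner to hash-join the staging table against a single pass over events, which avoids both per-statement planning and repeated scans — but that's exactly the sort of thing the plans would show.
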
> It does sound like something you could take offline and benchmark if
> performance is that critical. It's not really possible for us to tell
> which is faster without seeing the schema, UPDATE statements and query
> plans chosen.
> 
> -- 
> David Rowley                   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
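
Agreed on benchmarking.  For what it's worth, wrapping a representative batch in a rolled-back transaction should let me compare plans without keeping the changes (same placeholder names as above):

    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS)
    UPDATE events e
       SET events_x = s.new_events_x
      FROM staging s
     WHERE s.batch_no = 1
       AND e.person_id = s.person_id
       AND e.marker_id = s.marker_id;
    ROLLBACK;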