Re: is single row update improved with function

> On Dec 31, 2017, at 4:31 PM, David Rowley <david.rowley@xxxxxxxxxxxxxxx> wrote:
> 
> On 1 January 2018 at 12:06, Rob Sargent <robjsargent@xxxxxxxxx> wrote:
>> I must update 3M of 100M records, with tuple-specific modifications.  I can generate the necessary sql, but I’m wondering whether files of simple update statements, each affecting a single row, are more effective than files of function calls doing the same updates given the necessary values, including where clause restrictions.  The query plan set by the first should be decent for the remainder.
>> 
>> Alternatively, would a bulk load into a table of replacement values and join info be the fastest way?
> 
> It's probably also worth thinking about this table's usage pattern. If
> this table is an otherwise static table, then you may wish to think
> about the little bit of bloat that doing the UPDATEs in a single
> transaction would cause.
> 
>> Either way I can break the updates into roughly 393 transactions (7500 rows affected per tx) or 8646 transactions (350 rows  per tx) if less is more in this world.
> 
> If you were to perform the UPDATEs in batches it would allow you to
> run a VACUUM between the UPDATEs. However, it might not be so
> important as 3 million rows in 100 million is just 3%, so assuming all
> your rows are the same size, then even doing this as a single
> transaction would only cause 3% churn on the table. Possibly some of
> the UPDATEs would reuse existing free space within the table, but if
> they don't then it would only mean an extra 3% bloat.
> 
> As for which is faster. It's most likely going to depend on the query
> plan for the UPDATE statements. If you need to perform 3 million seq
> scans on the table, by doing 3 million individual statements, that's
> likely not going to perform well. 3 million statements is likely not a
> good option in any case as it means parsing and planning 3 million
> UPDATE statements. Even your 393 statements might not be very good if
> each of those UPDATEs must perform a seq scan on the 100 million row
> table, but if each of those 393 statements can make use of an index to
> easily get those 7500 rows, then that might be a better option than
> doing the single UPDATE join method you mentioned.
> 
> It does sound like something you could take offline and benchmark if
> performance is that critical. It's not really possible for us to tell
> which is faster without seeing the schema, UPDATE statements and query
> plans chosen.
> 
Thank you, and understood.  Small clarification: the batches (393 or 8646) are not single update statements, but rather groupings of separate updates sharing common restriction values. I don’t have the hardware to replicate the problem.

The table being updated is described below.  Due to an embarrassing cut/paste error, I’ll be updating the three ‘events’ columns based on probandset_id, chrom, startbase, endbase (matching the unique constraint).  The first two fields (probandset_id, and probandset_id plus chrom) are the possible partitions, strictly from a file generation point of view.  The updates are unique and atomic and could be clumped into transactions arbitrarily. The function would have to take seven args: the four key columns plus the three replacement values (rough sketch below, after the sql).

Ultimate sql: update segment set events_less = i, events_equal = j, events_greater = k
		where probandset_id = id and chrom = ch and startbase = sb and endbase = eb;
/* I might check that an update is needed by requiring _less != i and _equal != j */
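If I go the function route, I’m imagining something along these lines (just a sketch, untested; the function name and parameter names are placeholders):

create or replace function update_segment_events(
    p_probandset_id uuid,
    p_chrom         integer,
    p_startbase     integer,
    p_endbase       integer,
    p_less          bigint,
    p_equal         bigint,
    p_greater       bigint)
returns void
language sql as $$
    -- single-row update keyed on the columns of the "useg" unique constraint
    update segment
       set events_less    = p_less,
           events_equal   = p_equal,
           events_greater = p_greater
     where probandset_id = p_probandset_id
       and chrom         = p_chrom
       and startbase     = p_startbase
       and endbase       = p_endbase;
$$;

Each batch file would then just be a BEGIN;, 350 or 7500 select update_segment_events(...) calls, and a COMMIT;.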

                    Table "segment"
     Column     |  Type   | Collation | Nullable | Default 
----------------+---------+-----------+----------+---------
 id             | uuid    |           | not null | 
 chrom          | integer |           | not null | 
 markerset_id   | uuid    |           | not null | 
 probandset_id  | uuid    |           | not null | 
 startbase      | integer |           | not null | 
 endbase        | integer |           | not null | 
 firstmarker    | integer |           | not null | 
 lastmarker     | integer |           | not null | 
 events_less    | bigint  |           | not null | 0
 events_equal   | bigint  |           | not null | 0
 events_greater | bigint  |           | not null | 0
Indexes:
    "segment_pkey" PRIMARY KEY, btree (id)
    "useg" UNIQUE CONSTRAINT, btree (probandset_id, chrom, startbase, endbase)
    "segment_markerset_id_chrom_firstmarker_idx" btree (markerset_id, chrom, firstmarker)
Foreign-key constraints:
    "segment_probandset_id_fkey" FOREIGN KEY (probandset_id) REFERENCES probandset(id)
Referenced by:
    TABLE "segmentset_member" CONSTRAINT "segmentset_member_segment_id_fkey" FOREIGN KEY (segment_id) REFERENCES segment(id)


> -- 
> David Rowley                   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services