Re: Insert into on conflict, data size upto 3 billion records

Ron <ronljohnsonjr@xxxxxxxxx> · Sat, 13 Feb 2021 14:04:17 -0600



    On 2/12/21 12:46 PM, Karthik Kumar Kondamudi wrote:

    
      Hi, 
        

        I'm
            looking for suggestions on how I can improve the performance
            of the below merge statement, we have a batch process that
            batch load the data into the _batch tables using Postgres
            and the task is to update the main target tables if the
            record exists else into it, sometime these batch table could
            go up to 5 billion records. Here is the current scenario

          
          target_table_main has
              700,070,247  records and is hash partitioned into 50
              chunks, it has an index on logical_ts and
              the batch table has 2,715,020,546 close to 3 billion
              records, so I'm dealing with a huge set of data so looking
              of doing this in the most efficient way.
        
      
    Many times, I have drastically sped up batch processing by #1
    partitioning on the same field as an index, and #2 pre-sorting the
    input data by that field.

    
    That way, you get excellent "locality of data" (meaning lots of
    writes to the same hot bits of cache, which later get asynchronously
    flushed to disk).  Unfortunately for your situation, the purpose of
    hash partitioning is to reduce locality of data.  (Sometimes
    that's useful, but not when processing batches.)

    
    -- 

      Angular momentum makes the world go 'round.