Re: COPY v. java performance comparison

Rob Sargent <robjsargent@xxxxxxxxx> · Wed, 02 Apr 2014 18:11:19 -0600



    On 04/02/2014 04:36 PM, Jeff Janes
      wrote:

    
          On Wed, Apr 2, 2014 at 12:37 PM, Rob
            Sargent <robjsargent@xxxxxxxxx>
            wrote:

            
               I'm playing
                    with various data models to compare performance and
                    practicalities and not sure if I should be surprised
                    by the numbers I'm getting. I hope this report isn't
                    too wishy-washy for reasoned comment.

                    
                    One model says a genotype is defined as follows:

                  
                         Table
                      "public.oldstyle"

                      +-------------+--------------+-----------+

                      |   Column    |     Type     | Modifiers |

                      +-------------+--------------+-----------+

                      | id          | uuid         | not null  |

                      | sample_name | text         | not null  |

                      | marker_name | text         | not null  |

                      | allele1     | character(1) |           |

                      | allele2     | character(1) |           |

                      +-------------+--------------+-----------+

                      (0. id is a Primary Key)

                      (1. Take what you will from the table name.)

                      (2. I hadn't thought of "char" type at this point)

                      (3. Ultimately the names would become ids, RI
                      included)

                      (4. We're loading 39 samples and ~950K markers)

                    
                I loaded 37M+ records using jOOQ
                    (batching every 1000 lines) in 12+ hours (800+
                    records/sec).  Then I tried COPY and killed that
                    after 11.25 hours when I realised that I had added
                    on non-unque index on the name fields after the
                    first load. By that point is was on line 28301887,
                    so ~0.75 done which implies it would have take
                    ~15hours to complete.

                
                Would the overhead of the index
                        likely explain this decrease in throughput?

                      
            Absolutely.
             
            
                        Impatience got the better of me and I killed the
                        second COPY.  This time it had done 54% of the
                        file in 6.75 hours, extrapolating to roughly 12
                        hours to do the whole thing.

                      
            Are you sure you actually dropped the indices?  (And
              the primary key?)
            
              
            I get about 375,000 lines per second with no indexes,
              triggers, constraints.
            

            perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++;
              print join "\t",
              "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c
              'truncate oldstyle; copy oldstyle from stdin;'

            
            (More if I truncate it in the same transaction as the
              copy)
            

            If you can't drop the pk constraint, can you at least
              generate the values in sort-order?
            

            Cheers,
            

            Jeff
          
        
    I
        restarted the java-based loading.  Still with pk in place,  (and
        actually generating the UUIDs etc) I got ~1.1M rows in one
        minute. And one "LOG:  checkpoints are occurring too frequently
        (24 seconds apart)" with checkpoint_segment now at 6 (v. default
        3). In  plotting the records v. time in that minute at least
        it's pretty linear.  The time per batch is relatively constant
        with a few hiccups. (.5 sec per 10K lines).  So I've improved
        things immensely but not entirely sure that the simple config
        change is the reason.  If this continued I would be done inside
        40 minutes.

        
        With copy interuptus I now get 2.9M records per minute so the
        load would take only 12 or so minutes. I did get four reports of
        too-frequent checkpoints 2 at 15 seconds 2 at 9 seconds.

        
        I'll need to let each on go to completion.

        
        If these numbers are at all accurate and realistic, I'm still
        impressed with jOOQ, though COPY is rightfully back to its
        proper place as fastest way to load.