I'm playing with various data models to compare their performance and practicality, and I'm not sure whether I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.

One model defines a genotype as follows:

Table "public.oldstyle"

I loaded 37M+ records into it using jOOQ (batching every 1000 rows) in 12+ hours, i.e. 800+ records/sec.

Then I tried COPY, and killed it after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28,301,887, so ~75% done, which implies the full load would have taken ~15 hours. Would the overhead of maintaining that index likely explain the decrease in throughput?

Impatience got the better of me and I killed the second COPY as well. This time it had done 54% of the file in 6.75 hours, which extrapolates to roughly 12.5 hours for the whole thing. That matches up with the Java speed. I'm not sure whether to be elated with jOOQ or disappointed with COPY. (Rough sketches of both load paths are at the end of this mail.)

Btw, I can load roughly the same data into the model below in 10.5 seconds; it comes to only 39 very wide rows. I haven't got to the practicality bits yet :)

Table "public.chipcall"

This is just a dev desktop environment:

RHEL 6.5
PostgreSQL 9.3.1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4 [hm, interesting contradiction]), 64-bit
2 quad-core CPUs @ 2925.878 MHz
The input file and the postgres data directory are on separate disks, but share one controller.

Thanks in advance, even if you only read this far.
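P.S. For anyone checking my extrapolations, they're just linear scaling of the observed progress (37,000,000 is my round figure for the row count):

    COPY #1: 28,301,887 / 37,000,000 ≈ 0.76 done in 11.25 h  ->  11.25 / 0.76 ≈ 15 h total
    COPY #2: 0.54 done in 6.75 h                             ->  6.75 / 0.54 ≈ 12.5 h total
    jOOQ:    37,000,000 rows / (12 h * 3600 s/h)             ≈  ~860 rows/s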
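The jOOQ load loop looked roughly like the sketch below. Treat it as a minimal reconstruction, not my exact code: the connection string, file name, and the oldstyle column names are placeholders, since the actual table definitions are trimmed from this mail.

    import org.jooq.DSLContext;
    import org.jooq.Query;
    import org.jooq.SQLDialect;
    import org.jooq.impl.DSL;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.ArrayList;
    import java.util.List;

    public class OldstyleLoad {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/testdb", "me", "secret");
                 BufferedReader in = new BufferedReader(new FileReader("genotypes.tsv"))) {

                DSLContext dsl = DSL.using(conn, SQLDialect.POSTGRES);
                List<Query> batch = new ArrayList<>();

                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t");
                    // Placeholder column names -- the real oldstyle schema is trimmed above.
                    batch.add(dsl.insertInto(DSL.table("oldstyle"),
                                             DSL.field("sample_name"),
                                             DSL.field("marker_name"),
                                             DSL.field("genotype"))
                                 .values(f[0], f[1], f[2]));

                    if (batch.size() == 1000) {     // flush every 1000 rows
                        dsl.batch(batch).execute();
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    dsl.batch(batch).execute();     // final partial batch
                }
            }
        }
    }

Each dsl.batch(...).execute() round trip carries 1000 single-row INSERTs.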
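For the COPY runs, the equivalent driven from Java via pgjdbc's CopyManager would look something like this (assuming the input file is already in COPY's default tab-delimited text format; again, all names are placeholders):

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class OldstyleCopy {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/testdb", "me", "secret");
                 BufferedReader in = new BufferedReader(new FileReader("genotypes.tsv"))) {

                CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
                // Streams the whole file through one COPY protocol exchange.
                long rows = copy.copyIn("COPY oldstyle FROM STDIN", in);
                System.out.println("loaded " + rows + " rows");
            }
        }
    }

Unlike the batched INSERTs, copyIn pushes everything through a single COPY command, so the per-row overhead is essentially all on the server side.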