On 12/10/09 3:29 PM, "Scott Carey" <scott@xxxxxxxxxxxxxxxxx> wrote:

> On 12/7/09 11:12 AM, "Ben Brehmer" <benbrehmer@xxxxxxxxx> wrote:
>
>> Thanks for the quick responses. I will respond to all questions in one
>> email:
>>
>> COPY command: Unfortunately I'm stuck with INSERTs due to the nature of
>> how this data was generated (Hadoop/MapReduce).
>
> If you have control over the MapReduce output, you can have it output
> result files in a format that COPY likes.
>
> If you don't have any control over that, it's more complicated. I use a
> final-pass, map-only Hadoop job to go over the output and insert into
> postgres directly from the job, using the
>
>   INSERT INTO <table> VALUES (val1, val2, ...), (val1, val2, ...), ...
>
> insert style from Java, with about 80 rows per insert statement and a
> single transaction for about a thousand of these. This was faster than
> batch inserts.

I should mention that the above is a bit off. There is an important
caveat: each of these individual tasks might run twice in Hadoop (only
one attempt will finish) because of speculative execution and retry on
error. To deal with this you can run each job inside a single
transaction, so that a failure rolls back cleanly, and you will likely
want to turn off speculative execution. Another option is to run only
one map job, with no reduce, for this sort of work, in order to ensure
duplicate data is not inserted.

We insert into a temp table named uniquely per chunk first (sometimes in
parallel). Then, while holding a postgres advisory lock, we do an
INSERT INTO <destination> SELECT * FROM <temp> type operation, which is
fast. (Rough sketches of all of this are at the bottom of this message,
below the quoted thread.)

>> On 07/12/2009 10:39 AM, Thom Brown wrote:
>>>
>>> 2009/12/7 Kevin Grittner <Kevin.Grittner@xxxxxxxxxxxx>
>>>
>>>> Ben Brehmer <benbrehmer@xxxxxxxxx> wrote:
>>>>
>>>>> -7.5 GB memory
>>>>> -4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units
>>>>>  each)
>>>>> -64-bit platform
>>>>
>>>> What OS?
>>>>
>>>>> (PostgreSQL 8.1.3)
>>>>
>>>> Why use such an antiquated, buggy version? Newer versions are
>>>> faster.
>>>>
>>>> -Kevin
>>>
>>> I'd agree with trying to use the latest version you can.
>>>
>>> How are you loading this data? I'd make sure you haven't got any
>>> indices, primary keys, triggers or constraints on your tables before
>>> you begin the initial load; just add them after. Also use either the
>>> COPY command for loading, or prepared statements. Individual insert
>>> commands will just take way too long.
>>>
>>> Regards
>>>
>>> Thom
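To make the multi-row INSERT pattern concrete, here is a minimal sketch
of the Java side. The table, columns, connection details, and row source
are all invented for illustration, and real code handling strings should
use a PreparedStatement or proper escaping; the sketch sidesteps that by
inserting integers only:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

public class MultiRowInsert {
    static final int ROWS_PER_STATEMENT = 80;      // ~80 rows per INSERT
    static final int STATEMENTS_PER_COMMIT = 1000; // ~1000 statements per txn

    public static void load(List<int[]> rows) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");
        conn.setAutoCommit(false);  // group statements into one transaction
        Statement stmt = conn.createStatement();
        int statements = 0;
        for (int i = 0; i < rows.size(); i += ROWS_PER_STATEMENT) {
            StringBuilder sql =
                new StringBuilder("INSERT INTO dest_table (a, b) VALUES ");
            int end = Math.min(i + ROWS_PER_STATEMENT, rows.size());
            for (int j = i; j < end; j++) {
                if (j > i) sql.append(", ");
                sql.append('(').append(rows.get(j)[0]).append(", ")
                   .append(rows.get(j)[1]).append(')');
            }
            stmt.executeUpdate(sql.toString());
            if (++statements % STATEMENTS_PER_COMMIT == 0) {
                conn.commit();      // one transaction per ~1000 statements
            }
        }
        conn.commit();              // commit the remainder
        stmt.close();
        conn.close();
    }
}

Turning off autocommit plus the periodic commit() is what gives the
one-transaction-per-thousand-statements behavior described above; commit
once at the very end instead if you want the whole task to be all-or-nothing.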
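Turning off speculative execution looks roughly like this with the
0.20-era mapred API. The property names are what I would expect for that
API, but treat them as an assumption and check them against your Hadoop
version:

import org.apache.hadoop.mapred.JobConf;

public class NoSpeculation {
    public static JobConf configure(JobConf conf) {
        // Disable speculative execution so a duplicate task attempt never
        // issues the same inserts twice. Property names assume the
        // 0.20-era mapred API; verify against your Hadoop version.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return conf;
    }
}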
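And the final merge step, assuming each task has already loaded its rows
into its own uniquely named chunk table. The table names and lock key are
invented; note that pg_advisory_lock()/pg_advisory_unlock() are built in
as of PostgreSQL 8.2:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MergeChunk {
    static final long LOCK_KEY = 421L;  // arbitrary application-chosen key

    // stagingTable is the uniquely named per-chunk table this task loaded.
    public static void merge(String stagingTable) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "password");
        Statement stmt = conn.createStatement();
        // Serialize the append: only one task writes to the destination
        // at a time (pg_advisory_lock blocks until the lock is free).
        stmt.execute("SELECT pg_advisory_lock(" + LOCK_KEY + ")");
        try {
            stmt.executeUpdate(
                "INSERT INTO dest_table SELECT * FROM " + stagingTable);
        } finally {
            stmt.execute("SELECT pg_advisory_unlock(" + LOCK_KEY + ")");
        }
        stmt.close();
        conn.close();
    }
}

The append itself is a single INSERT ... SELECT, which is why it is fast
relative to re-inserting row by row; the advisory lock just keeps parallel
chunk loaders from contending on the destination table.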