Re: Performance on Bulk Insert to Partitioned Table

Charles Gomes <charlesrg@xxxxxxxxxxx> · Thu, 27 Dec 2012 11:16:26 -0500

________________________________
> Date: Wed, 26 Dec 2012 23:03:33 -0500 
> Subject: Re:  Performance on Bulk Insert to Partitioned Table 
> From: jeff.janes@xxxxxxxxx 
> To: charlesrg@xxxxxxxxxxx 
> CC: ondrej.ivanic@xxxxxxxxx; pgsql-performance@xxxxxxxxxxxxxx 
>  
> On Monday, December 24, 2012, Charles Gomes wrote: 
> ________________________________ 
>  
>  > 
>  > I think your performance bottleneck is almost certainly the dynamic 
>  > SQL.  Using C to generate that dynamic SQL isn't going to help much, 
>  > because it is still the SQL engine that has to parse, plan, and execute 
>  > it. 
>  > 
>  
> Jeff, I've changed the code from dynamic to: 
>  
> CREATE OR REPLACE FUNCTION quotes_insert_trigger() 
> RETURNS trigger AS $$ 
> DECLARE 
> r_date text; 
> BEGIN 
> r_date = to_char(new.received_time, 'YYYY_MM_DD'); 
> case r_date 
>      when '2012_09_10' then 
>          insert into quotes_2012_09_10 values (NEW.*) using new; 
>          return; 
> ... 
>  
>  
> However I've got no speed improvement. 
> I need to keep two weeks worth of partitions at a time, that's why all  
> the WHEN statements. 
>  
> The 'using new' and return without argument are syntax errors. 
>  
> When I do a model system with those fixed, I get about 2 fold  
> improvement over the dynamic SQL performance.  Even if your performance  
> did not go up, did your CPU usage go down?  Perhaps you have multiple  
> bottlenecks all sitting at about the same place, and so tackling any  
> one of them at a time doesn't get you anywhere. 

I’ve run a small test with the fixes you mentioned and it changed from
1H:20M to, 1H:30M to insert 396000000 rows.

If there was another bottleneck, performance when targeting the
partitions directly would not be twice as fast. I’ve run another long insert
test and it takes 4H:15M to complete using triggers to distribute the inserts.  When targeting It completes in 1H:55M. That’s
both for 70 simultaneous workers with the same data and 1188000000 rows.

The tests that Emmanuel did translating the trigger to C have great
performance improvement. While His code is very general and could work for
anyone using CHECK’s for triggers. I’m still working on fixing it so it’s
compatible with 9.2

So far I’m having a hard time using the C triggers anyway,:

ERROR:  could not load library
"/var/lib/pgsql/pg_trigger_example.so":
/var/lib/pgsql/pg_trigger_example.so: failed to map segment from shared object:
Operation not permitted

I will do more reading on it.

I think having it to work again can bring some value so more people can
be aware of the performance improvement using C instead of PLSQL.

>  
> How does both the dynamic and the CASE scale with the number of  
> threads?  I think you said you had something like 70 sessions, but only  
> 8 CPUs.  That probably will do bad things with contention, and I don't  
> see how using more connections than CPUs is going to help you here.  If  
> the CASE starts out faster in single thread but then flat lines and the  
> EXECUTE catches up, that suggests a different avenue of investigation  
> than they are always the same. 
>  

I didn’t see a significant change in CPU utilization, it seems to be a
bit less, but not that much, however IO is still idling.

>  
> Wish postgres could automate the partition process natively like the  
> other sql db. 
>  
> More automated would be nice (i.e. one operation to make both the check  
> constraints and the trigger, so they can't get out of sync), but would  
> not necessarily mean faster.  I don't know what you mean about other  
> db.  Last time I looked at partitioning in mysql, it was only about  
> breaking up the underlying storage into separate files (without regards  
> to contents of the rows), so that is the same as what postgres does  
> automatically.  And in Oracle, their partitioning seemed about the same  
> as postgres's as far as administrative tedium was concerned.  I'm not  
> familiar with how the MS product handles it, and maybe me experience  
> with the other two are out of date. 

The other free sql DB supports a more elaborated scheme, for example: 

CREATE
TABLE ti (id INT, amount DECIMAL(7,2), tr_date DATE)

    ENGINE=INNODB

    PARTITION BY HASH( MONTH(tr_date) )

    PARTITIONS 6;

It
also supports partitioning by RANGE, LIST or KEY.

The paid one uses a very similar style:CREATE TABLE dept (deptno NUMBER, deptname VARCHAR(32))
     PARTITION BY HASH(deptno) PARTITIONS 16;

Also:
CREATE TABLE sales
  ( prod_id       NUMBER(6)
  , cust_id       NUMBER
  , time_id       DATE
  , quantity_sold NUMBER(3)
  )
 PARTITION BY RANGE (time_id)
 ( PARTITION sales_q1_2006 VALUES LESS THAN (TO_DATE('01-APR-2006','dd-MON-yyyy'))
    TABLESPACE tsa
 , PARTITION sales_q2_2006 VALUES LESS THAN (TO_DATE('01-JUL-2006','dd-MON-yyyy'))
    TABLESPACE tsb
...

>  
> Cheers, 
>  
> Jeff 		 	   		  

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance