Re: Speeding up query, Joining 55mil and 43mil records.

nicky <nicky@xxxxxxxxxxxx> · Thu, 22 Jun 2006 11:48:45 +0200

Hello again,

thanks for all the quick replies.

It seems i wasn't entirely correct on my previous post, i've mixed up 
some times/numbers.

Below the correct numbers

MSSQL:      SELECT COUNT(*) from JOIN (without insert)   17 minutes
PostgreSQL: SELECT COUNT(*) from JOIN (without insert)   33 minutes
PostgreSQL: complete query                               55 minutes

The part i'm really troubled with is the difference in performance for 
the select part. Which takes twice as long on PostgreSQL even though it 
has a better server then MSSQL.

Changed i've made to postgressql.conf

work_mem = 524288 (1GB, results in out of memory error)
checkpoints_segments = 256
checkpoints_timeout = 3600
checkpoints_warning = 0

I've ran the complete 'explain analyse query' twice. First with 
pgsql_tmp on the same disk, then again with pgsql_tmp on a seperate disk.

**** (PostgreSQL) (*pgsql_tmp on same disk*):

Hash Join  (cost=1328360.12..6167462.76 rows=7197568 width=118) (actual 
time=327982.425..1903423.769 rows=7551616 loops=1)
 Hash Cond: (("outer".id)::text = ("inner".id)::text)
 ->  Seq Scan on src_faktuur_verrsec t0  (cost=0.00..2773789.90 
rows=40902852 width=52) (actual time=8.935..613455.204 rows=37368390 
loops=1)
       Filter: ((substr((code)::text, 1, 2) <> '14'::text) AND 
(substr((code)::text, 1, 2) <> '15'::text) AND (substr((code)::text, 1, 
2) <> '16'::text) AND (substr((code)::text, 1, 2) <> '17'::text) AND 
((substr((correctie)::text, 4, 1) <> '1'::text) OR (correctie IS NULL)))
 ->  Hash  (cost=1188102.97..1188102.97 rows=8942863 width=80) (actual 
time=327819.698..327819.698 rows=8761024 loops=1)
       ->  Bitmap Heap Scan on src_faktuur_verricht t1  
(cost=62392.02..1188102.97 rows=8942863 width=80) (actual 
time=75911.336..295510.647 rows=8761024 loops=1)
             Recheck Cond: (date_part('year'::text, datum) > 
2004::double precision)
             ->  Bitmap Index Scan on src_faktuur_verricht_idx1  
(cost=0.00..62392.02 rows=8942863 width=0) (actual 
time=75082.080..75082.080 rows=8761024 loops=1)
                   Index Cond: (date_part('year'::text, datum) > 
2004::double precision)
Total runtime: 3355696.015 ms

**** (PostgreSQL) (*pgsql_tmp on seperate disk*)

Hash Join  (cost=1328360.12..6167462.76 rows=7197568 width=118) (actual 
time=172797.736..919869.708 rows=7551616 loops=1)
 Hash Cond: (("outer".id)::text = ("inner".id)::text)
 ->  Seq Scan on src_faktuur_verrsec t0  (cost=0.00..2773789.90 
rows=40902852 width=52) (actual time=0.015..362154.822 rows=37368390 
loops=1)
       Filter: ((substr((code)::text, 1, 2) <> '14'::text) AND 
(substr((code)::text, 1, 2) <> '15'::text) AND (substr((code)::text, 1, 
2) <> '16'::text) AND (substr((code)::text, 1, 2) <> '17'::text) AND 
((substr((correctie)::text, 4, 1) <> '1'::text) OR (correctie IS NULL)))
 ->  Hash  (cost=1188102.97..1188102.97 rows=8942863 width=80) (actual 
time=172759.255..172759.255 rows=8761024 loops=1)
       ->  Bitmap Heap Scan on src_faktuur_verricht t1  
(cost=62392.02..1188102.97 rows=8942863 width=80) (actual 
time=4244.840..142144.606 rows=8761024 loops=1)
             Recheck Cond: (date_part('year'::text, datum) > 
2004::double precision)
             ->  Bitmap Index Scan on src_faktuur_verricht_idx1  
(cost=0.00..62392.02 rows=8942863 width=0) (actual 
time=3431.361..3431.361 rows=8761024 loops=1)
                   Index Cond: (date_part('year'::text, datum) > 
2004::double precision)
Total runtime: 2608316.714 ms

A lot of difference in performance. 55 minutes to 42 minutes.

I've ran the 'select count(*) from JOIN' to see the difference on that 
part.

**** (PostgreSQL) Explain analyse from SELECT COUNT(*) from the JOIN. 
(*pgsql_tmp on seperate disk*)

Aggregate  (cost=5632244.93..5632244.94 rows=1 width=0) (actual 
time=631993.425..631993.427 rows=1 loops=1)
 ->  Hash Join  (cost=1258493.12..5614251.00 rows=7197568 width=0) 
(actual time=237999.277..620018.706 rows=7551616 loops=1)
       Hash Cond: (("outer".id)::text = ("inner".id)::text)
       ->  Seq Scan on src_faktuur_verrsec t0  (cost=0.00..2773789.90 
rows=40902852 width=14) (actual time=23.449..200532.422 rows=37368390 
loops=1)
             Filter: ((substr((code)::text, 1, 2) <> '14'::text) AND 
(substr((code)::text, 1, 2) <> '15'::text) AND (substr((code)::text, 1, 
2) <> '16'::text) AND (substr((code)::text, 1, 2) <> '17'::text) AND 
((substr((correctie)::text, 4, 1) <> '1'::tex (..)
       ->  Hash  (cost=1188102.97..1188102.97 rows=8942863 width=14) 
(actual time=237939.262..237939.262 rows=8761024 loops=1)
             ->  Bitmap Heap Scan on src_faktuur_verricht t1  
(cost=62392.02..1188102.97 rows=8942863 width=14) (actual 
time=74713.092..216206.478 rows=8761024 loops=1)
                   Recheck Cond: (date_part('year'::text, datum) > 
2004::double precision)
                   ->  Bitmap Index Scan on src_faktuur_verricht_idx1  
(cost=0.00..62392.02 rows=8942863 width=0) (actual 
time=73892.153..73892.153 rows=8761024 loops=1)
                         Index Cond: (date_part('year'::text, datum) > 
2004::double precision)
Total runtime: 631994.172 ms

A lot of improvement also in the select count: 33 minutes vs 10 minutes.

To us, the speeds are good. Very happy with the performance increase on 
that select with join, since 90% of the queries are SELECT based.

The query results in 7551616 records, so that's about 4500 inserts per 
second. I'm not sure if that is fast or not. Any further tips would be 
welcome.

Thanks everyone.
Nicky