Re: Hash join on int takes 8..114 seconds

PFC <lists@xxxxxxxxxx> · Fri, 21 Nov 2008 19:31:42 +0100

Server has 2 GB RAM.
It has SATA RAID 0,1 integrated controller (1.5Gbps) and SAMSUNG HD160JJ
mirrored disks.

	You could perhaps run a little check on the performance of the RAID, is  
it better than linux software RAID ?
	Does it leverage NCQ appropriately when running queries in parallel ?

 -- Receipt headers:
DOK ( dokumnr  INT SERIAL PRIMARY KEY,
 kuupaev DATE --- sales date
)
 -- Receipt details
RID ( dokumnr INT,
        toode CHAR(20),  -- item code
CONSTRAINT rid_dokumnr_fkey FOREIGN KEY (dokumnr)   REFERENCES dok
(dokumnr),
 CONSTRAINT rid_toode_fkey FOREIGN KEY (toode)
     REFERENCES firma2.toode (toode)
)
 -- Products
TOODE (
 toode CHAR(20) PRIMARY KEY
)

	OK so pretty straightforward :

	dok <-(dokumnr)-> rid <-(toode)-> toode

	toode.toode should really be an INT though.

explain analyze
SELECT sum(1)
  FROM dok JOIN rid USING (dokumnr)
  JOIN toode USING (toode)
  LEFT JOIN artliik using(grupp,liik)
  WHERE rid.toode='X05' AND dok.kuupaev>='2008-09-01'

	By the way, note that the presence of the toode table in the query above  
is not required at all, unless you use columns of toode in your aggregates.

	Let's play with that, after all, it's friday night.

BEGIN;
CREATE TABLE orders (order_id INTEGER NOT NULL, order_date DATE NOT NULL);
CREATE TABLE products (product_id INTEGER NOT NULL, product_name TEXT NOT  
NULL);
CREATE TABLE orders_products (order_id INTEGER NOT NULL, product_id  
INTEGER NOT NULL, padding1 TEXT, padding2 TEXT);

INSERT INTO products SELECT n, 'product number ' || n::TEXT FROM  
generate_series(1,40000) AS n;
INSERT INTO orders SELECT n,'2000-01-01'::date + (n/1000 * '1  
DAY'::interval) FROM generate_series(1,1000000) AS n;

SET work_mem TO '1GB';
INSERT INTO orders_products SELECT  
a,b,'aibaifbaurgbyioubyfazierugybfoaybofauez',  
'hfohbdsqbhjhqsvdfiuazvfgiurvgazrhbazboifhaoifh'
  FROM (SELECT DISTINCT (1+(n/10))::INTEGER AS a,  
(1+(random()*39999))::INTEGER AS b FROM generate_series( 1,9999999 ) AS n)  
AS x;

DELETE FROM orders_products WHERE product_id NOT IN (SELECT product_id  
FROM products);
DELETE FROM orders_products WHERE order_id NOT IN (SELECT order_id FROM  
orders);
ALTER TABLE orders ADD PRIMARY KEY (order_id);
ALTER TABLE products ADD PRIMARY KEY (product_id);
ALTER TABLE orders_products ADD PRIMARY KEY (order_id,product_id);
ALTER TABLE orders_products ADD FOREIGN KEY (product_id) REFERENCES  
products( product_id ) ON DELETE CASCADE;
ALTER TABLE orders_products ADD FOREIGN KEY (order_id) REFERENCES orders(  
order_id ) ON DELETE CASCADE;
CREATE INDEX orders_date ON orders( order_date );
COMMIT;
SET work_mem TO DEFAULT;
ANALYZE;

With the following query :

EXPLAIN ANALYZE SELECT sum(1)
FROM orders
JOIN orders_products USING (order_id)
JOIN products USING (product_id)
WHERE orders.order_date BETWEEN '2000-01-01' AND '2000-02-01'
AND products.product_id = 12345;

I get the following results :

orders_products has a PK index on (order_id, product_id). I dropped it.

No index on orders_products :
	=> Big seq scan (16 seconds)

Index on orders_products( product_id ) :
 Aggregate  (cost=2227.22..2227.23 rows=1 width=0) (actual  
time=108.204..108.205 rows=1 loops=1)
   ->  Nested Loop  (cost=1312.30..2227.20 rows=7 width=0) (actual  
time=105.929..108.191 rows=6 loops=1)
         ->  Index Scan using products_pkey on products  (cost=0.00..8.27  
rows=1 width=4) (actual time=0.010..0.014 rows=1 loops=1)
               Index Cond: (product_id = 12345)
         ->  Hash Join  (cost=1312.30..2218.85 rows=7 width=4) (actual  
time=105.914..108.167 rows=6 loops=1)
               Hash Cond: (orders_products.order_id = orders.order_id)
               ->  Bitmap Heap Scan on orders_products  (cost=6.93..910.80  
rows=232 width=8) (actual time=0.194..2.175 rows=246 loops=1)
                     Recheck Cond: (product_id = 12345)
                     ->  Bitmap Index Scan on orders_products_product_id   
(cost=0.00..6.87 rows=232 width=0) (actual time=0.129..0.129 rows=246  
loops=1)
                           Index Cond: (product_id = 12345)
               ->  Hash  (cost=949.98..949.98 rows=28432 width=4) (actual  
time=105.696..105.696 rows=31999 loops=1)
                     ->  Index Scan using orders_date on orders   
(cost=0.00..949.98 rows=28432 width=4) (actual time=0.059..64.443  
rows=31999 loops=1)
                           Index Cond: ((order_date >= '2000-01-01'::date)  
AND (order_date <= '2000-02-01'::date))
 Total runtime: 108.357 ms
(don't trust this timing, it's a bit cached, this is the same plan as you  
get)

Index on orders_products( product_id ) and orders_products( order_id ):
	=> Same plan

	Note that in this case, a smarter planner would use the new index to  
perform a BitmapAnd before hitting the heap to get the rows.

Index on ( order_id, product_id ), orders_products( product_id ):
Index on ( order_id, product_id ):
	=> Different plan, slower (especially in second case).

If a "order_date" column is added to the "orders_products" table to make  
it into some kind of materialized view :

CREATE TABLE orders_products2 AS SELECT orders.order_id,  
orders.order_date, product_id FROM orders JOIN orders_products USING  
(order_id);

And an index is created on (product_id, order_date) we get this :

 Aggregate  (cost=100.44..100.45 rows=1 width=0) (actual time=0.176..0.177  
rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..100.42 rows=7 width=0) (actual  
time=0.083..0.168 rows=6 loops=1)
         ->  Index Scan using products_pkey on products  (cost=0.00..8.27  
rows=1 width=4) (actual time=0.012..0.013 rows=1 loops=1)
               Index Cond: (product_id = 12345)
         ->  Nested Loop  (cost=0.00..92.08 rows=7 width=4) (actual  
time=0.068..0.147 rows=6 loops=1)
               ->  Index Scan using orders_products2_pid_date on  
orders_products2  (cost=0.00..33.50 rows=7 width=8) (actual  
time=0.053..0.076 rows=6 loops=1)
                     Index Cond: ((product_id = 12345) AND (order_date >=  
'2000-01-01'::date) AND (order_date <= '2000-02-01'::date))
               ->  Index Scan using orders_pkey on orders   
(cost=0.00..8.36 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=6)
                     Index Cond: (orders.order_id =  
orders_products2.order_id)
 Total runtime: 0.246 ms

An index on (order_date,product_id) produces the same effect ; the index  
scan is slower, but the heap scan uses the same amount of IO.

Two indexes, (order_date) and (product_id), strangely, do not produce a  
BitmapAnd ; instead a plan with more IO is chosen.

- denormalization (ie adding a column in one of your tables and a
multicolumn index)

For this query it is possible to duplicate kuupaev column to rid table.
However most of the this seems to go to scanning rid table, so I suspect
that this will help.

	Yes, most of the time goes to scanning rid table, and this is the time  
that should be reduced.
	Adding a date column in "rid" would allow you to create a multicolumn  
index on rid (dokumnr,date) which would massively speed up the particular  
query above.
	If you don't create a multicolumn index, this denormalization is useless.

	Basically instead of scanning all rows in "rid" where

- materialized views
- materialized summary tables (ie. summary of sales for last month, for
instance)

There are about 1000 items and reports are different.

	It all depends on what you put in your summary table...

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance