Re: Hash join on int takes 8..114 seconds

Server has 2 GB RAM.
It has SATA RAID 0,1 integrated controller (1.5Gbps) and SAMSUNG HD160JJ
mirrored disks.

You could perhaps run a quick check on the performance of that RAID: is it better than Linux software RAID?
	Does it leverage NCQ appropriately when running queries in parallel?

-- Receipt headers:
DOK (
  dokumnr SERIAL PRIMARY KEY,
  kuupaev DATE  -- sales date
)

-- Receipt details:
RID (
  dokumnr INT,
  toode   CHAR(20),  -- item code
  CONSTRAINT rid_dokumnr_fkey FOREIGN KEY (dokumnr) REFERENCES dok (dokumnr),
  CONSTRAINT rid_toode_fkey   FOREIGN KEY (toode)   REFERENCES firma2.toode (toode)
)

-- Products:
TOODE (
  toode CHAR(20) PRIMARY KEY
)

	OK so pretty straightforward :

	dok <-(dokumnr)-> rid <-(toode)-> toode

	toode.toode should really be an INT though.
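	If you ever want to change that, a rough sketch could look like this (the column, constraint and index names below are made up for illustration, not taken from your schema):

-- illustrative only: add a surrogate integer key and propagate it to rid
ALTER TABLE firma2.toode ADD COLUMN toode_id SERIAL UNIQUE;
ALTER TABLE rid ADD COLUMN toode_id INT;
UPDATE rid SET toode_id = t.toode_id FROM firma2.toode t WHERE t.toode = rid.toode;
ALTER TABLE rid ADD CONSTRAINT rid_toode_id_fkey
      FOREIGN KEY (toode_id) REFERENCES firma2.toode (toode_id);
CREATE INDEX rid_toode_id ON rid (toode_id);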

explain analyze
SELECT sum(1)
  FROM dok JOIN rid USING (dokumnr)
  JOIN toode USING (toode)
  LEFT JOIN artliik using(grupp,liik)
  WHERE rid.toode='X05' AND dok.kuupaev>='2008-09-01'

By the way, note that the presence of the toode table in the query above is not required at all, unless you use columns of toode in your aggregates.
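For instance, assuming the artliik left join (which presumably pulls grupp and liik from toode) is not actually needed for this count, a sketch of the trimmed query would be:

SELECT sum(1)
  FROM dok
  JOIN rid USING (dokumnr)
 WHERE rid.toode = 'X05'
   AND dok.kuupaev >= '2008-09-01';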

	Let's play with that; after all, it's Friday night.

BEGIN;
CREATE TABLE orders (order_id INTEGER NOT NULL, order_date DATE NOT NULL);
CREATE TABLE products (product_id INTEGER NOT NULL, product_name TEXT NOT NULL);
CREATE TABLE orders_products (order_id INTEGER NOT NULL, product_id INTEGER NOT NULL, padding1 TEXT, padding2 TEXT);

INSERT INTO products SELECT n, 'product number ' || n::TEXT FROM generate_series(1,40000) AS n;
INSERT INTO orders SELECT n, '2000-01-01'::date + (n/1000 * '1 DAY'::interval) FROM generate_series(1,1000000) AS n;

SET work_mem TO '1GB';
INSERT INTO orders_products
  SELECT a, b,
         'aibaifbaurgbyioubyfazierugybfoaybofauez',
         'hfohbdsqbhjhqsvdfiuazvfgiurvgazrhbazboifhaoifh'
    FROM (SELECT DISTINCT (1+(n/10))::INTEGER AS a, (1+(random()*39999))::INTEGER AS b
            FROM generate_series(1,9999999) AS n) AS x;

DELETE FROM orders_products WHERE product_id NOT IN (SELECT product_id FROM products);
DELETE FROM orders_products WHERE order_id NOT IN (SELECT order_id FROM orders);
ALTER TABLE orders ADD PRIMARY KEY (order_id);
ALTER TABLE products ADD PRIMARY KEY (product_id);
ALTER TABLE orders_products ADD PRIMARY KEY (order_id,product_id);
ALTER TABLE orders_products ADD FOREIGN KEY (product_id) REFERENCES products (product_id) ON DELETE CASCADE;
ALTER TABLE orders_products ADD FOREIGN KEY (order_id) REFERENCES orders (order_id) ON DELETE CASCADE;
CREATE INDEX orders_date ON orders( order_date );
COMMIT;
SET work_mem TO DEFAULT;
ANALYZE;

With the following query :

EXPLAIN ANALYZE SELECT sum(1)
FROM orders
JOIN orders_products USING (order_id)
JOIN products USING (product_id)
WHERE orders.order_date BETWEEN '2000-01-01' AND '2000-02-01'
AND products.product_id = 12345;

I get the following results :

orders_products has a PK index on (order_id, product_id). I dropped it.

No index on orders_products :
	=> Big seq scan (16 seconds)

Index on orders_products( product_id ) :
Aggregate  (cost=2227.22..2227.23 rows=1 width=0) (actual time=108.204..108.205 rows=1 loops=1)
  ->  Nested Loop  (cost=1312.30..2227.20 rows=7 width=0) (actual time=105.929..108.191 rows=6 loops=1)
        ->  Index Scan using products_pkey on products  (cost=0.00..8.27 rows=1 width=4) (actual time=0.010..0.014 rows=1 loops=1)
              Index Cond: (product_id = 12345)
        ->  Hash Join  (cost=1312.30..2218.85 rows=7 width=4) (actual time=105.914..108.167 rows=6 loops=1)
              Hash Cond: (orders_products.order_id = orders.order_id)
              ->  Bitmap Heap Scan on orders_products  (cost=6.93..910.80 rows=232 width=8) (actual time=0.194..2.175 rows=246 loops=1)
                    Recheck Cond: (product_id = 12345)
                    ->  Bitmap Index Scan on orders_products_product_id  (cost=0.00..6.87 rows=232 width=0) (actual time=0.129..0.129 rows=246 loops=1)
                          Index Cond: (product_id = 12345)
              ->  Hash  (cost=949.98..949.98 rows=28432 width=4) (actual time=105.696..105.696 rows=31999 loops=1)
                    ->  Index Scan using orders_date on orders  (cost=0.00..949.98 rows=28432 width=4) (actual time=0.059..64.443 rows=31999 loops=1)
                          Index Cond: ((order_date >= '2000-01-01'::date) AND (order_date <= '2000-02-01'::date))
 Total runtime: 108.357 ms
(don't trust this timing, it's a bit cached; but this is the same plan as the one you get)
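For reference, the index used for this run was created with something along these lines (the name is the one that appears in the plan above):

CREATE INDEX orders_products_product_id ON orders_products (product_id);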

Index on orders_products( product_id ) and orders_products( order_id ):
	=> Same plan

Note that in this case, a smarter planner would use the new index to perform a BitmapAnd before hitting the heap to get the rows.

Index on orders_products( order_id, product_id ) plus orders_products( product_id ):
Index on orders_products( order_id, product_id ) alone:
	=> Different plan, slower (especially in the second case).

If a "order_date" column is added to the "orders_products" table to make it into some kind of materialized view :

CREATE TABLE orders_products2 AS SELECT orders.order_id, orders.order_date, product_id FROM orders JOIN orders_products USING (order_id);
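The index itself can be created along these lines (a sketch; the index name below is the one that shows up in the plan):

CREATE INDEX orders_products2_pid_date ON orders_products2 (product_id, order_date);
ANALYZE orders_products2;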

With the index on (product_id, order_date) created as above, we get this:

Aggregate  (cost=100.44..100.45 rows=1 width=0) (actual time=0.176..0.177 rows=1 loops=1)
  ->  Nested Loop  (cost=0.00..100.42 rows=7 width=0) (actual time=0.083..0.168 rows=6 loops=1)
        ->  Index Scan using products_pkey on products  (cost=0.00..8.27 rows=1 width=4) (actual time=0.012..0.013 rows=1 loops=1)
              Index Cond: (product_id = 12345)
        ->  Nested Loop  (cost=0.00..92.08 rows=7 width=4) (actual time=0.068..0.147 rows=6 loops=1)
              ->  Index Scan using orders_products2_pid_date on orders_products2  (cost=0.00..33.50 rows=7 width=8) (actual time=0.053..0.076 rows=6 loops=1)
                    Index Cond: ((product_id = 12345) AND (order_date >= '2000-01-01'::date) AND (order_date <= '2000-02-01'::date))
              ->  Index Scan using orders_pkey on orders  (cost=0.00..8.36 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=6)
                    Index Cond: (orders.order_id = orders_products2.order_id)
 Total runtime: 0.246 ms

An index on (order_date,product_id) produces the same effect ; the index scan is slower, but the heap scan uses the same amount of IO.
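That variant would be created like this (the index name is an assumption):

CREATE INDEX orders_products2_date_pid ON orders_products2 (order_date, product_id);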

Two indexes, (order_date) and (product_id), strangely, do not produce a BitmapAnd ; instead a plan with more IO is chosen.


- denormalization (ie adding a column in one of your tables and a
multicolumn index)

For this query it is possible to duplicate the kuupaev column into the rid table.
However, most of the time seems to go to scanning the rid table, so I suspect
that this will not help.

Yes, most of the time goes to scanning rid table, and this is the time that should be reduced. Adding a date column in "rid" would allow you to create a multicolumn index on rid (dokumnr,date) which would massively speed up the particular query above.
	If you don't create a multicolumn index, this denormalization is useless.

	Basically, instead of scanning all the rows in "rid" where toode='X05' regardless of their date, the scan would only have to read the rows that also fall within the requested date range, which is a much smaller set.
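	A rough sketch of that denormalization, mirroring the (product_id, order_date) index from the test above (the column and index names here are assumptions):

ALTER TABLE rid ADD COLUMN kuupaev DATE;
UPDATE rid SET kuupaev = dok.kuupaev FROM dok WHERE dok.dokumnr = rid.dokumnr;
CREATE INDEX rid_toode_kuupaev ON rid (toode, kuupaev);
ANALYZE rid;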


- materialized views
- materialized summary tables (ie. summary of sales for last month, for
instance)

There are about 1000 items and reports are different.

	It all depends on what you put in your summary table...
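	For example (purely illustrative; the table and index names below are made up, only toode and kuupaev come from your schema), a per-item, per-day row count that several reports could be derived from might look like:

CREATE TABLE sales_summary AS
  SELECT rid.toode, dok.kuupaev, count(*) AS line_count
    FROM dok
    JOIN rid USING (dokumnr)
   GROUP BY rid.toode, dok.kuupaev;
CREATE INDEX sales_summary_toode_kuupaev ON sales_summary (toode, kuupaev);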



