Hello, I am doing some performances testing on Postgres & I discovered the following behavior, when using 2 different ways of writing selects (but doing the same aggregations at the end): 1. test case 1, using outer join: create table test2 as select soj_session_log_id, pv_timestamp, vi_pv_id,a.item_id, coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '3 day') then 1 else 0 end)) as recent_sales_3d1, coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '7 day') then 1 else 0 end)) as recent_sales_7d1, coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '14 day') then 1 else 0 end)) as recent_sales_14d1, coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '30 day') then 1 else 0 end)) as recent_sales_30d1, coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '60 day') then 1 else 0 end)) as recent_sales_60d1 from bm_us_views_main_1609 a left outer join bm_us_bids b on (b.item_id=a.item_id and b.bid_date<a.pv_timestamp and (b.bid_date>=a.pv_timestamp - INTERVAL '60 day')) where a.item_type in (7,9) and qty>1 group by soj_session_log_id, pv_timestamp, vi_pv_id, a.item_id;; This query doesn't use any index according to the explain plan: "HashAggregate (cost=672109.07..683054.81 rows=182429 width=49)" " -> Merge Left Join (cost=646489.83..668004.42 rows=182429 width=49)" " Merge Cond: (a.item_id = b.item_id)" " Join Filter: ((b.bid_date < a.pv_timestamp) AND (b.bid_date >= (a.pv_timestamp - '60 days'::interval)))" " -> Sort (cost=331768.62..332224.69 rows=182429 width=41)" " Sort Key: a.item_id" " -> Seq Scan on bm_us_views_main_1609 a (cost=0.00..315827.08 rows=182429 width=41)" " Filter: ((item_type = ANY ('{7,9}'::numeric[])) AND (qty > 1))" " -> Sort (cost=314669.01..320949.52 rows=2512205 width=19)" " Sort Key: b.item_id" " -> Seq Scan on bm_us_bids b (cost=0.00..47615.05 rows=2512205 width=19)" 2. Test case 2, using sub queries: create table test2 as select soj_session_log_id, pv_timestamp, vi_pv_id,item_id, coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '3 day' group by item_id ),0) as recent_sales_3d, coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '7 day' group by item_id ),0) as recent_sales_7d, coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '14 day' group by item_id ),0) as recent_sales_14d, coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '30 day' group by item_id ),0) as recent_sales_30d, coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '60 day' group by item_id ),0) as recent_sales_60d from bm_us_views_main_1609 a where item_type in (7,9) and qty>1; This query uses indexes according to the explain plan: "Seq Scan on bm_us_views_main_1609 a (cost=0.00..8720230.77 rows=182429 width=41)" " Filter: ((item_type = ANY ('{7,9}'::numeric[])) AND (qty > 1))" " SubPlan" " -> GroupAggregate (cost=0.00..9.21 rows=1 width=11)" " -> Index Scan using bm_us_bids_item_ix on bm_us_bids b (cost=0.00..9.20 rows=1 width=11)" " Index Cond: ((item_id = $0) AND (bid_date < $1) AND (bid_date >= ($1 - '60 days'::interval)))" " -> GroupAggregate (cost=0.00..9.21 rows=1 width=11)" " -> Index Scan using bm_us_bids_item_ix on bm_us_bids b (cost=0.00..9.20 rows=1 width=11)" " Index Cond: ((item_id = $0) AND (bid_date < $1) AND (bid_date >= ($1 - '30 days'::interval)))" " -> GroupAggregate (cost=0.00..9.21 rows=1 width=11)" " -> Index Scan using bm_us_bids_item_ix on bm_us_bids b (cost=0.00..9.20 rows=1 width=11)" " Index Cond: ((item_id = $0) AND (bid_date < $1) AND (bid_date >= ($1 - '14 days'::interval)))" " -> GroupAggregate (cost=0.00..9.21 rows=1 width=11)" " -> Index Scan using bm_us_bids_item_ix on bm_us_bids b (cost=0.00..9.20 rows=1 width=11)" " Index Cond: ((item_id = $0) AND (bid_date < $1) AND (bid_date >= ($1 - '7 days'::interval)))" " -> GroupAggregate (cost=0.00..9.21 rows=1 width=11)" " -> Index Scan using bm_us_bids_item_ix on bm_us_bids b (cost=0.00..9.20 rows=1 width=11)" " Index Cond: ((item_id = $0) AND (bid_date < $1) AND (bid_date >= ($1 - '3 days'::interval)))" The index bm_us_bids_item_ix is on columns item_id, bidder_id, bid_date QUESTION: Why the planner choose seq scan in the first case & indexes scan in the second case? In a more general way, I observed that the planner has difficulties to select index scans & does in almost all the cases seq scan, when doing join queries. After investigations, it looks like when you join table a with table b on a column x and y and you have an index on column x only, the planner is not able to choose the index scan. You have to build the index corresponding exactly to the join statement btw the 2 tables For example,by creating an new index on item_id and bid_date, the planner has been able to choose this last index in both cases. Would it be possible that the planner can choose in any case the closest index for queries having outer join Last thing, I am running Postgres 8.3.4 on a Windows laptop having 3.5Gb RAM, 161Gb disk and dual core 2.5Gz processor Regards, Julien Theulier
Attachment:
postgresql.conf
Description: Binary data
-- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance