Hi everybody,
I have implemented my first app using PG DB and thought for a minute(may be two) that I know something about PG but below problem totally destroyed my confidence :). Please help me to restore it.
Here is simple join query. It runs just fine on MS SQL 2008 and uses all available indexes using even bigger overall dataset.
Quick performance stat
MS SQL: 1 second, 264K rows
PG: 158 seconds, 264K rows
Explain plan from both DBs
PG QUERY PLAN
Hash Join (cost=12716.17..1101820.09 rows=248494 width=8)
Hash Cond: (views.visit_id = visits.id)
-> Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)
-> Hash (cost=10549.16..10549.16 rows=132081 width=4)
-> Index Scan using visits_created_at_index on visits (cost=0.00..10549.16 rows=132081 width=4)
Index Cond: ((created_at >= '2012-11-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-11-16 00:00:00'::timestamp without time zone))
schemaname | tablename | indexname | tablespace | indexdef
------------+-----------+---------------------------------+------------+------------------------------------------------------------------------------------------
public | views | views_pkey | | CREATE UNIQUE INDEX views_pkey ON views USING btree (id)
public | views | views_visit_id_index | | CREATE INDEX views_visit_id_index ON views USING btree (visit_id)
MS SQL Query plan
'11/16/2012'
|--Parallelism(Gather Streams)
|--Nested Loops(Inner Join, OUTER REFERENCES:([visits].[id], [Expr1006]) OPTIMIZED WITH UNORDERED PREFETCH)
|--Index Seek(OBJECT:([visits].[test]), SEEK:([visits].[created_at] >= '2012-11-15 00:00:00.000' AND [visits].[created_at] < '2012-11-16 00:00:00.000') ORDERED FORWARD)
|--Index Seek(OBJECT:([views].[views_visit_id_index]), SEEK:([views].[visit_id]=[raw_visits].[id]) ORDERED FORWARD)
It is clear that PG does full table scan "Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)"
Don't understand why PG doesn't use views_visit_id_index in that query but rather scans whole table. One explanation I have found that when resulting dataset constitutes ~15% of total number of rows in the table then seq scan is used. In this case resulting dataset is just 1.5% of total number of rows. So it must be something different. Any reason why it happens and how to fix it?
Postgres 9.2
Ubuntu 12.04.1 LTS
shared_buffers = 4GB the rest of the settings are default ones
Thanks
-Alex