Hi all,
I seem to be having some grief with the 9.1.9 query planner favouring an index scan + merge join, over a sequential scan + hash join.
Logically I would have considered the index+merge to be faster, as suggested by the explain output - but in practice, it is in fact slower by orders of magnitude.
In my timings below, I've tried to reduce the impact of any OS or shared_buffer level caching (restarting postgres, and flushing OS cache between queries-).
I've provided my settings as shown:
=# show seq_page_cost;
seq_page_cost
---------------
1
(1 row)
Time: 0.355 ms
=# show random_page_cost;
random_page_cost
------------------
2.2
(1 row)
Time: 0.084 ms
=# show cpu_tuple_cost;
cpu_tuple_cost
----------------
0.01
(1 row)
Time: 0.077 ms
=# show cpu_index_tuple_cost;
cpu_index_tuple_cost
----------------------
0.005
(1 row)
Time: 0.065 ms
=# show cpu_operator_cost;
cpu_operator_cost
-------------------
0.0025
(1 row)
Time: 0.064 ms
=# show effective_cache_size;
effective_cache_size
----------------------
12GB
(1 row)
-- QEP's for 9.1.9
=# explain (analyse,buffers) select * from archive.users inner join live.addresses using (address_id);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=18.79..159615.38 rows=1887786 width=131) (actual time=0.023..602386.955 rows=1862872 loops=1)
Merge Cond: (addresses.address_id = users.address_id)
Buffers: shared hit=1655113 read=382486
-> Index Scan using addresses_pkey on addresses (cost=0.00..52609.75 rows=1872220 width=22) (actual time=0.008..1440.294 rows=1872220 loops=1)
Buffers: shared hit=473352 read=18328
-> Index Scan using address_id_users on users (cost=0.00..3075311.08 rows=73741592 width=117) (actual time=0.005..598455.258 rows=1862873 loops=1)
Buffers: shared hit=1181761 read=364158
Total runtime: 602548.352 ms
(8 rows)
Time: 603090.399 ms
=# set enable_indexscan=off;
SET
Time: 0.219 ms
=# explain (analyse,buffers) select * from archive.users inner join live.addresses using (address_id);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=55340.95..2707034.64 rows=1887786 width=131) (actual time=1279.659..36886.595 rows=1862872 loops=1)
Hash Cond: (users.address_id = addresses.address_id)
Buffers: shared hit=6 read=1079019
-> Seq Scan on users (cost=0.00..1803222.92 rows=73741592 width=117) (actual time=5.082..26430.189 rows=73741544 loops=1)
Buffers: shared hit=2 read=1065805
-> Hash (cost=31938.20..31938.20 rows=1872220 width=22) (actual time=1273.432..1273.432 rows=1872220 loops=1)
Buckets: 262144 Batches: 1 Memory Usage: 112381kB
Buffers: shared hit=2 read=13214
-> Seq Scan on addresses (cost=0.00..31938.20 rows=1872220 width=22) (actual time=7.190..553.516 rows=1872220 loops=1)
Buffers: shared hit=2 read=13214
Total runtime: 37014.912 ms
(11 rows)
Time: 37518.029 ms
The only way I can artificially convince the planner to choose the sequential scan method is to increase cpu_index_tuple_cost from 0.005 to 1.4
This suggests something is really really wrong with the statistics on this table, as that shouldn't be necessary.
Interestingly, on another instance of this same database running on postgres 8.3.8, the query planner correctly chooses the sequential scan method - having more sane cost estimates for the index scan method.
-- QEP\s for 8.3.8
=# explain select * from archive.users inner join live.addresses using (address_id);
QUERY PLAN
--------------------------------------------------------------------------------
Hash Join (cost=55340.95..2783655.68 rows=1949180 width=133)
Hash Cond: (users.address_id = addresses.address_id)
-> Seq Scan on users (cost=0.00..1879254.32 rows=73739432 width=119)
-> Hash (cost=31938.20..31938.20 rows=1872220 width=22)
-> Seq Scan on addresses (cost=0.00..31938.20 rows=1872220 width=22)
(5 rows)
=# set enable_seqscan=off;
SET
=# explain select * from archive.users inner join live.addresses using (address_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Merge Join (cost=6.98..3496768.28 rows=1949180 width=133)
Merge Cond: (addresses.address_id = users.address_id)
-> Index Scan using addresses_pkey on addresses (cost=0.00..65190.54 rows=1872220 width=22)
-> Index Scan using address_id_users on users (cost=0.00..128734108.12 rows=73739432 width=119)
(4 rows)
The sequential scan method is most definitely the faster solution (on both 8.3.8 and 9.1.9)
I've rebuilt the indexes, and tried increasing the target statistics for the address_id column and re-analyzing, but to no avail.
If also tried playing with work_mem and effective_cache_size (for a large range of values). These have certainly altered the costs, but never to the point of preferring the seq_scan over the index scan.
I'm at a loss as to where to go next with this. If the index-scan truly should be faster, then I could only suspect IO issues with the disk. Moving the indexes themselves to another drive has had negligible impact to the execution times. I've not tried moving the table data… this might be my next course of action to rule out IO issues on that drive.
If on the other hand, there are no IO issues.. How can I convince postgres to prefer the seq_scan/merge over the index_scan/hash_join ?
Can anyone suggest what else I might look at here?
Thanks.
Tim