Yeah, I had forgotten to set it up correctly on this test environment
(its value is correctly set in production environments). Setting it to a
few gigabytes here gives me this cost:
bacula=# explain select pathid, filename from batch join path using (path);
                                 QUERY PLAN
----------------------------------------------------------------------------
 Nested Loop  (cost=0.56..2083904.10 rows=479020 width=26)
   ->  Seq Scan on batch  (cost=0.00..11727.20 rows=479020 width=85)
   ->  Index Scan using idx_path on path  (cost=0.56..4.32 rows=1 width=16)
         Index Cond: (path = batch.path)
(4 rows)
It still chooses the hash join, but by a smaller margin.
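For reference, the session-level way to try that same change is something like the following (the 4GB figure is only a stand-in for "a few gigabytes"; match it to the RAM actually available for caching):

set effective_cache_size = '4GB';  -- assumed value, not the one used above
explain select pathid, filename from batch join path using (path);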
This is still a tangent from your original point, but if I create an index on path (path, pathid), then I can get an index-only scan. This is actually not much faster when everything is already cached, but the planner thinks it will be about 2x faster because it assumes the visibility map (vm) block accesses are free. So this might be enough to tip it over for you.
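A sketch of what I mean (the index name is made up, and the vacuum is there because the index-only scan only skips heap fetches while the visibility map is up to date):

create index path_path_pathid_idx on path (path, pathid);  -- covering index for this join
vacuum analyze path;  -- refresh the visibility map and the statistics
explain select pathid, filename from batch join path using (path);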
And it will still only access a very small part of path (always the same
5000 records) during the query, which isn't accounted for in the cost, if
I understand correctly?
I think you are correct that it doesn't take into account ndistinct being 10 to 100 fold less than ntuples on the outer loop, which theoretically could propagate down to the table size used in connection with effective_cache_size.
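If you want to see what the planner currently believes there, the n_distinct estimate for the join column is visible in pg_stats (the table and column names below just follow your example, and assume both tables have been analyzed):

select tablename, attname, n_distinct
from pg_stats
where tablename in ('batch', 'path') and attname = 'path';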
It seems to me the cost of the hash join is being greatly underestimated, which I think is more important than the cost of the nested loop being overestimated. (And in my hands, the merge join is actually the winner both in the planner and in reality, but I suspect this is because all of your fake paths are lexically greater than all of the real paths.)
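If you want to compare the three plans directly on your data, the usual trick is to turn off the join methods one at a time for the session (these are the standard planner GUCs):

set enable_hashjoin = off;   -- steer the planner away from the hash join
explain analyze select pathid, filename from batch join path using (path);
set enable_mergejoin = off;  -- with both off, only the nested loop is left
explain analyze select pathid, filename from batch join path using (path);
reset enable_hashjoin;
reset enable_mergejoin;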
Also, you talked earlier about cheating the planner by lowering random_page_cost. But why is that cheating? If caching means the cost is truly lower...
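And you don't have to commit to it globally just to experiment; it can be changed per session (1.1 here is only an illustrative value for a mostly cached database):

set random_page_cost = 1.1;  -- close to seq_page_cost when reads rarely hit disk
explain select pathid, filename from batch join path using (path);
reset random_page_cost;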
Cheers,
Jeff