On Fri, 13 Jan 2023 at 07:33, Dimitrios Apostolou <jimis@xxxxxxx> wrote: > > I have a very simple NATURAL JOIN that does not fit in the work_mem. Why > does the query planner prefer a hash join that needs 361s, while with a > sort operation and a merge join it takes only 13s? It's a simple matter of that the Hash Join plan appears cheaper based on the costs that the planner has calculated. A better question to ask would be, where are the costs inaccurate? and why. One thing I noticed in your EXPLAIN ANALYZE output is that the Index Scan to workitems_ids costed more expensively than the Seq scan, yet was faster. > -> Seq Scan on public.workitem_ids (cost=0.00..59780.19 rows=1373719 width=237) (actual time=0.026..1912.312 rows=1373737 loops=1) > -> Index Scan using workitem_ids_pkey on public.workitem_ids (cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.111..1218.363 rows=1373737 loops=1) Perhaps the Seq scan is doing more actual I/O than the index scan is. > The low work_mem and the disabled memoization are set on purpose, in order > to simplify a complex query, while reproducing the same problem that I > experienced there. This result is the simplest query I could get, where > the optimizer does not go for a faster merge join. > > From my point of view a merge join is clearly faster, because the hash > table does not fit in memory and I expect a hash join to do a lot of > random I/O. But the query planner does not see that, and increasing > random_page_cost does not help either. In fact the opposite happens: the > merge join gets a higher cost difference to the hash join, as I increase > the random page cost! I'd expect reducing random_page_cost to make the Mege Join cheaper as that's where the Index Scan is. I'm not quite sure where you think the random I/O is coming from in a batched hash join. It would be interesting to see the same plans with SET track_io_timing = on; set. It's possible that there's less *actual* I/O going on with the Merge Join plan vs the Hash Join plan. Since we do buffered I/O, without track_io_timing, we don't know if the read buffers resulted in an actual disk read or a read from the kernel buffers. David