Re: Why is a hash join preferred when it does not fit in work_mem

David Rowley <dgrowleyml@xxxxxxxxx> · Fri, 13 Jan 2023 14:38:26 +1300

On Fri, 13 Jan 2023 at 07:33, Dimitrios Apostolou <jimis@xxxxxxx> wrote:
>
> I have a very simple NATURAL JOIN that does not fit in the work_mem.  Why
> does the query planner prefer a hash join that needs 361s, while with a
> sort operation and a merge join it takes only 13s?

It's a simple matter of that the Hash Join plan appears cheaper based
on the costs that the planner has calculated.

A better question to ask would be, where are the costs inaccurate? and why.

One thing I noticed in your EXPLAIN ANALYZE output is that the Index
Scan to workitems_ids costed more expensively than the Seq scan, yet
was faster.

> ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 width=237) (actual time=0.026..1912.312 rows=1373737 loops=1)

> ->  Index Scan using workitem_ids_pkey on public.workitem_ids (cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.111..1218.363 rows=1373737 loops=1)

Perhaps the Seq scan is doing more actual I/O than the index scan is.

> The low work_mem and the disabled memoization are set on purpose, in order
> to simplify a complex query, while reproducing the same problem that I
> experienced there. This result is the simplest query I could get, where
> the optimizer does not go for a faster merge join.
>
> From my point of view a merge join is clearly faster, because the hash
> table does not fit in memory and I expect a hash join to do a lot of
> random I/O. But the query planner does not see that, and increasing
> random_page_cost does not help either. In fact the opposite happens: the
> merge join gets a higher cost difference to the hash join, as I increase
> the random page cost!

I'd expect reducing random_page_cost to make the Mege Join cheaper as
that's where the Index Scan is. I'm not quite sure where you think the
random I/O is coming from in a batched hash join.

It would be interesting to see the same plans with SET track_io_timing
= on; set.  It's possible that there's less *actual* I/O going on with
the Merge Join plan vs the Hash Join plan.  Since we do buffered I/O,
without track_io_timing, we don't know if the read buffers resulted in
an actual disk read or a read from the kernel buffers.

David