Hello list,

I have a very simple NATURAL JOIN that does not fit in work_mem. Why does the query planner prefer a hash join that takes 361 s, while a sort plus merge join takes only 13 s?

The server is an old Mac Mini with a hard disk drive and only 4 GB of RAM. Postgres version info:

PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit

The low work_mem and the disabled memoization are set on purpose, in order to simplify a complex query while reproducing the same problem that I experienced there. The query below is the simplest one I could find where the optimizer does not go for the faster merge join.
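For reference, the non-default settings I used to reproduce this (they are also visible in the "Settings" line of the plans below) were set per session, roughly like this:

```sql
-- Session settings used to reproduce the problem:
SET work_mem = '1MB';                     -- deliberately low
SET hash_mem_multiplier = 1;
SET enable_memoize = off;                 -- deliberately disabled
SET max_parallel_workers_per_gather = 1;
SET effective_cache_size = '500MB';

-- For the second plan only, to force the merge join:
-- SET enable_hashjoin = off;
```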
From my point of view a merge join is clearly faster, because the hash table does not fit in memory and I expect a hash join to do a lot of random I/O. But the query planner does not see that, and increasing random_page_cost does not help either. In fact the opposite happens: the cost gap between the merge join and the hash join grows as I increase the random page cost!

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;

                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual time=184498.464..360606.257 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=15068 read=47434, temp read=56309 written=56309
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.040..1376.084 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared read=16459
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual time=184361.874..184361.875 rows=1373737 loops=1)
         Output: workitem_ids.workitem_id, workitem_ids.workitem_n
         Buckets: 4096  Batches: 512  Memory Usage: 759kB
         Buffers: shared hit=15068 read=30975, temp written=43092
         ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 width=237) (actual time=0.026..1912.312 rows=1373737 loops=1)
               Output: workitem_ids.workitem_id, workitem_ids.workitem_n
               Buffers: shared hit=15068 read=30975
 Settings: effective_cache_size = '500MB', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=2 read=6
 Planning Time: 0.568 ms
 Execution Time: 361106.876 ms
(20 rows)

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems
NATURAL JOIN workitem_ids;

                                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual time=5062.513..10866.313 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5343 read=66053, temp read=32621 written=32894
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids  (cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.111..1218.363 rows=1373737 loops=1)
         Output: workitem_ids.workitem_n, workitem_ids.workitem_id
         Buffers: shared hit=5310 read=49627
   ->  Materialize  (cost=609372.91..627887.88 rows=3702994 width=8) (actual time=5062.389..7392.640 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared hit=33 read=16426, temp read=32621 written=32894
         ->  Sort  (cost=609372.91..618630.40 rows=3702994 width=8) (actual time=5062.378..6068.703 rows=3702994 loops=1)
               Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
               Sort Key: tasks_mm_workitems.workitem_n
               Sort Method: external merge  Disk: 65256kB
               Buffers: shared hit=33 read=16426, temp read=32621 written=32894
               ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.045..1177.202 rows=3702994 loops=1)
                     Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
                     Buffers: shared hit=33 read=16426
 Settings: effective_cache_size = '500MB', enable_hashjoin = 'off', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.677 ms
 Execution Time: 13364.545 ms
(23 rows)

Thank you in advance,
Dimitris