
Re: Why is a hash join preferred when it does not fit in work_mem


On Sat, 14 Jan 2023, Tom Lane wrote:

> Dimitrios Apostolou <jimis@xxxxxxx> writes:
>> Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here
>> is how I understand things according to posts I've read, and classical
>> algorithms:
>>
>> + The Hash Join is fastest when one side fits in work_mem. Then on one
>>   hand you have a hash table lookup (amortized O(1)) and on the other
>>   hand, if the table has M rows that do not fit in memory, you have
>>   sequential reads from the disk (given low fragmentation of the table or
>>   index files): for every row you read from the disk, you look up the key
>>   in the hash table.
>>
>>   If the hash table does not fit in RAM then the cost becomes prohibitive.
>>   Every lookup is a random access possibly hitting the disk. The total
>>   cost should be random_page_cost * M.

> That would be true of a simple hash join, but Postgres uses batched
> hash joins: we split up the hash key space into subsets, where hopefully
> each subset includes few enough inner-side rows to fit into work_mem.
> While this can go wrong given pathological distribution of the inner-side
> keys, it does mean that the join can perform well even when the inner
> side is much larger than work_mem.  So it's not the case that the planner
> will simply disregard hash joins beyond work_mem.  It will apply a cost
> penalty for the predicted batching overhead;

Thanks for this, I found a page [1] that describes the hash join and
now I understand a bit more.

[1] https://www.interdb.jp/pg/pgsql03.html
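
(For the record, here is how the batching shows up in EXPLAIN ANALYZE
output. This is only a sketch; the query is assumed from the Hash Cond
quoted below:)

  SET work_mem = '1MB';
  EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM tasks_mm_workitems
    JOIN workitem_ids USING (workitem_n);
  -- the Hash node then reports something like:
  --   Buckets: 4096  Batches: 512  Memory Usage: 759kB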

I'm not sure whether the key distribution is pathological in my case.
The join condition is:

  Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)

and workitem_ids.workitem_n is an integer GENERATED AS IDENTITY and PRIMARY
KEY. The table workitem_ids has 1.7M rows, and the other table has 3.7M
rows. Neither of them fits in work_mem.
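
(A minimal sketch of the schema as I understand it; the column names come
from the Hash Cond above, everything else is an assumption:)

  CREATE TABLE workitem_ids (
      workitem_n integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
      -- ... further columns ...
  );  -- ~1.7M rows

  CREATE TABLE tasks_mm_workitems (
      task_n     integer,  -- hypothetical name for the second integer column
      workitem_n integer REFERENCES workitem_ids (workitem_n)
  );  -- ~3.7M rows, only two integer columns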

In my (simplified and pathological) case of work_mem == 1MB, the hash join
does 512 batches (Buckets: 4,096  Batches: 512  Memory Usage: 759kB). I'm
not sure which batching strategy is followed, but based on that document it
should be the "hybrid hash join with skew". I don't quite follow the I/O
requirements of this algorithm yet. :-)
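
(Back-of-the-envelope: 512 batches at up to ~759kB each suggest very
roughly 512 * 0.76MB ~ 390MB of hashed inner-side data, so something along
these lines should collapse the join to a single batch; a sketch, with the
query again assumed:)

  SET work_mem = '512MB';  -- session-local experiment, not a server setting
  EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM tasks_mm_workitems
    JOIN workitem_ids USING (workitem_n);
  -- expecting the Hash node to report "Batches: 1"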

> but that can still come out
> cheaper than merge join, because the sorting needed for merge is generally
> also far from cheap.

I was under the impression that an on-disk merge sort is a relatively cheap
operation in terms of random I/O, needing only a logarithmic number of
passes over the data.
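
(My mental model is the textbook external merge sort: with B buffer pages
and N data pages it needs roughly

    passes ~ 1 + ceil( log_{B-1} (N / B) )

sweeps over the data, and each sweep reads and writes sequentially, so the
random-I/O component stays small.)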


>> So I would expect an increased random_page_cost to benefit the Merge Join
>> algorithm. And since my setup involves spinning disks, it does make sense
>> to increase it.

> What is probably really happening is that random_page_cost affects the
> estimated cost of performing the sort using an index scan instead of
> a bespoke sort step.  AFAIR, cost_sort doesn't consider random_page_cost
> at all, and neither does cost_hashjoin.
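
(For reference, a merge join can be forced for such an experiment along
these lines; a sketch, with the query again assumed:)

  SET enable_hashjoin = off;  -- planner falls back to the next-cheapest plan
  EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM tasks_mm_workitems
    JOIN workitem_ids USING (workitem_n);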

On the last EXPLAIN I posted for the forced merge join, I see that it uses
an index scan on the "small" table. That makes sense, since the join
happens on that table's primary key. On the large table it does not use an
index scan, because no index exists for that column; instead it sorts the
table's 3.7M rows (and FWIW that table only has two integer columns). That
is, if I understood correctly what you meant by "performing the sort using
an index scan".
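
(One thing I might try: an index on the large table's join column, so that
a merge join could read both inputs in key order and skip the sort
entirely. Hypothetical, since no such index exists today:)

  CREATE INDEX ON tasks_mm_workitems (workitem_n);
  ANALYZE tasks_mm_workitems;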


The problem I see is that the estimated cost of the sort operation
(609,372.91..618,630.40) is already way above the cost of the whole hash
join (121,222.68..257,633.01). The actual timings, however, are very
different: the Sort takes 4,602.569..5,414.072 ms, while the whole hash
join takes 145,641.295..349,682.387 ms.

Am I missing some configuration knobs that would put some sense into the
planner?
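
(For completeness, these are the knobs I am aware of; the values below are
placeholders for experimentation, not recommendations:)

  SET work_mem = '256MB';            -- drives the batching estimate
  SET hash_mem_multiplier = 2.0;     -- extra memory allowance for hash tables
  SET random_page_cost = 4.0;        -- per the above, not consulted by
                                     -- cost_sort or cost_hashjoin
  SET effective_cache_size = '4GB';  -- affects index scan costing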


Thanks,
Dimitris
