HashJoin order, hash the large or small table? Postgres likes to hash the big one, why?

Scott Carey <scott@xxxxxxxxxxxxxxxxx> · Mon, 18 Oct 2010 18:40:11 -0700

8.4.5

I consistently see HashJoin plans that hash the large table, and scan the small table.  This is especially puzzling in some cases where I have 30M rows in the big table and ~ 100 in the small... shouldn't it hash the small table and scan the big one?

Here is one case I saw just recently

               Hash Cond: ((a.e_id)::text = (ta.name)::text)
               ->  Index Scan using c_a_s_e_id on a  (cost=0.00..8.21 rows=14 width=27)
                     Index Cond: (id = 12)
               ->  Hash  (cost=89126.79..89126.79 rows=4825695 width=74)
                     ->  Seq Scan on p_a_1287446030 tmp  (cost=0.00..89126.79 rows=4825695 width=74)
                           Filter: (id = 12)

Does this ever make sense?  Isn't it always better to hash the smaller side of the join, or at least predominantly so?  Maybe if  you want the order of elements returning from the join to coincide with the order of the outer part of the join for a join higher up the plan tree.  in this specific case, I want the order to be based on the larger table for the join higher up (not shown) in the plan so that its index scan is in the order that tmp already is.

Certainly, for very small hash tables (< 1000 entries) the cache effects strongly favor small tables -- the lookup should be very cheap.  Building a very large hash is not cheap, and wastes lots of memory.  I suppose at very large sizes something else might come into play that favors hashing the bigger table, but I can't think of what that would be for the general case.

Any ideas?  I've seen this with dozens of queries, some simple, some with 5 or 6 tables and joins.  I even tried making work_mem very small in a 30M row to 500 row join, and it STILL hashed the big table.  At first I thought that I was reading the plan wrong, but google suggests its doing what it looks like its doing.  Perhaps this is a bug?
-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance