Why is a hash join being used?

Tim Jacobs <tjacobs2@xxxxxxxxxxxxx> · Tue, 19 Jun 2012 17:34:56 -0400

I am running the following query:

SELECT res1.x, res1.y, res1.z
FROM test t
JOIN residue_atom_coords res1 ON
		t.struct_id_1 = res1.struct_id AND
		res1.atomno IN (1,2,3,4) AND 
		(res1.seqpos BETWEEN t.pair_1_helix_1_begin AND t.pair_1_helix_1_end)
WHERE
t.compare_id BETWEEN 1 AND 10000;

The 'test' table is very large (~270 million rows), as is the residue_atom_coords table (~540 million rows).

The number of compare_ids I select in the 'WHERE' clause determines the join type in the following way:

t.compare_id BETWEEN 1 AND 5000;

 Nested Loop  (cost=766.52..15996963.12 rows=3316307 width=24)
   ->  Index Scan using test_pkey on test t  (cost=0.00..317.20 rows=5372 width=24)
         Index Cond: ((compare_id >= 1) AND (compare_id <= 5000))
   ->  Bitmap Heap Scan on residue_atom_coords res1  (cost=766.52..2966.84 rows=625 width=44)
         Recheck Cond: ((struct_id = t.struct_id_1) AND (seqpos >= t.pair_1_helix_1_begin) AND (seqpos <= t.pair_1_helix_1_end) AND (atomno = ANY ('{1,2,3,4}'::integer[])))
         ->  Bitmap Index Scan on residue_atom_coords_pkey  (cost=0.00..766.36 rows=625 width=0)
               Index Cond: ((struct_id = t.struct_id_1) AND (seqpos >= t.pair_1_helix_1_begin) AND (seqpos <= t.pair_1_helix_1_end) AND (atomno = ANY ('{1,2,3,4}'::integer[])))

t.compare_id BETWEEN 1 AND 10000;

 Hash Join  (cost=16024139.91..20940899.94 rows=6633849 width=24)
   Hash Cond: (t.struct_id_1 = res1.struct_id)
   Join Filter: ((res1.seqpos >= t.pair_1_helix_1_begin) AND (res1.seqpos <= t.pair_1_helix_1_end))
   ->  Index Scan using test_pkey on test t  (cost=0.00..603.68 rows=10746 width=24)
         Index Cond: ((compare_id >= 1) AND (compare_id <= 10000))
   ->  Hash  (cost=13357564.16..13357564.16 rows=125255660 width=44)
         ->  Seq Scan on residue_atom_coords res1  (cost=0.00..13357564.16 rows=125255660 width=44)
               Filter: (atomno = ANY ('{1,2,3,4}'::integer[]))

The nested loop join performs very quickly, whereas the hash join is incredibly slow. If I disable the hash join temporarily, a nested loop join is used in the second case and the query runs much more quickly. How can I change my configuration to favor the nested loop join in this case? Is this a bad idea? Alternatively, since I will be doing selections like this many times, what indexes can be put in place to expedite the query without mucking with the query optimizer? I've already created an index on the struct_id field of residue_atom_coords (each unique struct_id should have only a small number of rows in the residue_atom_coords table).
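
A minimal sketch of one way to disable the hash join temporarily, assuming the enable_hashjoin planner setting is switched off with SET LOCAL inside a transaction (the exact method used for the test above is not shown, so this is an assumption):

```sql
BEGIN;

-- Assumption: discourage hash joins for this transaction only.
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
-- Note that "off" penalizes hash joins heavily rather than forbidding them.
SET LOCAL enable_hashjoin = off;

-- Same query as above, with the larger compare_id range.
EXPLAIN
SELECT res1.x, res1.y, res1.z
FROM test t
JOIN residue_atom_coords res1 ON
    t.struct_id_1 = res1.struct_id AND
    res1.atomno IN (1,2,3,4) AND
    (res1.seqpos BETWEEN t.pair_1_helix_1_begin AND t.pair_1_helix_1_end)
WHERE
    t.compare_id BETWEEN 1 AND 10000;

ROLLBACK;
```

Scoping the setting to a transaction keeps it from affecting other queries in the same session, which is why SET LOCAL is used here rather than a plain SET.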

Thanks in advance,
Tim
