hash join vs nested loop join

Huan Ruan <leohuanruan@xxxxxxxxx> · Wed, 12 Dec 2012 15:25:48 +1100

Hello All

While investigating switching to Postgres, we come across a query plan 
that uses hash join and is a lot slower than a nested loop join.

I don't understand why the optimiser chooses the hash join in favor of 
the nested loop. What can I do to get the optimiser to make a better 
decision (nested loop in this case)? I have run analyze on both tables.

The query is,

/*

     smalltable has about 48,000 records.

     bigtable has about 168,000,000 records.

     invtranref is char(10) and is the primary key for both tables

  */

  SELECT

    *

  FROM IM_Match_Table smalltable

    inner join invtran bigtable on 
    bigtable.invtranref = smalltable.invtranref

The hash join plan is,

"Hash Join  (cost=1681.87..6414169.04 rows=48261 width=171)"

  "  Output: smalltable.invtranref, smalltable.itbatchref, 
smalltable.trantype, smalltable.trandate, smalltable.invprodref, 
smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref, 
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, 
bigtable.trantype, bigtable.trandate, bigtable.pricedate, 
bigtable.units, bigtable.tranamount, bigtable.createmode, 
bigtable.transtat, bigtable.sysversion, bigtable.sysuser, 
bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"

  "  Hash Cond: (bigtable.invtranref = smalltable.invtranref)"

  "  ->  Seq Scan on public.invtran bigtable  (cost=0.00..4730787.28 rows=168121728 width=108)"

  "        Output: bigtable.itbatchref, bigtable.invtranref, 
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, 
bigtable.trantype, bigtable.trandate, bigtable.pricedate, 
bigtable.units, bigtable.tranamount, bigtable.createmode, 
bigtable.transtat, bigtable.sysversion, bigtable.sysuser, 
bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"

  "  ->  Hash  (cost=1078.61..1078.61 rows=48261 width=63)"

  "        Output: smalltable.invtranref, smalltable.itbatchref, 
smalltable.trantype, smalltable.trandate, smalltable.invprodref, 
smalltable.invheadref"

  "        ->  Seq Scan on public.im_match_table smalltable  (cost=0.00..1078.61 rows=48261 width=63)"

  "              Output: smalltable.invtranref, 
smalltable.itbatchref, smalltable.trantype, smalltable.trandate, 
smalltable.invprodref, smalltable.invheadref"

The nested loop join plan is,

"Nested Loop  (cost=0.00..12888684.07 rows=48261 width=171)"

  "  Output: smalltable.invtranref, smalltable.itbatchref, 
smalltable.trantype, smalltable.trandate, smalltable.invprodref, 
smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref, 
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, 
bigtable.trantype, bigtable.trandate, bigtable.pricedate, 
bigtable.units, bigtable.tranamount, bigtable.createmode, 
bigtable.transtat, bigtable.sysversion, bigtable.sysuser, 
bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"

  "  ->  Seq Scan on public.im_match_table smalltable  (cost=0.00..1078.61 rows=48261 width=63)"

  "        Output: smalltable.invtranref, smalltable.itbatchref, 
smalltable.trantype, smalltable.trandate, smalltable.invprodref, 
smalltable.invheadref"

  "  ->  Index Scan using pk_invtran on public.invtran bigtable  (cost=0.00..267.03 rows=1 width=108)"

  "        Output: bigtable.itbatchref, bigtable.invtranref, 
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref, 
bigtable.trantype, bigtable.trandate, bigtable.pricedate, 
bigtable.units, bigtable.tranamount, bigtable.createmode, 
bigtable.transtat, bigtable.sysversion, bigtable.sysuser, 
bigtable.rectype, bigtable.recstat, bigtable.seqnum, bigtable.transign"

  "        Index Cond: (bigtable.invtranref = smalltable.invtranref)"

The version is PostgreSQL 9.2.0 on x86_64-unknown-linux-gnu, compiled by
 gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit. Server specs are:

Centos, ext4

24GB memory 

6 cores hyper-threaded (Intel(R) Xeon(R) CPU E5645). 
raid 10 on 4 sata disks
Config changes are
shared_buffers = 6GB
effective_cache_size = 18GB
work_mem = 10MB
maintenance_work_mem = 3GB
Many Thanks

Huan