Huan Ruan wrote:

> Hash 1st run
> "Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)
> (actual time=2182.450..88158.645 rows=48257 loops=1)"
> "  ->  Seq Scan on invtran bigtable (cost=0.00..4730787.28
> rows=168121728 width=108) (actual time=0.051..32581.052
> rows=168121657 loops=1)"

194 nanoseconds per row (32581 ms / 168 million rows) suggests 100%
cache hits.

> NL 1st run
> "Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
> time=0.056..551.438 rows=48257 loops=1)"
> "  ->  Index Scan using pk_invtran on invtran bigtable
> (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010
> rows=1 loops=48261)"

10 microseconds per index scan (0.010 ms per loop, each index scan
requiring multiple "random" accesses) also suggests 100% cache hits.

> I originally reduced random_page_cost to 2 to achieve the nested
> loop join. Now that I've set cpu_tuple_cost to 0.05 and reset
> random_page_cost back to 4, I can also achieve a nested loop
> join.
>
> I'm still new to Postgres, but I'm worried that a
> random_page_cost of 2 is too low, so maybe increasing
> cpu_tuple_cost is a better choice.

If these are typical of what you would expect in production, then the
fact that with default cost factors the estimated costs are barely
different (by 0.6%) for actual run times which differ by two orders
of magnitude (the chosen plan is 160 times slower) means the cost
model is off by a lot.

If you expect the active portion of your database to be fully cached
like this, it makes sense to reduce random_page_cost to be equal to
seq_page_cost. But that only adjusts the costs by at most a factor of
four, and we've established that in the above query they're off by a
factor of 160. To help make up the difference, it makes sense to
de-emphasize page access relative to cpu-related costs by reducing
both page costs to 0.1. Even combined, these adjustments can't fully
compensate for how far off the estimate was.

In my experience the default cpu_tuple_cost is understated compared
to the other cpu-related costs, so I would do the above *plus* a
boost to cpu_tuple_cost. Personally, I have never seen a difference
between plans chosen with that set to 0.03 and with it set to 0.05,
so I can't say where in that range the ideal value lies; feel free to
experiment whenever a query seems to be choosing a bad plan. (There
is a sketch of one way to run such an experiment in the P.S. below.)

If the above results really do represent the cache hit levels you
expect in production, the combination of these changes should come
reasonably close to modeling costs realistically, resulting in better
plan choices. If you don't expect such high cache hit ratios in
production, you probably don't want to go so low with the page costs.

>>> - shared_buffers = 6GB
>>> - effective_cache_size = 18GB
>>> - work_mem = 10MB
>>> - maintenance_work_mem = 3GB

> Can you see any obvious issues with the other memory settings I
> changed?

I might bump work_mem up to somewhere in the 20MB to 60MB range, as
long as you're not going crazy with max_connections. I would probably
take maintenance_work_mem down to 1GB to 2GB -- you can have several
of these allocations at one time, and you don't want to blow away
your cache. (I think it might actually be adjusted down to 2GB
internally anyway, but I would need to check.) The second P.S. below
shows what the conf lines might look like.

-Kevin
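
P.S. If you want to try the cost-factor changes before touching
postgresql.conf, here is a minimal sketch of a session-level
experiment. The GUC names and values are the ones discussed above;
the query itself is hypothetical ("smalltable" and the join column
"invtran_id" are stand-ins -- substitute the actual join that was
picking the hash plan):

    -- Session-level experiment: SET LOCAL changes disappear when the
    -- transaction ends, so nothing here persists past the ROLLBACK.
    BEGIN;
    SET LOCAL seq_page_cost = 0.1;     -- de-emphasize page access (fully cached)
    SET LOCAL random_page_cost = 0.1;  -- equal to seq_page_cost at ~100% cache hits
    SET LOCAL cpu_tuple_cost = 0.03;   -- anywhere in 0.03..0.05 has looked the same to me

    -- Hypothetical query: replace with the real join from your report.
    EXPLAIN ANALYZE
    SELECT *
      FROM invtran bigtable
      JOIN smalltable USING (invtran_id);

    ROLLBACK;

If the nested loop plan shows up (and really is faster) with those
settings, you can then make them permanent in postgresql.conf.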
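
P.P.S. The memory settings, written out as postgresql.conf lines. The
exact numbers are illustrative picks from the ranges I suggested, not
measurements from your workload:

    # postgresql.conf -- illustrative values from the ranges above
    shared_buffers = 6GB            # as you have it
    effective_cache_size = 18GB     # as you have it
    work_mem = 32MB                 # somewhere in 20MB..60MB; each backend
                                    # can use several of these per query, so
                                    # size it against max_connections
    maintenance_work_mem = 1GB      # down from 3GB; several maintenance
                                    # operations can hold this much at once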