Re: hash join vs nested loop join

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Kevin

Again, many thanks for your time and help.

On 14 December 2012 02:26, Kevin Grittner <kgrittn@xxxxxxxx> wrote:
Huan Ruan wrote:

> Hash 1st run

> "Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)
> (actual time=2182.450..88158.645 rows=48257 loops=1)"

> " -> Seq Scan on invtran bigtable (cost=0.00..4730787.28
> rows=168121728 width=108) (actual time=0.051..32581.052
> rows=168121657 loops=1)"

194 nanoseconds per row suggests 100% cache hits.

> NL 1st run

> "Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
> time=0.056..551.438 rows=48257 loops=1)"

> " -> Index Scan using pk_invtran on invtran bigtable
> (cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010
> rows=1 loops=48261)"

10 microseconds per index scan (each index scan requiring multiple
"random" accesses) also suggests 100% cache hits.

Interesting to see how you derived 100% cache hits. I assume by 'cache' you mean the pg shared buffer plus the OS cache? Because the table is 23GB but the shared buffer is only 6GB. Even then, I'm not completely convinced because the total RAM is just 24GB, part of which will have to be used for other data and indexes.

I read somewhere that a pg shared buffer that's too big can hurt the performance and it's better just leave it to the OS cache. I'm not sure why but for now, I just configured the shared buffer to be 1/4 of the total RAM.


> I originally reduced random_page_cost to 2 to achieve the nested
> loop join. Now I set cpu_tuple_cost to 0.05 and reset
> random_page_cost back to 4, I can also achieve a nested loop
> join.
>
> I'm still new in Postgres, but I'm worried about random_page_cost
> being 2 is too low, so maybe increasing cpu_tuple_cost is a
> better choice.

If these are typical of what you would expect in production, then
the fact that with default cost factors the costs are barely
different (by 0.6%) for actual run times which differ by two orders
of magnitude (the chosen plan is 160 times slower) means that the
modeling of cost factors is off by a lot.

If you expect the active portion of your database to be fully
cached like this, it makes sense to reduce random_page_cost to be
equal to seq_page_cost. But that only adjusts the costs by at most
a factor of four, and we've established that in the above query
they're off by a factor of 160. To help make up the difference, it
makes sense to de-emphasize page access compared to cpu-related
costs by reducing both page costs to 0.1. Combined, these
adjustments still can't compensate for how far off the estimate
was.

In my experience default cpu_tuple_cost is understated compared to
other cpu-related costs, so I would do the above *plus* a boost to
cpu_tuple_cost. Personally, I have never seen a difference between
plans chosen with that set to 0.03 and 0.05, so I can't say where
in that range is the ideal value; you should feel free to
experiment if there is a query which seems to be choosing a bad
plan. If the above results really do represent cache hit levels you
expect in production, the combination of the above changes should
come reasonably close to modeling costs realistically, resulting in
better plan choice.

In production, 60% of the database would be able to fit in the RAM. But roughly, all the active data we need to use should be able to fit in 100%. On the test server I'm playing with now, RAM is only 8% of the database size. Nonetheless, I will play with these parameters like you suggested.

I was wondering on our production server where the effetive_cache_size will be much bigger, will pg then guess that probably most data is cached anyway therefore leaning towards nested loop join rather than a scan for hash join?

Even on a test server where the cache hit rate is much smaller, for a big table like this, under what circumstances, will a hash join perform better than nested loop join though? 
 

If you don't expect such high cache hit ratios in production, you
probably don't want to go so low with page costs.

>>> - shared_buffers = 6GB
>>> - effective_cache_size = 18GB
>>> - work_mem = 10MB
>>> - maintenance_work_mem = 3GB

> Can you see any obvious issues with the other memory settings I
> changed?

I might bump up work_mem to 20MB to 60MB, as long as you're not
going crazy with max_connections. I would probably take
maintenance_work_mem down to 1GB to 2GB -- you can have several of
these allocations at one time, and you don't want to blow away your
cache. (I think it might actually be adjusted down to 2GB
internally anyway; but I would need to check.)

Yes, I had bumped up work_mem yesterday to speed up another big group by query. I used 80MB. I assumed this memory will only be used if the query needs it and will be released as soon as it's finished, so it won't be too much an issue as long as I don't have too many concurrently sorting queries running (which is true in our production). Is this correct?

I increased maintenance_work_mem initially to speed up the index creation when I first pump in the data. In production environment, we don't do run time index creation, so I think only the vacuum and analyze will consume this memory?

Thanks
Huan
 

-Kevin


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux