Re: Simple join doesn't use index

On Thursday, January 3, 2013, Alex Vinnik wrote:
Hi everybody,

I have implemented my first app using PG DB and thought for a minute (maybe two) that I know something about PG, but the problem below totally destroyed my confidence :). Please help me to restore it.

Here is a simple join query. It runs just fine on MS SQL 2008 and uses all available indexes, even with a bigger overall dataset.

from visits join views on visits.id = views.visit_id
where visits.created_at >= '11/15/2012' and visits.created_at < '11/16/2012' 

Quick performance stat

MS SQL: 1 second, 264K rows

If it returns 264K rows in 1 second, then it must have all data in memory.  Which prompts a few questions:

Is *all* data in memory, or is it just the data needed for this particular query because you already ran it recently with the same date range?

 
PG: 158 seconds,  264K rows

Does the PG machine have enough memory to hold all the data in RAM?  If so, does it actually have all the data in RAM? That is, is the cache already warm?  Starting from scratch it can take a long time for the cache to warm up naturally.  And finally, if all the data is in RAM, does PG know this?  

For the last question, the answer is "no", since you are using default settings.  You need to lower random_page_cost and probably also seq_page_cost in order to trick PG into thinking the data is in RAM.  Of course if you do this when the data is in fact not in RAM, the result could be catastrophically bad plans.  (And when I tried to replicate your situation on anemic hardware, the hash join was indeed often faster than the nested loop on both indexes.)
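Something along these lines, for example (the values are only illustrative and assume the working set really is cached; they are not tuned for your hardware):

set effective_cache_size = '48GB';  -- assumed value: roughly the RAM the OS has available for caching
set random_page_cost = 1.5;         -- much closer to seq_page_cost when reads rarely hit disk
set seq_page_cost = 1.0;            -- could also be lowered if you want to tell PG everything is cached

You can try these per-session with SET before committing anything to postgresql.conf.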


 

Explain plan from both DBs

PG QUERY PLAN
Hash Join  (cost=12716.17..1101820.09 rows=248494 width=8)
  Hash Cond: (views.visit_id = visits.id)
  ->  Seq Scan on views  (cost=0.00..819136.56 rows=17434456 width=8)

This cost estimate is probably due mostly to seq_page_cost and cpu_tuple_cost, which at their defaults (1 and 0.01) imply the table has about 645,000 blocks (819136 - 17434456/100) and each block has ~30 rows.
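If you want to check that arithmetic against the planner's own statistics, something like this should do it (table name taken from your plan):

select relpages, reltuples, reltuples / relpages as rows_per_block
from pg_class
where relname = 'views';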

But you are returning 248,494 rows, or roughly 1/2.5 of a row per block.  Let's say you need to fetch 200,000 blocks (in random order) to get those rows.  Since at default settings (random_page_cost = 4, seq_page_cost = 1) fetching 200,000 random blocks is considered as expensive as fetching 800,000 sequential blocks, the index scan you want already looks more expensive than the sequential scan.  But if you want to use the index scan, you also have to fetch the index blocks, which a sequential scan does not need to do.  There are probably about 50,000 index blocks, and each one has to be fetched about 5 times (248,494/50,000).  Because your effective_cache_size is so low, PG assumes that the next time it needs the same block it will no longer be in memory, and so it has to be fetched again at full random_page_cost.
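The index's size in blocks can be read from the same catalog, if you want to verify that ~50,000 figure (using the index name you mention below):

select relpages from pg_class where relname = 'views_visit_id_index';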

 
It is clear that PG does a full table scan: "Seq Scan on views  (cost=0.00..819136.56 rows=17434456 width=8)"

Don't understand why PG doesn't use views_visit_id_index in that query but rather scans the whole table. One explanation I have found is that when the resulting dataset constitutes ~15% of the total number of rows in the table, a seq scan is used.

I don't know where you found that rule of thumb, but it would probably be more accurate if it were given in terms of the percentage of the table's *blocks* scanned, rather than *rows*.



 
In this case the resulting dataset is just 1.5% of the total number of rows.

Since there are about 30 rows per block, scanning 1.5% of the rows means scanning somewhat less than 45% of the blocks (in the worst case each selected row sits in a different block, so 0.015 * 30 ≈ 45%), assuming the rows are randomly distributed over the blocks.  And those blocks would be read in a less efficient, random order.


Postgres 9.2

You are probably getting hit hard by the overly-large "fudge factor" penalty for scans of large indexes, which has been the subject of much recent discussion in regard to 9.2.
 
Ubuntu 12.04.1 LTS
shared_buffers = 4GB; the rest of the settings are the defaults


The default effective_cache_size is almost certainly wrong, and if the analogy to MSSQL is correct, then random_page_cost almost certainly is as well.
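A quick way to see what the planner is currently assuming (these are plain SHOW commands, nothing exotic):

show effective_cache_size;
show random_page_cost;

On 9.2 effective_cache_size defaults to 128MB, which is absurdly small for a box whose shared_buffers alone is 4GB.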

Another poster referred you to the wiki page for suggestions on how to report slow queries, particularly using EXPLAIN (ANALYZE, BUFFERS) rather than merely EXPLAIN.  In this case, I would also try setting enable_hashjoin=off and enable_mergejoin=off in the session, in order to force the planner to use the plan you think you want, so we can see what PG thinks of that one.
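Concretely, something like this in a psql session (the SETs are session-local and won't affect other connections):

set enable_hashjoin = off;
set enable_mergejoin = off;

then re-run your join query prefixed with EXPLAIN (ANALYZE, BUFFERS) and post the plan it produces.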

Cheers,

Jeff
