self join, parameterized base/join rel path row estimation and generally...

Bongseo Jang <graycells@xxxxxxxxx> · Mon, 27 Jan 2014 14:58:41 +0900

Hi,

I have two plans of a query.
nestloop plan is much faster, but planner chose the slower one, hashjoin.

  http://explain.depesz.com/s/Aqs

  http://explain.depesz.com/s/97C

it seems that rows=39698995  are quite overestimated.

-> Nested Loop (cost=0.000..5403.600 rows=39698995 width=45) (actual time=0.392..14.817 rows=943 loops=1)

   -> Nested Loop (cost=0.000..17.600 rows=1 width=8) (actual time=0.241..0.246 rows=1 loops=1)
         -> Index Scan using seven on hotel three (cost=0.000..6.880 rows=1 width=6) (actual time=0.113..0.115 rows=1 loops=1)

                Index Cond: (two = 31750::numeric)
         -> Index Scan using echo on oscar_foxtrot november (cost=0.000..10.710 rows=1 width=14) (actual time=0.117..0.118 rows=1 loops=1)
                 Index Cond: (charlie = three.golf)

   -> Index Scan using zulu on oscar_foxtrot juliet (cost=0.000..3849.200 rows=153679 width=45) (actual time=0.147..14.241 rows=943 loops=1)
         Index Cond: ((uniform_yankee = november.uniform_yankee) AND (uniform_victor = november.uniform_victor))

pg_stats is like this;
> select attname, null_frac, n_distinct, most_common_vals, most_common_freqs from pg_stats where tablename like 'oscar_foxtrot%' and (attname = 'uniform_yankee' or attname = 'uniform_victor')
"uniform_yankee";0;12;"{83886082,83886085}";"{0.9742,0.02}"
"uniform_victor";0;23;"{1342767106,1342308357}";"{0.973467,0.02}"

I assumed that nestloop rows would be more or less inner_path_rows * outer_path_rows with good pg_stats, and good plan could come based on it.

the plan above is not that case. Suspcious of 40 million rows and small number of values(actually two values) making up 98% of distribution. 
so.. I looked up some code and found that rows=153679 is rows of parameterized base rel estimated by eqsel(), and row=39698995 is rows of parameterized join rel by eqjoinsel().
I think wrong plan above comes from the fact that the two estimation cannot be close in general, great difference in my case.

where am i wrong and right? 
Is there recommended approach, related issue, commit and so on i can follow ?

thanks

[[nstallation Info]]
PostgreSQL-9.2.5 (via postgresql yum repository)
OS: Centos 6.3 (custom linux-3.10.12 kernel)
postgresql.conf:
    effective_cache_size = 10000MB
    shared_buffers = 1000MB
    work_mem = 100MB

    maintenance_work_mem = 100MB
HW: CPU 4-core Xeon x 2 sockets, RAM 256GB 

--
Regards,
Jang.

 a sound mind in a sound body