Nested loop vs merge join: inconsistencies between estimated and actual time

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I've came across this issue while writing report-like query for 2 not
very large tables. I've tried several methods to resolve this one (see
below). But now I'm really stuck...
PostgreSQL 8.3, default configuration

There are 2 tables (structure was simplified to show only problematic
place):
create table c
(
 id bigint primary key
 cdate date
);

create index c_cdate_idx on c (cdate);

create table i
(
 id bigint primary key,
 id_c bigint references c(id)
);

select count(*) from c

count
--------
636 565

select count(*) from i

count
--------
4 646 145

analyze i;
analyze c;

explain analyze
select id
from c
 join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-16'

QUERY
PLAN


---------------------------------------------------------------------------------------------------------------------------------------------------------

Merge Join  (cost=738.95..57864.63 rows=14479 width=8) (actual
time=13954.681..14358.731 rows=14583
loops=1)
 Merge Cond: (i.idc =
c.id)

 ->  Index Scan using fki_i_c_fk on i  (cost=0.00..194324.34
rows=4646145 width=8) (actual time=17.254..12061.414 rows=1042599 loops=1)
 ->  Sort  (cost=738.94..756.88 rows=7178 width=8) (actual
time=53.942..75.013 rows=14583
loops=1)
       Sort Key:
c.id

       Sort Method:  quicksort  Memory:
404kB

       ->  Index Scan using c_cdate_idx on c  (cost=0.00..279.21
rows=7178 width=8) (actual time=23.595..41.470 rows=7064 loops=1)
             Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
Total runtime: 14379.461
ms


set enable_mergejoin to off;
set enable_hashjoin to off;

QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop  (cost=0.00..59833.70 rows=14479 width=8) (actual
time=0.129..153.038 rows=14583
loops=1)
 ->  Index Scan using  c_cdate_idx on c  (cost=0.00..279.21 rows=7178
width=8) (actual time=0.091..14.468 rows=7064 loops=1)
       Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
 ->  Index Scan using fki_i_c_fk on i  (cost=0.00..8.13 rows=13
width=8) (actual time=0.007..0.011 rows=2 loops=7064)
       Index Cond: (i.idc =
c.id)

Total runtime: 172.599 ms

Ok, the first problem is here:
 ->  Index Scan using fki_i_c_fk on i  (cost=0.00..8.13 rows=13
width=8) (actual time=0.007..0.011 rows=2 loops=7064)

I collected statistics for these tables at level 1000 for all columns.

select attname, null_frac, avg_width, n_distinct, correlation
from pg_stats
where tablename = 'i'

attname     null_frac           avg_width     n_distinct
correlation
----------  ------------------  ------------  -------------
------------------
id     0                   8             -1             0,9999849796295166
idc    0,7236369848251343  8             95583          0,999763011932373

Nice stats except of n_distinct for idc column.

select count(distinct idc)
from i

count
--------
633 864

Of course it is not correct solution but...

update pg_statistic
set stadistinct = 633864
where starelid = ... and staattnum = ...

Reconnect and execute:

explain analyze
select id
from c
 join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-16'

QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop  (cost=0.00..57342.39 rows=14479 width=8) (actual
time=0.133..151.426 rows=14583
loops=1)
 ->  Index Scan using c_cdate_idx on c  (cost=0.00..279.21 rows=7178
width=8) (actual time=0.094..14.242 rows=7064 loops=1)
       Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-16'::date))
 ->  Index Scan using fki_i_c_fk on i  (cost=0.00..7.92 rows=2 width=8)
(actual time=0.007..0.011 rows=2 loops=7064)
       Index Cond: (i.idc =
c.id)

Total runtime: 170.911
ms


But the reason of this issue is not the incorrect value of n_distinct.
Let's expand dates interval in WHERE clause.


explain analyze
select id
from c
 join i on i.idc = c.id
where c.cdate between '2007-02-01' and '2007-02-19'

QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------

Merge Join  (cost=831.16..57981.98 rows=16155 width=8) (actual
time=11691.156..12155.201 rows=16357
loops=1)
 Merge Cond: (i.idc =
c.id)

 ->  Index Scan using fki_i_c_fk on i  (cost=0.00..194324.34
rows=4646145 width=8) (actual time=22.236..9928.489 rows=1044373 loops=1)
 ->  Sort  (cost=831.15..851.17 rows=8009 width=8) (actual
time=31.660..55.277 rows=16357
loops=1)
       Sort Key:
c.id

       Sort Method:  quicksort  Memory:
438kB

       ->  Index Scan using c_cdate_idx on c  (cost=0.00..311.87
rows=8009 width=8) (actual time=0.116..17.050 rows=7918 loops=1)
             Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-19'::date))
Total runtime: 12178.678 ms

set enable_mergejoin to off;
set enable_hashjoin to off;

QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------

Nested Loop  (cost=0.00..63724.20 rows=16155 width=8) (actual
time=0.131..171.292 rows=16357
loops=1)
 ->  Index Scan using c_cdate_idx on c  (cost=0.00..311.87 rows=8009
width=8) (actual time=0.093..15.906 rows=7918 loops=1)
       Index Cond: ((cdate >= '2007-02-01'::date) AND (cdate <=
'2007-02-19'::date))
 ->  Index Scan using fki_i_c_fk on i  (cost=0.00..7.89 rows=2 width=8)
(actual time=0.007..0.011 rows=2 loops=7918)
       Index Cond: (i.idc =
c.id)

Total runtime: 193.221 ms

Why nested loop is overestimated here (63000 estimated = 171 actual,
58000 estimated = 12155 actual).


--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux