On ons, 2011-08-31 at 10:42 +0300, Peter Eisentraut wrote:
> I don't have an answer for you, but this report looks suspiciously
> similar to the one I posted the other day at
> <http://archives.postgresql.org/pgsql-hackers/2011-08/msg01224.php>,
> which, now that I think about it, also manifested itself after the
> upgrade to 8.4.8.

See this thread:
http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php

It looks like there are a number of users affected by this.

> On tis, 2011-08-30 at 15:24 -0700, Ben Chobot wrote:
> > We recently took a copy of our production data (running on 8.4.2),
> > scrubbed many data fields, and then loaded it onto a qa server
> > (running 8.4.8). We're seeing some odd planner performance that I
> > think might be a bug, though I'm hoping it's just idiocy on my part.
> > I've analyzed things and looked into pg_stats and it seems as if the
> > relevant columns have about the same statistics.
> >
> > I've managed to simplify the query, but if I make it any simpler, then
> > the two servers end up with the same good plan. The query is down to:
> >
> > SELECT machines.quota_purchased
> > FROM machines
> > WHERE NOT deleted AND machines.user_id IN (
> > SELECT id FROM users WHERE user_group_id IN (
> > select 607547 offset 0
> > ) OFFSET 0
> > );
> >
> > (Those "offset 0" are in there to protect us from planner regressions
> > we saw when moving to 8.4. When we move to 9, they can hopefully go
> > away.)
> >
> > On the production server, this returns a fairly accurate plan:
> >
> >                                                                       QUERY PLAN
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------
> >  Nested Loop  (cost=843.59..1447.90 rows=243 width=8) (actual time=0.044..0.045 rows=1 loops=1)
> >    ->  HashAggregate  (cost=843.59..845.59 rows=200 width=4) (actual time=0.027..0.027 rows=1 loops=1)
> >          ->  Limit  (cost=0.02..823.90 rows=1575 width=4) (actual time=0.024..0.025 rows=1 loops=1)
> >                ->  Nested Loop  (cost=0.02..823.90 rows=1575 width=4) (actual time=0.023..0.024 rows=1 loops=1)
> >                      ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
> >                            ->  Limit  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
> >                                  ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
> >                      ->  Index Scan using users_user_groups_idx on users  (cost=0.00..804.18 rows=1575 width=8) (actual time=0.017..0.018 rows=1 loops=1)
> >                            Index Cond: (users.user_group_id = (607547))
> >    ->  Index Scan using machines_sid_un on machines  (cost=0.00..3.00 rows=1 width=12) (actual time=0.015..0.015 rows=1 loops=1)
> >          Index Cond: (machines.user_id = users.id)
> >  Total runtime: 0.121 ms
> > (12 rows)
> >
> > On the QA server, things are not so accurate. It doesn't hurt the
> > timing of this simplified query much, but when put into the actual
> > query, the row estimation being off by 6 orders of magnitude really
> > throws the planning in the wrong direction.
> > The plan on the QA server is:
> >
> >                                                                        QUERY PLAN
> > -----------------------------------------------------------------------------------------------------------------------------------------------------------
> >  Nested Loop  (cost=1887.16..3671.20 rows=1192462 width=8) (actual time=0.049..0.051 rows=1 loops=1)
> >    ->  HashAggregate  (cost=1887.16..1889.16 rows=200 width=4) (actual time=0.032..0.033 rows=1 loops=1)
> >          ->  Limit  (cost=0.02..1868.20 rows=1517 width=4) (actual time=0.027..0.029 rows=1 loops=1)
> >                ->  Nested Loop  (cost=0.02..1868.20 rows=1517 width=4) (actual time=0.027..0.028 rows=1 loops=1)
> >                      ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
> >                            ->  Limit  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
> >                                  ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
> >                      ->  Index Scan using users_user_groups_idx on users  (cost=0.00..1849.20 rows=1517 width=8) (actual time=0.015..0.016 rows=1 loops=1)
> >                            Index Cond: (users.user_group_id = (607547))
> >    ->  Index Scan using machines_sid_un on machines  (cost=0.00..8.90 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
> >          Index Cond: (machines.user_id = users.id)
> >  Total runtime: 0.148 ms
> > (12 rows)
> >
> > The problem here (I think) seems to be that the QA server believes
> > that running a nested loop over 200 users.id values and joining that
> > against machines.user_id will result in >1M rows. The production
> > server sees this more accurately as the nearly 1:1 relationship that
> > it is.
> >
> > The reason I wonder if this might be a bug is that if I change the
> > obtuse clause "WHERE user_group_id IN (select 607547 offset 0)" to
> > simply "where user_group_id in (607547)", then the plan collapses to
> > the same plan on both servers:
> >
> > explain analyze SELECT machines.quota_purchased
> > FROM machines
> > WHERE NOT deleted AND machines.user_id IN (
> > SELECT id FROM users WHERE user_group_id IN (
> > 607547
> > ) OFFSET 0
> > );
> >
> >                                                                  QUERY PLAN
> > -------------------------------------------------------------------------------------------------------------------------------------------------
> >  Nested Loop  (cost=82.27..636.70 rows=62 width=8) (actual time=0.042..0.043 rows=1 loops=1)
> >    ->  HashAggregate  (cost=82.27..82.88 rows=61 width=4) (actual time=0.024..0.025 rows=1 loops=1)
> >          ->  Limit  (cost=0.00..81.51 rows=61 width=4) (actual time=0.017..0.018 rows=1 loops=1)
> >                ->  Index Scan using users_user_groups_idx on users  (cost=0.00..81.51 rows=61 width=4) (actual time=0.015..0.016 rows=1 loops=1)
> >                      Index Cond: (user_group_id = 607547)
> >    ->  Index Scan using machines_sid_un on machines  (cost=0.00..9.07 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
> >          Index Cond: (machines.user_id = users.id)
> >  Total runtime: 0.106 ms
> > (8 rows)
> >
> > But, as I understand it, that remaining OFFSET 0 should keep the
> > planner from caring whether it's a single value or the result of a
> > subselect.
> >
> > Anyway, anything I can check? The hardware is different, so the
> > configuration files are different, but they're not that different in
> > anything other than effective_cache_size. Certainly not different in
> > any way I can think of that would affect this kind of planning
> > mis-estimation.
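
Ben says the pg_stats numbers look about the same on both boxes, but it
still seems worth pulling the join columns up side by side, since even a
modest difference in n_distinct there can move a join estimate a long
way. A quick comparison to run on both servers could look something like
this (just a sketch, using the table and column names visible in the
plans above):

    -- compare per-column statistics for the join columns on both servers
    SELECT tablename, attname, null_frac, n_distinct, correlation
    FROM pg_stats
    WHERE tablename IN ('machines', 'users')
      AND attname IN ('user_id', 'id', 'user_group_id')
    ORDER BY tablename, attname;

Both plans estimate the HashAggregate at the planner's default guess of
200 rows, so the gap in the final join estimate (243 vs. 1192462)
presumably comes from how each release turns that into a selectivity
against machines.user_id rather than from the statistics themselves.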
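
As a stopgap, and purely as an untested suggestion, raising the
statistics target on the join columns and re-analyzing sometimes
tightens up estimates like this (the target of 1000 below is arbitrary):

    -- hypothetical stopgap: more detailed statistics on the join columns
    ALTER TABLE machines ALTER COLUMN user_id SET STATISTICS 1000;
    ALTER TABLE users ALTER COLUMN user_group_id SET STATISTICS 1000;
    ANALYZE machines;
    ANALYZE users;

If this really is the same estimation change discussed in the
pgsql-performance thread above, though, better statistics may not make
much difference.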