Re: regression between 8.4.8 and 8.4.2?

I don't have an answer for you, but this report looks suspiciously
similar to the one I posted the other day at
<http://archives.postgresql.org/pgsql-hackers/2011-08/msg01224.php>,
which, now that I think about it, also manifested itself after the
upgrade to 8.4.8.

On Tue, 2011-08-30 at 15:24 -0700, Ben Chobot wrote:
> We recently took a copy of our production data (running on 8.4.2),
> scrubbed many data fields, and then loaded it onto a qa server
> (running 8.4.8). We're seeing some odd planner performance that I
> think might be a bug, though I'm hoping it's just idiocy on my part.
> I've analyzed things and looked into pg_stats and it seems as if the
> relevant columns have about the same statistics. 
> 
> 
> I've managed to simplify the query, but if I make it any simpler, then
> the two servers end up with the same good plan. The query is down to:
> 
> 
> SELECT machines.quota_purchased 
>         FROM machines
>         WHERE NOT deleted AND machines.user_id IN (
>             SELECT id FROM users WHERE user_group_id IN (
>              select 607547 offset 0
>             ) OFFSET 0
>           );
> 
> 
> 
> 
> (Those "offset 0" are in there to protect us from planner regressions
> we saw when moving to 8.4. When we move to 9, they can hopefully go
> away.)
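> 
> For anyone unfamiliar with the trick: a bare OFFSET 0 keeps the
> planner from pulling the subselect up into the outer query, so the
> two forms below (made-up table names, just for illustration) can get
> different plans even though they return the same rows:
> 
>     -- the planner may flatten this IN into a join against u:
>     SELECT * FROM t WHERE x IN (SELECT y FROM u);
>     -- OFFSET 0 forces the subselect to be planned separately:
>     SELECT * FROM t WHERE x IN (SELECT y FROM u OFFSET 0);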
> 
> 
> On the production server, this returns a fairly accurate plan:
> 
> 
> 
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=843.59..1447.90 rows=243 width=8) (actual time=0.044..0.045 rows=1 loops=1)
>    ->  HashAggregate  (cost=843.59..845.59 rows=200 width=4) (actual time=0.027..0.027 rows=1 loops=1)
>          ->  Limit  (cost=0.02..823.90 rows=1575 width=4) (actual time=0.024..0.025 rows=1 loops=1)
>                ->  Nested Loop  (cost=0.02..823.90 rows=1575 width=4) (actual time=0.023..0.024 rows=1 loops=1)
>                      ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
>                            ->  Limit  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
>                                  ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
>                      ->  Index Scan using users_user_groups_idx on users  (cost=0.00..804.18 rows=1575 width=8) (actual time=0.017..0.018 rows=1 loops=1)
>                            Index Cond: (users.user_group_id = (607547))
>    ->  Index Scan using machines_sid_un on machines  (cost=0.00..3.00 rows=1 width=12) (actual time=0.015..0.015 rows=1 loops=1)
>          Index Cond: (machines.user_id = users.id)
>  Total runtime: 0.121 ms
> (12 rows)
> 
> 
> 
> 
> On the QA server, things are not so accurate. It doesn't hurt the
> timing of this simplified query much, but when put into the actual
> query, the row estimation being off by 6 orders of magnitude really
> throws the planning in the wrong direction. The plan on the QA server
> is:
> 
> 
> 
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=1887.16..3671.20 rows=1192462 width=8) (actual time=0.049..0.051 rows=1 loops=1)
>    ->  HashAggregate  (cost=1887.16..1889.16 rows=200 width=4) (actual time=0.032..0.033 rows=1 loops=1)
>          ->  Limit  (cost=0.02..1868.20 rows=1517 width=4) (actual time=0.027..0.029 rows=1 loops=1)
>                ->  Nested Loop  (cost=0.02..1868.20 rows=1517 width=4) (actual time=0.027..0.028 rows=1 loops=1)
>                      ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
>                            ->  Limit  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
>                                  ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
>                      ->  Index Scan using users_user_groups_idx on users  (cost=0.00..1849.20 rows=1517 width=8) (actual time=0.015..0.016 rows=1 loops=1)
>                            Index Cond: (users.user_group_id = (607547))
>    ->  Index Scan using machines_sid_un on machines  (cost=0.00..8.90 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
>          Index Cond: (machines.user_id = users.id)
>  Total runtime: 0.148 ms
> (12 rows)
> 
> 
> 
> 
> 
> 
> The problem here (I think) seems to be that the QA server believes
> that running a nested loop over 200 users.id values and joining that
> against machines.user_id will result in >1M rows. The production
> server sees this more accurately as the nearly 1:1 relationship that
> it is.
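> 
> For what it's worth, that join estimate should be driven mostly by
> the n_distinct statistics on the join columns, so one thing worth
> comparing between the two servers is:
> 
>     SELECT tablename, attname, n_distinct, null_frac
>         FROM pg_stats
>         WHERE (tablename, attname) IN (('machines', 'user_id'),
>                                        ('users', 'id'));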
> 
> 
> The reason I wonder if this might be a bug is because if I change the
> obtuse clause "WHERE user_group_id IN (select 607547 offset 0)" to
> simply "where user_group_id in (607547)" then the plan collapses to
> the same plan on both servers:
> 
> 
> explain analyze SELECT machines.quota_purchased
>         FROM machines
>         WHERE NOT deleted AND machines.user_id IN (
>             SELECT id FROM users WHERE user_group_id IN (
>              607547         
>             ) OFFSET 0
>           );
> 
> 
> 
> 
> 
>  QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=82.27..636.70 rows=62 width=8) (actual time=0.042..0.043 rows=1 loops=1)
>    ->  HashAggregate  (cost=82.27..82.88 rows=61 width=4) (actual time=0.024..0.025 rows=1 loops=1)
>          ->  Limit  (cost=0.00..81.51 rows=61 width=4) (actual time=0.017..0.018 rows=1 loops=1)
>                ->  Index Scan using users_user_groups_idx on users  (cost=0.00..81.51 rows=61 width=4) (actual time=0.015..0.016 rows=1 loops=1)
>                      Index Cond: (user_group_id = 607547)
>    ->  Index Scan using machines_sid_un on machines  (cost=0.00..9.07 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
>          Index Cond: (machines.user_id = users.id)
>  Total runtime: 0.106 ms
> (8 rows)
> 
> 
> 
> 
> But, as I understand it, that remaining OFFSET 0 should keep the
> planner from caring whether it's a single value or the result of a
> subselect.
> 
> 
> 
> 
> Anyway, is there anything I can check? The hardware is different, so
> the configuration files differ too, but in little beyond
> effective_cache_size, and certainly in nothing I can think of that
> would affect this kind of planner mis-estimation.
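> 
> One thing I plan to try is raising the statistics target on the join
> columns and re-analyzing both sides, something like:
> 
>     ALTER TABLE machines ALTER COLUMN user_id SET STATISTICS 1000;
>     ALTER TABLE users ALTER COLUMN id SET STATISTICS 1000;
>     ANALYZE machines;
>     ANALYZE users;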




-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

