On ons, 2011-08-31 at 10:42 +0300, Peter Eisentraut wrote:
> I don't have an answer for you, but this report looks suspiciously
> similar to the one I posted the other day at
> <http://archives.postgresql.org/pgsql-hackers/2011-08/msg01224.php>,
> which, now that I think about it, also manifested itself after the
> upgrade to 8.4.8.

See this thread:
http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php

It looks like there are a number of users affected by this.

> On tis, 2011-08-30 at 15:24 -0700, Ben Chobot wrote:
> > We recently took a copy of our production data (running on 8.4.2),
> > scrubbed many data fields, and then loaded it onto a qa server
> > (running 8.4.8). We're seeing some odd planner performance that I
> > think might be a bug, though I'm hoping it's just idiocy on my part.
> > I've analyzed things and looked into pg_stats and it seems as if the
> > relevant columns have about the same statistics.
> >
> > I've managed to simplify the query, but if I make it any simpler, then
> > the two servers end up with the same good plan. The query is down to:
> >
> > SELECT machines.quota_purchased
> > FROM machines
> > WHERE NOT deleted AND machines.user_id IN (
> > SELECT id FROM users WHERE user_group_id IN (
> > select 607547 offset 0
> > ) OFFSET 0
> > );
> >
> > (Those "offset 0" are in there to protect us from planner regressions
> > we saw when moving to 8.4. When we move to 9, they can hopefully go
> > away.)
> >
> > On the production server, this returns a fairly accurate plan:
> >
> >                                                                       QUERY PLAN
> > ----------------------------------------------------------------------------------------------------------------------------------------------------------
> >  Nested Loop  (cost=843.59..1447.90 rows=243 width=8) (actual time=0.044..0.045 rows=1 loops=1)
> >    ->  HashAggregate  (cost=843.59..845.59 rows=200 width=4) (actual time=0.027..0.027 rows=1 loops=1)
> >          ->  Limit  (cost=0.02..823.90 rows=1575 width=4) (actual time=0.024..0.025 rows=1 loops=1)
> >                ->  Nested Loop  (cost=0.02..823.90 rows=1575 width=4) (actual time=0.023..0.024 rows=1 loops=1)
> >                      ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
> >                            ->  Limit  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
> >                                  ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
> >                      ->  Index Scan using users_user_groups_idx on users  (cost=0.00..804.18 rows=1575 width=8) (actual time=0.017..0.018 rows=1 loops=1)
> >                            Index Cond: (users.user_group_id = (607547))
> >    ->  Index Scan using machines_sid_un on machines  (cost=0.00..3.00 rows=1 width=12) (actual time=0.015..0.015 rows=1 loops=1)
> >          Index Cond: (machines.user_id = users.id)
> >  Total runtime: 0.121 ms
> > (12 rows)
> >
> > On the QA server, things are not so accurate. It doesn't hurt the
> > timing of this simplified query much, but when put into the actual
> > query, the row estimation being off by 6 orders of magnitude really
> > throws the planning in the wrong direction.
> > The plan on the QA server is:
> >
> >                                                                        QUERY PLAN
> > -----------------------------------------------------------------------------------------------------------------------------------------------------------
> >  Nested Loop  (cost=1887.16..3671.20 rows=1192462 width=8) (actual time=0.049..0.051 rows=1 loops=1)
> >    ->  HashAggregate  (cost=1887.16..1889.16 rows=200 width=4) (actual time=0.032..0.033 rows=1 loops=1)
> >          ->  Limit  (cost=0.02..1868.20 rows=1517 width=4) (actual time=0.027..0.029 rows=1 loops=1)
> >                ->  Nested Loop  (cost=0.02..1868.20 rows=1517 width=4) (actual time=0.027..0.028 rows=1 loops=1)
> >                      ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
> >                            ->  Limit  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
> >                                  ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
> >                      ->  Index Scan using users_user_groups_idx on users  (cost=0.00..1849.20 rows=1517 width=8) (actual time=0.015..0.016 rows=1 loops=1)
> >                            Index Cond: (users.user_group_id = (607547))
> >    ->  Index Scan using machines_sid_un on machines  (cost=0.00..8.90 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
> >          Index Cond: (machines.user_id = users.id)
> >  Total runtime: 0.148 ms
> > (12 rows)
> >
> > The problem here (I think) seems to be that the QA server believes
> > that running a nested loop over 200 users.id values and joining that
> > against machines.user_id will result in >1M rows. The production
> > server sees this more accurately as the nearly 1:1 relationship that
> > it is.
> >
> > The reason I wonder if this might be a bug is that if I change the
> > obtuse clause "WHERE user_group_id IN (select 607547 offset 0)" to
> > simply "where user_group_id in (607547)", then the plan collapses to
> > the same plan on both servers:
> >
> > explain analyze SELECT machines.quota_purchased
> > FROM machines
> > WHERE NOT deleted AND machines.user_id IN (
> > SELECT id FROM users WHERE user_group_id IN (
> > 607547
> > ) OFFSET 0
> > );
> >
> >                                                                  QUERY PLAN
> > -------------------------------------------------------------------------------------------------------------------------------------------------
> >  Nested Loop  (cost=82.27..636.70 rows=62 width=8) (actual time=0.042..0.043 rows=1 loops=1)
> >    ->  HashAggregate  (cost=82.27..82.88 rows=61 width=4) (actual time=0.024..0.025 rows=1 loops=1)
> >          ->  Limit  (cost=0.00..81.51 rows=61 width=4) (actual time=0.017..0.018 rows=1 loops=1)
> >                ->  Index Scan using users_user_groups_idx on users  (cost=0.00..81.51 rows=61 width=4) (actual time=0.015..0.016 rows=1 loops=1)
> >                      Index Cond: (user_group_id = 607547)
> >    ->  Index Scan using machines_sid_un on machines  (cost=0.00..9.07 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
> >          Index Cond: (machines.user_id = users.id)
> >  Total runtime: 0.106 ms
> > (8 rows)
> >
> > But, as I understand it, that remaining OFFSET 0 should keep the
> > planner from caring whether it's a single value or the result of a
> > subselect.
> >
> > Anyway, anything I can check? The hardware is different, so the
> > configuration files are different, but they're not that different in
> > anything other than effective_cache_size. Certainly not different in
> > any way I can think of that would affect this kind of planning
> > mis-estimation.
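
Ben says the pg_stats numbers look about the same on both boxes, but it
still seems worth pulling the join columns up side by side, since even a
modest difference in n_distinct there can move a join estimate a long
way. A quick comparison to run on both servers could look something like
this (just a sketch, using the table and column names visible in the
plans above):

    -- compare per-column statistics for the join columns on both servers
    SELECT tablename, attname, null_frac, n_distinct, correlation
    FROM pg_stats
    WHERE tablename IN ('machines', 'users')
      AND attname IN ('user_id', 'id', 'user_group_id')
    ORDER BY tablename, attname;

Both plans estimate the HashAggregate at the planner's default guess of
200 rows, so the gap in the final join estimate (243 vs. 1192462)
presumably comes from how each release turns that into a selectivity
against machines.user_id rather than from the statistics themselves.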
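
As a stopgap, and purely as an untested suggestion, raising the
statistics target on the join columns and re-analyzing sometimes
tightens up estimates like this (the target of 1000 below is arbitrary):

    -- hypothetical stopgap: more detailed statistics on the join columns
    ALTER TABLE machines ALTER COLUMN user_id SET STATISTICS 1000;
    ALTER TABLE users ALTER COLUMN user_group_id SET STATISTICS 1000;
    ANALYZE machines;
    ANALYZE users;

If this really is the same estimation change discussed in the
pgsql-performance thread above, though, better statistics may not make
much difference.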