Andrew Schmidt wrote:
> However, I've run into a problem where one query took about twice as
> long as under InnoDB. After some investigation and playing around with
> the query, I ended up with a slightly different query that was about 3
> times as fast as InnoDB (and 5 times faster than the original query).
> I didn't add any indices, and EXPLAIN ANALYZE doesn't show any new
> indices being used with the new query. The query planner changed the
> order in which the tables are read, and it also uses a HashAggregate
> instead of a GroupAggregate... not entirely sure what that means.
It's a different method of aggregating the data (calculating your sum()).
A HashAggregate puts values into buckets keyed on the grouping columns,
and works best when you have a fairly small number of different values
(so each value can get its own bucket and they all fit in memory); a
GroupAggregate sorts the rows first and then totals each run of equal
keys. I think, anyway - bear in mind I'm just another user, so I'm not
familiar with the internals.
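If you want to see the two strategies side by side, you can turn hash
aggregation off for your session and re-run the EXPLAIN ANALYZE. An
untested sketch - enable_hashagg is a standard planner setting, but the
query below is just a stand-in for yours:

  SET enable_hashagg = off;  -- planner falls back to sort + GroupAggregate
  EXPLAIN ANALYZE
    SELECT team_id, SUM(goals) FROM team_players GROUP BY team_id;
  RESET enable_hashagg;      -- restore the default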
The different plans are because PG thinks it will get different numbers
of rows...
SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
FROM team t
JOIN team_players tp ON tp.team_id = t.team_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id, is_goalie;
The EXPLAIN ANALYZE of this query shows:

  -> Index Scan using team_pool_id on team t  (cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313 loops=1)
       Index Cond: (pool_id = 21699)
With this index scan you have one condition and PG expects 391 matching
rows (it gets 313, so that's a pretty good estimate).
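That estimate comes from the column statistics that ANALYZE gathers. If
you're curious where the 391 comes from, you can peek at the standard
pg_stats view - purely diagnostic, nothing you need to change:

  SELECT n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
  WHERE tablename = 'team' AND attname = 'pool_id';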
SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
FROM team t
JOIN team_players tp ON tp.team_id = t.team_id
                    AND t.pool_id = 21699   <--- ** added pool_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id, is_goalie;
The EXPLAIN ANALYZE of this one shows:

  -> Index Scan using team_pool_id on team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556 rows=313 loops=1)
       Index Cond: ((pool_id = 21699) AND (pool_id = 21699))
Here it sees two conditions and expects them to restrict things further.
PG clearly doesn't recognise that it's the same condition twice; it
presumably treats them as two independent tests, something like
"a > -4 AND a < 4", and multiplies their selectivities together. That
actually makes the row estimate much worse (5 expected vs 313 actual).
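You can see the effect in isolation - untested, but the duplicated
predicate on its own should show the same shrunken estimate while
returning exactly the same rows:

  -- Compare the estimated row counts in the two plans:
  EXPLAIN SELECT * FROM team WHERE pool_id = 21699;
  EXPLAIN SELECT * FROM team WHERE pool_id = 21699 AND pool_id = 21699;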
So - if PG is getting this bit wrong, why is query 2 faster?
Looking at the row estimates for case 1, we can see that the final steps
are badly wrong.
> GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual time=1186.973..1432.548 rows=626 loops=1)
Here it's got the number of rows badly wrong - 44,453 estimated against
626 actual. If you really were going to end up with 44,000 groups then
maybe this would be the best plan. This is the last step though, so what
happens before it?
>   -> Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual time=1186.237..1309.562 rows=40928 loops=1)
>        Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)
The estimate is about right here (44,453 expected, 40,928 actual). So the
problem must be in the last step: the planner expects 44,453 groups out
of 44,453 sorted rows - in other words, it thinks no summarising is
actually going to take place.
I think the "position = 'G'" bit is confusing it. Well, we can test that:
1. Just group by "pl.position" - remove the test. Do the estimated and
actual rows match now?
2. Rewrite the position='G' as a CASE structure
3. Do the position='G' part in a sub-query and join to that.
The results of those should show whether this is actually the problem.
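Untested sketches of those three variants (I'm guessing at your schema
from the query above, so adjust names as needed):

  -- 1. Group on the raw column instead of the comparison:
  SELECT tp.team_id, pl.position, SUM(goals) AS total_goals
  FROM team t
  JOIN team_players tp ON tp.team_id = t.team_id
  JOIN player_updates pu ON pu.player_id = tp.player_id
  JOIN player pl ON pl.player_id = pu.player_id
  WHERE t.pool_id = 21699
  GROUP BY tp.team_id, pl.position;

  -- 2. Express the test as a CASE instead of a bare boolean expression:
  SELECT tp.team_id,
         CASE WHEN pl.position = 'G' THEN true ELSE false END AS is_goalie,
         SUM(goals) AS total_goals
  FROM team t
  JOIN team_players tp ON tp.team_id = t.team_id
  JOIN player_updates pu ON pu.player_id = tp.player_id
  JOIN player pl ON pl.player_id = pu.player_id
  WHERE t.pool_id = 21699
  GROUP BY tp.team_id, is_goalie;

  -- 3. Compute is_goalie in a sub-query and join to that:
  SELECT tp.team_id, g.is_goalie, SUM(goals) AS total_goals
  FROM team t
  JOIN team_players tp ON tp.team_id = t.team_id
  JOIN player_updates pu ON pu.player_id = tp.player_id
  JOIN (SELECT player_id, position = 'G' AS is_goalie FROM player) g
       ON g.player_id = pu.player_id
  WHERE t.pool_id = 21699
  GROUP BY tp.team_id, g.is_goalie;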
--
Richard Huxton
Archonet Ltd