Andrew Schmidt wrote:
> However, I've run into a problem where one query took about twice as
> long as under InnoDB. After some investigation and playing around with
> the query, I ended up with a slightly different query that was about 3
> times as fast as InnoDB (and 5 times faster than the original query).
> I didn't add any indices, and EXPLAIN ANALYZE doesn't show any new
> indices being used with the new query. The query planner changed the
> order in which the tables are read, and it also uses a HashAggregate
> instead of a GroupAggregate... not entirely sure what that means.
It's a different method of aggregating the data (calculating your sum()).
A HashAggregate puts values into buckets keyed on the grouping columns,
and works best when you have a fairly small number of different values
(so each value can get its own bucket and they all fit in memory); a
GroupAggregate sorts the rows first and then totals each run of equal
keys. I think, anyway - bear in mind I'm just another user, so I'm not
familiar with the internals.
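If you want to see the two strategies side by side, you can turn hash
aggregation off for your session and re-run the EXPLAIN ANALYZE. An
untested sketch - enable_hashagg is a standard planner setting, but the
query below is just a stand-in for yours:

  SET enable_hashagg = off;  -- planner falls back to sort + GroupAggregate
  EXPLAIN ANALYZE
    SELECT team_id, SUM(goals) FROM team_players GROUP BY team_id;
  RESET enable_hashagg;      -- restore the default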
The different plans are because PG thinks it will get different numbers
of rows...
SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
FROM team t
JOIN team_players tp ON tp.team_id = t.team_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id, is_goalie;
The EXPLAIN ANALYZE of this query shows:

  -> Index Scan using team_pool_id on team t  (cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313 loops=1)
       Index Cond: (pool_id = 21699)
With this index scan you have one condition and PG expects 391 matching
rows (it gets 313, so that's a pretty good estimate).
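That estimate comes from the column statistics that ANALYZE gathers. If
you're curious where the 391 comes from, you can peek at the standard
pg_stats view - purely diagnostic, nothing you need to change:

  SELECT n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
  WHERE tablename = 'team' AND attname = 'pool_id';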
SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS total_goals
FROM team t
JOIN team_players tp ON tp.team_id = t.team_id
                    AND t.pool_id = 21699   <--- ** added pool_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id, is_goalie;
The EXPLAIN ANALYZE of this one shows:

  -> Index Scan using team_pool_id on team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556 rows=313 loops=1)
       Index Cond: ((pool_id = 21699) AND (pool_id = 21699))
Here it sees two conditions and expects them to restrict things further.
PG clearly doesn't recognise that it's the same condition twice; it
presumably treats them as two independent tests, something like
"a > -4 AND a < 4", and multiplies their selectivities together. That
actually makes the row estimate much worse (5 expected vs 313 actual).
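You can see the effect in isolation - untested, but the duplicated
predicate on its own should show the same shrunken estimate while
returning exactly the same rows:

  -- Compare the estimated row counts in the two plans:
  EXPLAIN SELECT * FROM team WHERE pool_id = 21699;
  EXPLAIN SELECT * FROM team WHERE pool_id = 21699 AND pool_id = 21699;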
So - if PG is getting this bit wrong, why is query 2 faster?
Looking at the row estimates for case 1, we can see that the final steps
are badly wrong.
> GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual time=1186.973..1432.548 rows=626 loops=1)
Here it's got the number of rows badly wrong - 44,453 estimated against
626 actual. If you really were going to end up with 44,000 groups then
maybe this would be the best plan. This is the last step though, so what
happens before it?
>   -> Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual time=1186.237..1309.562 rows=40928 loops=1)
>        Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)
The estimate is about right here (44,453 expected, 40,928 actual). So the
problem must be in the last step: the planner expects 44,453 groups out
of 44,453 sorted rows - in other words, it thinks no summarising is
actually going to take place.
I think the "position = 'G'" bit is confusing it. Well, we can test that:
1. Just group by "pl.position" - remove the test. Do the estimated and
actual rows match now?
2. Rewrite the position='G' as a CASE structure
3. Do the position='G' part in a sub-query and join to that.
The results of those should show whether this is actually the problem.
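Untested sketches of those three variants (I'm guessing at your schema
from the query above, so adjust names as needed):

  -- 1. Group on the raw column instead of the comparison:
  SELECT tp.team_id, pl.position, SUM(goals) AS total_goals
  FROM team t
  JOIN team_players tp ON tp.team_id = t.team_id
  JOIN player_updates pu ON pu.player_id = tp.player_id
  JOIN player pl ON pl.player_id = pu.player_id
  WHERE t.pool_id = 21699
  GROUP BY tp.team_id, pl.position;

  -- 2. Express the test as a CASE instead of a bare boolean expression:
  SELECT tp.team_id,
         CASE WHEN pl.position = 'G' THEN true ELSE false END AS is_goalie,
         SUM(goals) AS total_goals
  FROM team t
  JOIN team_players tp ON tp.team_id = t.team_id
  JOIN player_updates pu ON pu.player_id = tp.player_id
  JOIN player pl ON pl.player_id = pu.player_id
  WHERE t.pool_id = 21699
  GROUP BY tp.team_id, is_goalie;

  -- 3. Compute is_goalie in a sub-query and join to that:
  SELECT tp.team_id, g.is_goalie, SUM(goals) AS total_goals
  FROM team t
  JOIN team_players tp ON tp.team_id = t.team_id
  JOIN player_updates pu ON pu.player_id = tp.player_id
  JOIN (SELECT player_id, position = 'G' AS is_goalie FROM player) g
       ON g.player_id = pu.player_id
  WHERE t.pool_id = 21699
  GROUP BY tp.team_id, g.is_goalie;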
--
Richard Huxton
Archonet Ltd