Two slightly different queries produce same results, one takes 1/4 the time.. bug in planner?

Andrew Schmidt <andrew@xxxxxxxxxxxxx> · Thu, 10 Nov 2005 17:08:52 -0500

Summary:

Two queries that are nearly identical AND use the exact same indices 
take much different times in execution.   Both queries produce the exact 
same results.  One takes 1072ms to execute, the other 262ms.  Although I 
have a fix, it seems more of a hack, and I'd also like to know why it's 
doing it.

Details:

I run a hockey pool website that uses mysql 4x with innodb.  This week 
I've been trying out postgresql with the site to check out the 
performance difference.   Standard select statements seem to be about 
twice as fast with postgres as opposed to innodb.   I really wasn't 
expecting this, but was impressed none-the-less.  I run on FreeBSD 5.4.

However, I've run into a problem where one query took about twice as 
long as innodb.  Some investigation and playing around with the query, I 
ended up with a slightly different query but was about 3 times as fast 
as innodb (and 5 times faster than the original query).  I didn't add 
any indices and the EXPLAIN ANALYZE doesn't show any new indices being 
used with the new query.  The order of which tables are read first was 
changed in the query planner and also it uses HashAggregate vs 
GroupAggregate.. not entirely sure what that means.

Table structure:

team - list of teams for each pool,
team_players - list of player_id's for each team
player - list of nhl players
player_updates - for each nhl player the stats on each day they played 
(ie, crosby on oct 10, got 1 goal and 2 assists)

There are other tables but these are the ones pertinent to my problem.

So a query to find the total goals per team for an entire pool is fairly 
straightforward.  Due to the fact different points are awarded if the 
nhl player is a goalie, I group by team_id / is_goalie.

SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS 
total_goals
FROM
team t JOIN team_players tp ON tp.team_id = t.team_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id,is_goalie;

Explain analyze of this query:

QUERY 
PLAN                                                                            

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual 
time=1186.973..1432.548 rows=626 loops=1)
  ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual 
time=1186.237..1309.562 rows=40928 loops=1)
        Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)
        ->  Hash Join  (cost=174.52..4812.28 rows=44453 width=13) 
(actual time=51.099..400.119 rows=40928 loops=1)
              Hash Cond: ("outer".player_id = "inner".player_id)
              ->  Hash Join  (cost=30.85..3840.72 rows=5731 width=17) 
(actual time=8.303..115.834 rows=5005 loops=1)
                    Hash Cond: ("outer".player_id = "inner".player_id)
                    ->  Nested Loop  (cost=0.00..3723.90 rows=5731 
width=8) (actual time=0.108..71.378 rows=5005 loops=1)
                          ->  Index Scan using team_pool_id on team t  
(cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313 
loops=1)
                                Index Cond: (pool_id = 21699)
                          ->  Index Scan using team_id_asdas on 
team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual 
time=0.029..0.132 rows=16 loops=313)
                                Index Cond: (tp.team_id = "outer".team_id)
                    ->  Hash  (cost=28.48..28.48 rows=948 width=9) 
(actual time=8.140..8.140 rows=948 loops=1)
                          ->  Seq Scan on player pl  (cost=0.00..28.48 
rows=948 width=9) (actual time=0.043..4.470 rows=948 loops=1)
              ->  Hash  (cost=130.14..130.14 rows=5414 width=8) (actual 
time=42.690..42.690 rows=5414 loops=1)
                    ->  Seq Scan on player_updates pu  
(cost=0.00..130.14 rows=5414 width=8) (actual time=0.013..21.410 
rows=5414 loops=1)
Total runtime: 1436.617 ms
(17 rows)

Time: 1443.127 ms

I tried out enforcing the planner to use the indices on player_updates 
and player, but the query took longer. 

Now the slight modification where I put the pool_id in the JOIN of team 
and team_players (but also keep the check in the where clause):

SELECT tp.team_id,pl.position = 'G' AS is_goalie,SUM(goals) AS total_goals
FROM
team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id = 
21699  <--- ** added pool_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id,is_goalie;

The explain analyze:

QUERY 
PLAN                                                                               

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=271.72..280.21 rows=566 width=13) (actual 
time=478.783..481.127 rows=626 loops=1)
  ->  Hash Join  (cost=89.66..267.48 rows=566 width=13) (actual 
time=124.015..328.387 rows=40928 loops=1)
        Hash Cond: ("outer".player_id = "inner".player_id)
        ->  Seq Scan on player_updates pu  (cost=0.00..130.14 rows=5414 
width=8) (actual time=0.021..16.934 rows=5414 loops=1)
        ->  Hash  (cost=89.48..89.48 rows=73 width=17) (actual 
time=123.872..123.872 rows=5005 loops=1)
              ->  Hash Join  (cost=53.16..89.48 rows=73 width=17) 
(actual time=79.843..104.031 rows=5005 loops=1)
                    Hash Cond: ("outer".player_id = "inner".player_id)
                    ->  Seq Scan on player pl  (cost=0.00..28.48 
rows=948 width=9) (actual time=0.040..3.103 rows=948 loops=1)
                    ->  Hash  (cost=52.97..52.97 rows=73 width=8) 
(actual time=79.740..79.740 rows=5005 loops=1)
                          ->  Nested Loop  (cost=0.00..52.97 rows=73 
width=8) (actual time=0.098..58.145 rows=5005 loops=1)
                                ->  Index Scan using team_pool_id on 
team t  (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556 
rows=313 loops=1)
                                      Index Cond: ((pool_id = 21699) 
AND (pool_id = 21699))
                                ->  Index Scan using team_id_asdas on 
team_players tp  (cost=0.00..8.24 rows=21 width=8) (actual 
time=0.021..0.099 rows=16 loops=313)
                                      Index Cond: (tp.team_id = 
"outer".team_id)
Total runtime: 482.791 ms
(15 rows)

Time: 489.270 ms

So there are no new indices being used, but the order in which the 
tables are read makes it much faster.  Also if the pool_id check is 
removed from the WHERE clause (so it'd only be in the join clause),  it 
goes back to the slower query planner.
Originally I thought the slowdown was because it wasn't using the 
player_id index on player and player_updates, but that doesn't seem to 
be the case. 

Both queries return the exact same result (626 rows, same stats)

Conclusion:

Is this a bug or my misunderstanding?   If it's a bug I can submit a 
more detailed step by step email.  If it's just my misunderstanding, 
then can someone explain why?  I've always been under the impression the 
join should only contain the fields pertinent to the join.  Also does it 
need the pool_id again in the where clause when it's in the join clause.

thanks a bunch,

-- Andrew

I apologize if the formatting gets messed up.

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

              http://archives.postgresql.org