Search Postgresql Archives

Re: Two slightly different queries produce same results,

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Richard,

While in the process of responding to your email I found out what was doing it. It was the grouping of the team_players team_id instead of the team team_id.

Here's some table numbers:

team => 31720 rows,
team_players => 464896 rows,
player => 948 rows
player_updates => 5414 rows

The query:
SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
FROM
team t JOIN team_players tp ON tp.team_id = t.team_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id;


It's grouping by the team_players tp.team_id, change the group to t.team_id and bam! 5x faster.

thanks for the help,

-- Andrew


*******
This was my response to you Richard before I found out what it was.
*******
I removed position from the query, but still got the same query times. I summed up the pl.player_id just to make sure it was reading the player table.


EXPLAIN ANALYSE
SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
FROM
team t JOIN team_players tp ON tp.team_id = t.team_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=8584.45..9364.83 rows=22390 width=12) (actual time=894.364..1140.884 rows=313 loops=1) -> Sort (cost=8584.45..8695.59 rows=44453 width=12) (actual time=893.373..1013.989 rows=40928 loops=1)
        Sort Key: tp.team_id
-> Hash Join (cost=174.52..4701.14 rows=44453 width=12) (actual time=50.433..328.788 rows=40928 loops=1)
              Hash Cond: ("outer".player_id = "inner".player_id)
-> Hash Join (cost=30.85..3840.72 rows=5731 width=12) (actual time=7.279..110.578 rows=5005 loops=1)
                    Hash Cond: ("outer".player_id = "inner".player_id)
-> Nested Loop (cost=0.00..3723.90 rows=5731 width=8) (actual time=0.106..69.528 rows=5005 loops=1) -> Index Scan using team_pool_id on team t (cost=0.00..397.90 rows=391 width=4) (actual time=0.058..2.313 rows=313 loops=1)
                                Index Cond: (pool_id = 21699)
-> Index Scan using team_id_asdas on team_players tp (cost=0.00..8.24 rows=21 width=8) (actual time=0.027..0.122 rows=16 loops=313)
                                Index Cond: (tp.team_id = "outer".team_id)
-> Hash (cost=28.48..28.48 rows=948 width=4) (actual time=7.121..7.121 rows=948 loops=1) -> Seq Scan on player pl (cost=0.00..28.48 rows=948 width=4) (actual time=0.042..3.930 rows=948 loops=1) -> Hash (cost=130.14..130.14 rows=5414 width=8) (actual time=43.070..43.070 rows=5414 loops=1) -> Seq Scan on player_updates pu (cost=0.00..130.14 rows=5414 width=8) (actual time=0.013..22.410 rows=5414 loops=1)
Total runtime: 1144.059 ms
(17 rows)

---
VS fast query

EXPLAIN ANALYSE
SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
FROM
team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id = 21699
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=270.31..278.80 rows=566 width=12) (actual time=409.428..410.551 rows=313 loops=1) -> Hash Join (cost=89.66..266.06 rows=566 width=12) (actual time=120.341..278.310 rows=40928 loops=1)
        Hash Cond: ("outer".player_id = "inner".player_id)
-> Seq Scan on player_updates pu (cost=0.00..130.14 rows=5414 width=8) (actual time=0.019..16.833 rows=5414 loops=1) -> Hash (cost=89.48..89.48 rows=73 width=12) (actual time=120.209..120.209 rows=5005 loops=1) -> Hash Join (cost=53.16..89.48 rows=73 width=12) (actual time=78.672..102.434 rows=5005 loops=1)
                    Hash Cond: ("outer".player_id = "inner".player_id)
-> Seq Scan on player pl (cost=0.00..28.48 rows=948 width=4) (actual time=0.040..3.092 rows=948 loops=1) -> Hash (cost=52.97..52.97 rows=73 width=8) (actual time=78.577..78.577 rows=5005 loops=1) -> Nested Loop (cost=0.00..52.97 rows=73 width=8) (actual time=0.100..57.729 rows=5005 loops=1) -> Index Scan using team_pool_id on team t (cost=0.00..10.44 rows=5 width=4) (actual time=0.050..1.591 rows=313 loops=1) Index Cond: ((pool_id = 21699) AND (pool_id = 21699)) -> Index Scan using team_id_asdas on team_players tp (cost=0.00..8.24 rows=21 width=8) (actual time=0.021..0.099 rows=16 loops=313) Index Cond: (tp.team_id = "outer".team_id)
Total runtime: 411.629 ms
(15 rows)

Time: 417.726 ms

I ran ANALYZE on each of the tables, just to be sure, but there was no changes.

Also, here's some table info:

team => 31720 rows,
team_players => 464896 rows,
player => 948 rows
player_updates => 5414 rows


thanks,

-- Andrew

The explain analyze:
-> Index Scan using team_pool_id on team t (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556 rows=313 loops=1) Index Cond: ((pool_id = 21699) AND (pool_id = 21699))


Here it sees two conditions, and PG is expecting that will restrict things further. It's clearly not expecting the same condition twice, but is presumably treating it something like "a>-4 AND a<4". This actually makes the row estimate wrong.

So - if PG is getting this bit wrong why is query 2 faster?
Looking at the row estimates for case 1 we can see that the final cases are badly wrong.

> GroupAggregate  (cost=8742.52..9853.85 rows=44453 width=13) (actual
> time=1186.973..1432.548 rows=626 loops=1)

Here it's got the number of rows badly wrong - if you actually ended up with 44,000 rows then maybe this would be the best plan. This is the last step though, so what happens before?

>   ->  Sort  (cost=8742.52..8853.66 rows=44453 width=13) (actual
> time=1186.237..1309.562 rows=40928 loops=1)
>         Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)

We're about right here. The problem must be in the last step, and looking at the row estimates it seems to think that no summarising is actually going to take place.

I think the "position = 'G'" bit is confusing it. Well, we can test that:

1. Just group by "pl.position" - remove the test. Do the estimated and actual rows match now?

2. Rewrite the position='G' as a CASE structure

3. Do the position='G' part in a sub-query and join to that.




The results of those should show whether this is actually the problem.
--
  Richard Huxton
  Archonet Ltd

.


--
_____________________________________________
G l o b a l   D i a g n o s t i c s    I n c.
Andrew Schmidt            t.416-304-0049 x206
aschmidt@xxxxxxxxxxxxx         f.866-697-8726
_____________________________________________

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux