It's not an issue for me (it's not really impacting performance), but since it was odd I thought I might ask.

I have this supermegaquery:

SELECT t.date AS status_date,
       lu.id AS memberid,
       lu.username AS username,
       u.url AS url,
       ub.url_pattern AS urlpattern,
       lu.email AS email,
       lu.birth_date AS birthdate,
       lu.creation_date AS creationdate,
       s.name AS state,
       co.name AS country,
       opd.survey_id AS originalSurvey,
       c.name AS city,
       lu.confirmed AS confirmed,
       pd.name AS action,
       sd.duration AS loi
FROM tracks t
LEFT JOIN surveyduration_v sd ON sd.member_id = t.member_id
                             AND sd.survey_id = 5936
INNER JOIN all_users_v lu ON lu.id = t.member_id
                         AND lu.panel_source_id = 1
LEFT JOIN track_status ts ON ts.id = t.track_status_id
LEFT JOIN partners p ON p.id = t.partner_id
LEFT JOIN urls u ON u.id = t.url_id AND u.survey_id = 5936
LEFT JOIN url_batchs ub ON u.url_batch_id = ub.id
LEFT JOIN states s ON lu.state_id = s.id
LEFT JOIN cities c ON lu.city_id = c.id
LEFT JOIN countries co ON lu.country_id = co.id
LEFT JOIN partner_deliveries pd ON pd.id = t.partner_delivery_id
                               AND t.partner_id IS NOT NULL
LEFT JOIN partner_deliveries opd ON opd.id = pd.originator_id
WHERE t.survey_id = 5936
  AND t.track_status_id IN (5)

with the views:

CREATE OR REPLACE VIEW surveyduration_v AS
SELECT date_part('epoch'::text, t.date - tl2.date) / 60::double precision AS duration,
       t.member_id,
       t.survey_id
FROM tracks t
JOIN track_logs tl2 ON t.id = tl2.track_id
WHERE tl2.track_status_id = 8
  AND t.track_status_id = 7;

CREATE OR REPLACE VIEW all_users_v AS
SELECT 1 AS panel_source_id,
       livra_users.id,
       livra_users.birth_date,
       livra_users.creation_date,
       livra_users.email,
       livra_users.first_name,
       livra_users.last_name,
       livra_users.username,
       livra_users.image_link,
       livra_users.confirmed,
       livra_users.is_panelist,
       livra_users.unregistered,
       livra_users.reason,
       livra_users.privacy,
       livra_users.sex,
       livra_users.site,
       livra_users.country_id,
       livra_users.state_id,
       livra_users.city_id,
       livra_users.last_activity_date,
       livra_users.partner_id,
       livra_users.survey_id,
       livra_users.panelist_update,
       livra_users.panelist_percentage
FROM livra_users
UNION ALL
SELECT 2 AS panel_source_id,
       - external_users.id AS id,
       NULL::timestamp without time zone AS birth_date,
       external_users.creation_date,
       external_users.email,
       NULL::character varying AS first_name,
       NULL::character varying AS last_name,
       external_users.username,
       NULL::character varying AS image_link,
       true AS confirmed,
       external_users.is_panelist,
       false AS unregistered,
       NULL::integer AS reason,
       0 AS privacy,
       NULL::integer AS sex,
       external_users.site,
       external_users.country_id,
       NULL::integer AS state_id,
       NULL::integer AS city_id,
       NULL::timestamp without time zone AS last_activity_date,
       NULL::integer AS partner_id,
       external_users.survey_id,
       NULL::bigint AS panelist_update,
       NULL::smallint AS panelist_percentage
FROM external_users;

The server is PostgreSQL 9.0.3 running on Linux.

The big tables are tracks, track_logs and urls, all > 30M rows.

One detail that could be related is that tracks.member_id is an undeclared (denormalized) foreign key to livra_users.
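For reference, the per-column statistics the planner works from can be inspected via pg_stats; something like the following quick check on the join columns (note pg_stats only lists real tables, so all_users_v itself has no entry there):

-- Inspect the planner's statistics for the two join columns.
-- The UNION ALL view has no statistics of its own.
SELECT tablename, attname, n_distinct, null_frac
FROM pg_stats
WHERE (tablename, attname) IN (('tracks', 'member_id'),
                               ('livra_users', 'id'));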
The resulting plan is:

Hash Left Join  (cost=51417.93..974563.27 rows=2241518 width=1276)
  Hash Cond: ("*SELECT* 1".country_id = co.id)
  ->  Hash Left Join  (cost=51415.40..941722.50 rows=2241518 width=1271)
        Hash Cond: ("*SELECT* 1".state_id = s.id)
        ->  Hash Left Join  (cost=51373.45..910859.68 rows=2241518 width=1263)
              Hash Cond: (t.partner_delivery_id = pd.id)
              Join Filter: (t.partner_id IS NOT NULL)
              ->  Hash Left Join  (cost=32280.78..854175.26 rows=2241518 width=1256)
                    Hash Cond: ("*SELECT* 1".city_id = c.id)
                    ->  Hash Join  (cost=24183.20..792841.63 rows=2241518 width=1249)
                          Hash Cond: ("*SELECT* 1".id = t.member_id)
                          ->  Append  (cost=0.00..148254.38 rows=3008749 width=168)
                                ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..140223.96 rows=3008748 width=168)
                                      ->  Seq Scan on livra_users  (cost=0.00..110136.48 rows=3008748 width=168)
                                ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..8030.42 rows=1 width=60)
                                      ->  Result  (cost=0.00..8030.41 rows=1 width=60)
                                            One-Time Filter: false
                                            ->  Seq Scan on external_users  (cost=0.00..8030.41 rows=1 width=60)
                          ->  Hash  (cost=24181.34..24181.34 rows=149 width=188)
                                ->  Hash Left Join  (cost=21650.42..24181.34 rows=149 width=188)
                                      Hash Cond: (u.url_batch_id = ub.id)
                                      ->  Nested Loop Left Join  (cost=20828.08..23355.84 rows=149 width=115)
                                            ->  Merge Left Join  (cost=20828.08..20841.04 rows=149 width=44)
                                                  Merge Cond: (t.member_id = t.member_id)
                                                  ->  Sort  (cost=435.90..436.27 rows=149 width=32)
                                                        Sort Key: t.member_id
                                                        ->  Index Scan using idx_tracks_survey_id_track_status_id on tracks t  (cost=0.00..430.52 rows=149 width=32)
                                                              Index Cond: ((survey_id = 5936) AND (track_status_id = 5))
                                                  ->  Sort  (cost=20392.18..20398.28 rows=2440 width=20)
                                                        Sort Key: t.member_id
                                                        ->  Nested Loop  (cost=0.00..20254.90 rows=2440 width=20)
                                                              ->  Index Scan using idx_tracks_survey_id_track_status_id on tracks t  (cost=0.00..2010.03 rows=712 width=20)
                                                                    Index Cond: ((survey_id = 5936) AND (track_status_id = 7))
                                                              ->  Index Scan using idx_track_logs_track_id on track_logs tl2  (cost=0.00..25.59 rows=3 width=16)
                                                                    Index Cond: (tl2.track_id = t.id)
                                                                    Filter: (tl2.track_status_id = 8)
                                            ->  Index Scan using urls_pkey on urls u  (cost=0.00..16.87 rows=1 width=87)
                                                  Index Cond: (u.id = t.url_id)
                                                  Filter: (u.survey_id = 5936)
                                      ->  Hash  (cost=637.15..637.15 rows=14815 width=81)
                                            ->  Seq Scan on url_batchs ub  (cost=0.00..637.15 rows=14815 width=81)
                    ->  Hash  (cost=4578.37..4578.37 rows=281537 width=15)
                          ->  Seq Scan on cities c  (cost=0.00..4578.37 rows=281537 width=15)
              ->  Hash  (cost=18270.17..18270.17 rows=65799 width=19)
                    ->  Hash Left Join  (cost=8842.48..18270.17 rows=65799 width=19)
                          Hash Cond: (pd.originator_id = opd.id)
                          ->  Seq Scan on partner_deliveries pd  (cost=0.00..8019.99 rows=65799 width=19)
                          ->  Hash  (cost=8019.99..8019.99 rows=65799 width=8)
                                ->  Seq Scan on partner_deliveries opd  (cost=0.00..8019.99 rows=65799 width=8)
        ->  Hash  (cost=24.20..24.20 rows=1420 width=16)
              ->  Seq Scan on states s  (cost=0.00..24.20 rows=1420 width=16)
  ->  Hash  (cost=1.68..1.68 rows=68 width=13)
        ->  Seq Scan on countries co  (cost=0.00..1.68 rows=68 width=13)

The curious bit is the estimated rowcount (2241518), which is grossly off. It shows up at the hash join between the all_users_v view and tracks: the tracks side is estimated at 149 rows (more or less on target), and it's a join of an int PK to an int column, so I cannot imagine how that join could result in 2M rows. Even a full cross product of two ~150-row partial results couldn't get that high. What is pg thinking to get to that number?
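One experiment that might narrow it down (an untested sketch, using the same tables as above): since lu.panel_source_id = 1 pins the query to the livra_users branch of the UNION ALL anyway, compare the estimate when joining the base table directly, where the planner can see livra_users.id statistics:

-- Same join, bypassing the UNION ALL view.
EXPLAIN
SELECT t.member_id
FROM tracks t
JOIN livra_users lu ON lu.id = t.member_id
WHERE t.survey_id = 5936
  AND t.track_status_id IN (5);

If that comes out near 149 rows, the misestimation is presumably about how the planner handles the UNION ALL subquery rather than about the join itself.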
Performance isn't impacted; the plan, even with the misestimation, is near optimal. But I can imagine this kind of misestimation wreaking havoc in other situations.
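PS: Doing the arithmetic on the plan's own inputs: the hash join's outer side (the Append over the view) is estimated at 3008749 rows and the inner hash at 149, and 3008749 * 149 * 0.005 = 2241518 to the row. 0.005 is 1/200, and 200 is, if I'm reading the planner defaults right, the n_distinct fallback PostgreSQL uses when it has no statistics for a join column, which would be the case for the columns coming out of the UNION ALL subquery. So my guess is the estimate is default join selectivity applied over the whole 3M-row view output rather than anything derived from the actual data.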