On Fri, Sep 16, 2011 at 17:50, Claudio Freire <klaussfreire@xxxxxxxxx> wrote:
> It's not an issue for me (it's not really impacting performance), but
> since it was odd I thought I might ask.
>
> I have this supermegaquery:
>
> SELECT
>     t.date AS status_date, lu.id AS memberid, lu.username AS username,
>     u.url AS url, ub.url_pattern AS urlpattern, lu.email AS email,
>     lu.birth_date AS birthdate, lu.creation_date AS creationdate,
>     s.name AS state, co.name AS country, opd.survey_id AS originalSurvey,
>     c.name AS city, lu.confirmed AS confirmed, pd.name AS action,
>     sd.duration AS loi
> FROM tracks t
> LEFT JOIN surveyduration_v sd
>     ON sd.member_id = t.member_id
>     AND sd.survey_id = 5936
> INNER JOIN all_users_v lu
>     ON lu.id = t.member_id
>     AND lu.panel_source_id = 1
> LEFT JOIN track_status ts
>     ON ts.id = t.track_status_id
> LEFT JOIN partners p
>     ON p.id = t.partner_id
> LEFT JOIN urls u
>     ON u.id = t.url_id
>     AND u.survey_id = 5936
> LEFT JOIN url_batchs ub
>     ON u.url_batch_id = ub.id
> LEFT JOIN states s
>     ON lu.state_id = s.id
> LEFT JOIN cities c
>     ON lu.city_id = c.id
> LEFT JOIN countries co
>     ON lu.country_id = co.id
> LEFT JOIN partner_deliveries pd
>     ON pd.id = t.partner_delivery_id
>     AND t.partner_id IS NOT NULL
> LEFT JOIN partner_deliveries opd
>     ON opd.id = pd.originator_id
> WHERE t.survey_id = 5936
>     AND t.track_status_id IN (5)
>
> With these views:
>
> CREATE OR REPLACE VIEW surveyduration_v AS
>     SELECT date_part('epoch'::text, t.date - tl2.date) / 60::double precision AS duration,
>            t.member_id, t.survey_id
>     FROM tracks t
>     JOIN track_logs tl2 ON t.id = tl2.track_id
>     WHERE tl2.track_status_id = 8 AND t.track_status_id = 7;
>
> CREATE OR REPLACE VIEW all_users_v AS
>     SELECT 1 AS panel_source_id, livra_users.id,
>            livra_users.birth_date, livra_users.creation_date, livra_users.email,
>            livra_users.first_name, livra_users.last_name, livra_users.username,
>            livra_users.image_link, livra_users.confirmed, livra_users.is_panelist,
>            livra_users.unregistered, livra_users.reason, livra_users.privacy,
>            livra_users.sex, livra_users.site, livra_users.country_id,
>            livra_users.state_id, livra_users.city_id, livra_users.last_activity_date,
>            livra_users.partner_id, livra_users.survey_id, livra_users.panelist_update,
>            livra_users.panelist_percentage
>     FROM livra_users
>     UNION ALL
>     SELECT 2 AS panel_source_id, - external_users.id AS id,
>            NULL::timestamp without time zone AS birth_date,
>            external_users.creation_date, external_users.email,
>            NULL::character varying AS first_name,
>            NULL::character varying AS last_name,
>            external_users.username,
>            NULL::character varying AS image_link,
>            true AS confirmed, external_users.is_panelist, false AS unregistered,
>            NULL::integer AS reason, 0 AS privacy, NULL::integer AS sex,
>            external_users.site, external_users.country_id,
>            NULL::integer AS state_id, NULL::integer AS city_id,
>            NULL::timestamp without time zone AS last_activity_date,
>            NULL::integer AS partner_id, external_users.survey_id,
>            NULL::bigint AS panelist_update,
>            NULL::smallint AS panelist_percentage
>     FROM external_users;
>
> The server is 9.0.3 running on Linux.
>
> The BIG tables are tracks, track_logs and urls, all > 30M rows.
>
> One detail that could be related is that tracks.member_id is an
> undeclared (denormalized) foreign key to livra_users.
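Before even looking at the plan, it is worth checking what the planner has
to work with on that join key. The following is just a diagnostic sketch
against the stock pg_stats view (table and column names taken from your
query); null_frac and n_distinct are among the inputs to the equality-join
selectivity estimate:

-- Column statistics the planner consults when estimating
-- the lu.id = t.member_id join (diagnostic sketch only).
SELECT tablename, attname, null_frac, n_distinct
FROM pg_stats
WHERE (tablename = 'tracks' AND attname = 'member_id')
   OR (tablename = 'livra_users' AND attname = 'id');

If n_distinct for tracks.member_id is badly off, a fresh ANALYZE on tracks
would be the first thing to try.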
> The resulting plan is:
>
> "Hash Left Join  (cost=51417.93..974563.27 rows=2241518 width=1276)"
> "  Hash Cond: ("*SELECT* 1".country_id = co.id)"
> "  ->  Hash Left Join  (cost=51415.40..941722.50 rows=2241518 width=1271)"
> "        Hash Cond: ("*SELECT* 1".state_id = s.id)"
> "        ->  Hash Left Join  (cost=51373.45..910859.68 rows=2241518 width=1263)"
> "              Hash Cond: (t.partner_delivery_id = pd.id)"
> "              Join Filter: (t.partner_id IS NOT NULL)"
> "              ->  Hash Left Join  (cost=32280.78..854175.26 rows=2241518 width=1256)"
> "                    Hash Cond: ("*SELECT* 1".city_id = c.id)"
> "                    ->  Hash Join  (cost=24183.20..792841.63 rows=2241518 width=1249)"
> "                          Hash Cond: ("*SELECT* 1".id = t.member_id)"
> "                          ->  Append  (cost=0.00..148254.38 rows=3008749 width=168)"
> "                                ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..140223.96 rows=3008748 width=168)"
> "                                      ->  Seq Scan on livra_users  (cost=0.00..110136.48 rows=3008748 width=168)"
> "                                ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..8030.42 rows=1 width=60)"
> "                                      ->  Result  (cost=0.00..8030.41 rows=1 width=60)"
> "                                            One-Time Filter: false"
> "                                            ->  Seq Scan on external_users  (cost=0.00..8030.41 rows=1 width=60)"
> "                          ->  Hash  (cost=24181.34..24181.34 rows=149 width=188)"
> "                                ->  Hash Left Join  (cost=21650.42..24181.34 rows=149 width=188)"
> "                                      Hash Cond: (u.url_batch_id = ub.id)"
> "                                      ->  Nested Loop Left Join  (cost=20828.08..23355.84 rows=149 width=115)"
> "                                            ->  Merge Left Join  (cost=20828.08..20841.04 rows=149 width=44)"
> "                                                  Merge Cond: (t.member_id = t.member_id)"
> "                                                  ->  Sort  (cost=435.90..436.27 rows=149 width=32)"
> "                                                        Sort Key: t.member_id"
> "                                                        ->  Index Scan using idx_tracks_survey_id_track_status_id on tracks t  (cost=0.00..430.52 rows=149 width=32)"
> "                                                              Index Cond: ((survey_id = 5936) AND (track_status_id = 5))"
> "                                                  ->  Sort  (cost=20392.18..20398.28 rows=2440 width=20)"
> "                                                        Sort Key: t.member_id"
> "                                                        ->  Nested Loop  (cost=0.00..20254.90 rows=2440 width=20)"
> "                                                              ->  Index Scan using idx_tracks_survey_id_track_status_id on tracks t  (cost=0.00..2010.03 rows=712 width=20)"
> "                                                                    Index Cond: ((survey_id = 5936) AND (track_status_id = 7))"
> "                                                              ->  Index Scan using idx_track_logs_track_id on track_logs tl2  (cost=0.00..25.59 rows=3 width=16)"
> "                                                                    Index Cond: (tl2.track_id = t.id)"
> "                                                                    Filter: (tl2.track_status_id = 8)"
> "                                            ->  Index Scan using urls_pkey on urls u  (cost=0.00..16.87 rows=1 width=87)"
> "                                                  Index Cond: (u.id = t.url_id)"
> "                                                  Filter: (u.survey_id = 5936)"
> "                                      ->  Hash  (cost=637.15..637.15 rows=14815 width=81)"
> "                                            ->  Seq Scan on url_batchs ub  (cost=0.00..637.15 rows=14815 width=81)"
> "                    ->  Hash  (cost=4578.37..4578.37 rows=281537 width=15)"
> "                          ->  Seq Scan on cities c  (cost=0.00..4578.37 rows=281537 width=15)"
> "              ->  Hash  (cost=18270.17..18270.17 rows=65799 width=19)"
> "                    ->  Hash Left Join  (cost=8842.48..18270.17 rows=65799 width=19)"
> "                          Hash Cond: (pd.originator_id = opd.id)"
> "                          ->  Seq Scan on partner_deliveries pd  (cost=0.00..8019.99 rows=65799 width=19)"
> "                          ->  Hash  (cost=8019.99..8019.99 rows=65799 width=8)"
> "                                ->  Seq Scan on partner_deliveries opd  (cost=0.00..8019.99 rows=65799 width=8)"
> "        ->  Hash  (cost=24.20..24.20 rows=1420 width=16)"
> "              ->  Seq Scan on states s  (cost=0.00..24.20 rows=1420 width=16)"
> "  ->  Hash  (cost=1.68..1.68 rows=68 width=13)"
> "        ->  Seq Scan on countries co  (cost=0.00..1.68 rows=68 width=13)"
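To see how far off that node really is, the suspect join can be run in
isolation under EXPLAIN ANALYZE, so estimated and actual row counts can be
compared directly. A reduced sketch, assuming the misestimate survives
stripping the unrelated joins:

-- Reduced reproducer: only the tracks / all_users_v join from the
-- original query, with the same filters on tracks.
EXPLAIN ANALYZE
SELECT count(*)
FROM tracks t
INNER JOIN all_users_v lu
    ON lu.id = t.member_id
    AND lu.panel_source_id = 1
WHERE t.survey_id = 5936
  AND t.track_status_id IN (5);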
> The curious bit is the rowcount (2241518), which is grossly
> misestimated. It gets to that rowcount when joining the all_users_v
> view with tracks. Both partial results are estimated at ~150 rows
> (more or less on target), and it's a join of an int PK to an int
> column, so I cannot imagine how that join could result in 2M rows.
> What is pg thinking to get to that number?
>
> Even a full cross product couldn't get that high.
>
> Performance isn't impacted; the plan, even with the misestimation, is
> near optimal. But I can imagine this kind of misestimation wreaking
> havoc in other situations.

Looks like these reports could be related:

http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php
http://archives.postgresql.org/pgsql-hackers/2011-08/msg01388.php

Tom Lane tracked these down to a likely cause, but AFAICT it has not
been fixed yet.

Regards,
Marti
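P.S. If you want to narrow it down on your side, one probe (a sketch, not a
fix) is to bypass the UNION ALL view and join the base table directly. If
the estimate then comes back sane, it is the append relation that is
defeating the selectivity estimate:

-- Same reduced join as above, but against the base table rather than
-- the UNION ALL view; compare the estimated rows with the earlier plan.
EXPLAIN
SELECT count(*)
FROM tracks t
INNER JOIN livra_users lu
    ON lu.id = t.member_id
WHERE t.survey_id = 5936
  AND t.track_status_id IN (5);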