Hi all,
I have an application that runs in production in multiple instances, and
on one of these the performance of certain queries suddenly became truly
abysmal. I basically know why, but I would much appreciate it if I could
gain a deeper understanding of the selectivity function involved and of
any possible means of making Postgres choose a better plan. In the
following I have tried to boil the problem down to something manageable:
The schema contains two tables, t1 and t2.
t2 has two fields, an id and a tag, and it contains 146 rows, all with
unique tags.
t1 has two fields, a value and a foreign key referring to t2.id, and it
contains 266177 rows.
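For reference, the relevant part of the schema looks roughly like this
(column types guessed from the ::text casts in the plans below; the real
tables have additional columns, as the row widths suggest):

CREATE TABLE t2 (
    id  text PRIMARY KEY,
    tag text
);
CREATE INDEX t2_tag_idx ON t2 (tag);

CREATE TABLE t1 (
    value text,
    t2_id text REFERENCES t2 (id)
);
CREATE INDEX t1_t2_id_idx ON t1 (t2_id);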
The application retrieves the rows in t1 that match a specific tag in
t2, and it turned out that the contents of t1 were distributed in a very
lopsided way, where more than 90% of the rows refer to one of two tags
from t2:
EXPLAIN SELECT * FROM t1 WHERE t2_id = '<some_id>'

Index Scan using t1_t2_id_idx on t1  (cost=0.42..7039.67 rows=103521 width=367)
  Index Cond: (t2_id = '<some_id>'::text)
The row count estimate is exactly as expected; about 39% of the rows
refer to that specific tag.
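(I assume the accurate estimate comes from the MCV statistics on
t1.t2_id; they can be inspected with a standard pg_stats query along
these lines, using the table and column names from the plan above.)

SELECT most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 't1' AND attname = 't2_id';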
What the application actually does is
EXPLAIN SELECT * FROM t1 INNER JOIN t2 ON t1.t2_id = t2.id WHERE t2.tag = '<some_tag>'

Nested Loop  (cost=0.69..3152.53 rows=1824 width=558)
  ->  Index Scan using t2_tag_idx on t2  (cost=0.27..2.29 rows=1 width=191)
        Index Cond: (tag = '<some_tag>'::text)
  ->  Index Scan using t1_t2_id_idx on t1  (cost=0.42..3058.42 rows=9182 width=367)
        Index Cond: (t2_id = t2.id)
The estimate for the number of rows in the result (1824) is way too low;
in queries that join more tables on top of this, it leads to bad plans
that run about 1000x slower than they should.
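As far as I understand, the planner cannot know at plan time which t2.id
the tag will resolve to, so the join selectivity falls back to the
average number of t1 rows per distinct t2_id, which is exactly what the
estimate looks like:

SELECT n_distinct FROM pg_stats WHERE tablename = 't1' AND attname = 't2_id';
-- presumably around 146 (one distinct id per t2 row)
SELECT 266177 / 146.0;
-- ~1823, essentially the 1824 rows estimated for the nested loop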
I have currently rewritten the application code to do two queries: one
to retrieve the id from t2 that matches the given tag and one to
retrieve the rows from t1. That's a usable workaround, but not something
we really like as a permanent solution. Fiddling with the various
statistics-related knobs seems to make no difference, but is there some
other way I can make Postgres assume high selectivity for certain tag
values? Am I just SOL with the given schema?
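For reference, the two-query workaround is essentially (placeholders as
above):

SELECT id FROM t2 WHERE tag = '<some_tag>';
SELECT * FROM t1 WHERE t2_id = '<id_from_previous_query>';

and the statistics knobs I have fiddled with are along the lines of
(the target value is just an example):

ALTER TABLE t1 ALTER COLUMN t2_id SET STATISTICS 10000;
ANALYZE t1;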
Any pointers to information about how to handle potentially lopsided
data like this are highly welcome.
Best regards,
Mikkel Lauritsen