I'm running a rather complex query and noticed a peculiarity in the usage
of statistics that seriously affects the plan generated. I can extract the
relevant bit:
modmine-r9=# select * from pg_stats where tablename = 'geneflankingregion' AND attname IN ('distance', 'direction');
schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+--------------------+-----------+-----------+-----------+------------+----------------------------------+------------------------------------------------+------------------+-------------
public | geneflankingregion | distance | 0 | 6 | 5 | {5.0kb,0.5kb,1.0kb,2.0kb,10.0kb} | {0.201051,0.200798,0.200479,0.199088,0.198583} | | 0.197736
public | geneflankingregion | direction | 0 | 10 | 2 | {downstream,upstream} | {0.500719,0.499281} | | 0.495437
(2 rows)
modmine-r9=# SELECT COUNT(*) FROM geneflankingregion;
count
--------
455020
(1 row)
modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE distance = '10.0kb' AND direction = 'upstream';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on geneflankingregion (cost=0.00..15507.30 rows=45115 width=213) (actual time=0.053..181.764 rows=45502 loops=1)
Filter: ((distance = '10.0kb'::text) AND (direction = 'upstream'::text))
Total runtime: 227.245 ms
(3 rows)
modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE LOWER(distance) = '10.0kb' AND LOWER(direction) = 'upstream';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on geneflankingregion
(cost=66.95..88.77 rows=11 width=213)
(actual time=207.555..357.359 rows=45502 loops=1)
Recheck Cond: ((lower(distance) = '10.0kb'::text) AND (lower(direction) = 'upstream'::text))
-> BitmapAnd
(cost=66.95..66.95 rows=11 width=0)
(actual time=205.978..205.978 rows=0 loops=1)
-> Bitmap Index Scan on geneflankingregion__distance_equals
(cost=0.00..31.34 rows=2275 width=0)
(actual time=79.380..79.380 rows=91004 loops=1)
Index Cond: (lower(distance) = '10.0kb'::text)
-> Bitmap Index Scan on geneflankingregion__direction_equals
(cost=0.00..35.35 rows=2275 width=0)
(actual time=124.639..124.639 rows=227510 loops=1)
Index Cond: (lower(direction) = 'upstream'::text)
Total runtime: 401.740 ms
(8 rows)
When I wrap the fields in the constraints in a LOWER() function, the
planner stops looking at the statistics and makes a wild guess, even
though it is very obvious from just looking what the result should be.
Embedded in a much larger query, the inaccuracy in the number of rows (11
instead of 45502) causes major planning problems. Also, why does the
BitmapAnd say zero actual rows?
I understand this probably isn't Priority No. 1, and there are some
interesting corner cases when n_distinct is higher than the histogram
width, but would it be possible to fix this one up?
Matthew
--
I would like to think that in this day and age people would know better than
to open executables in an e-mail. I'd also like to be able to flap my arms
and fly to the moon. -- Tim Mullen
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance