Re: strange row count estimates with conditions on multiple column

Tom Lane <tgl@xxxxxxxxxxxxx> · Tue, 16 Nov 2010 23:22:45 -0500

Tomas Vondra <tv@xxxxxxxx> writes:
> Estimate for condition
>    ... WHERE (col_a BETWEEN 33 AND 33)
> is about 10k rows, which is quite precise. On the other side estimate
> for condition
>    ... WHERE (col_b BETWEEN 33 AND 33)
> is 1 row, which is very imprecise (actual value is about 5000).

That's an artifact of your test case.  There are exactly 100 distinct
values in col_a, which means that they all fit into the most_common_vals
array (assuming you're using default_statistics_target = 100).  So the
planner actually has complete information about the contents of col_a,
modulo some sampling inaccuracy about the precise frequency of each
value.  It should be expected to produce pretty good estimates for the
sorts of expressions it can estimate, and it does.  In col_b, there are
200 values, so they can't be represented by most_common_vals, and in
fact ANALYZE notices that none of them are really much more common than
any other.  So it throws up its hands and doesn't generate an MCV list
at all, just a histogram.  That doesn't provide a lot of foothold for
the range estimator to give an exact estimate for a very narrow range.

> I've been playing with this a little bit, and I've noticed another
> 'strange' thing. When I rewrite the condition like this

>    ... WHERE (col_b BETWEEN 32.9 AND 33.1)

> so that the range is not of zero length, and everything works fine, the
> estimate is exactly 5000. But when I increase the lower bound a bit

>    ... WHERE (col_b BETWEEN 33 AND 33.1)

> the estimate suddenly jumps to 276667.

If you look closely at what EXPLAIN is saying, that expression expands
as

 Filter: ((col_b >= 33) AND ((col_b)::numeric <= 33.1))

So the things being compared to aren't the same, and it doesn't
recognize this as a range constraint, and you get the dumb
product-of-independent-inequality-probabilities estimate.

The "exact" estimate you got in the first case is more of a lucky guess
than anything else, too.  It does realize that it's got a range
constraint in that case, but it has no stats about the value of the
expression col_b::numeric.  So you're just getting a default estimate
that by coincidence matches the correct answer.

			regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general