Re: strange row count estimates with conditions on multiple column

Tomas Vondra <tv@xxxxxxxx> · Wed, 17 Nov 2010 06:22:06 +0100

Dne 17.11.2010 05:22, Tom Lane napsal(a):
> Tomas Vondra <tv@xxxxxxxx> writes:
>> Estimate for condition
>>    ... WHERE (col_a BETWEEN 33 AND 33)
>> is about 10k rows, which is quite precise. On the other side estimate
>> for condition
>>    ... WHERE (col_b BETWEEN 33 AND 33)
>> is 1 row, which is very imprecise (actual value is about 5000).
> 
> That's an artifact of your test case.  There are exactly 100 distinct
> values in col_a, which means that they all fit into the most_common_vals
> array (assuming you're using default_statistics_target = 100).  So the
> planner actually has complete information about the contents of col_a,
> modulo some sampling inaccuracy about the precise frequency of each
> value.  It should be expected to produce pretty good estimates for the
> sorts of expressions it can estimate, and it does.  In col_b, there are
> 200 values, so they can't be represented by most_common_vals, and in
> fact ANALYZE notices that none of them are really much more common than
> any other.  So it throws up its hands and doesn't generate an MCV list
> at all, just a histogram.  That doesn't provide a lot of foothold for
> the range estimator to give an exact estimate for a very narrow range.

Yes, I understand why MCV is not used in case of col_b, and I do
understand that the estimate may not be precise. But I'm wondering
what's a better estimate in such cases - 1, 5000, any constant, or
something related to a the histogram?

I'd probably vote for a size of the the histogram bucket (or maybe a
half of it), as it seems like a good upper bound, and I don't see why
any constant (1, 5000, whatever) should be better. But maybe I'm missing
something.

> If you look closely at what EXPLAIN is saying, that expression expands
> as
> 
>  Filter: ((col_b >= 33) AND ((col_b)::numeric <= 33.1))

Aha! Haven't noticed that missing '::numeric' - it'd be nice to make
this more visible. I bet I'm not the only one who haven't noticed this.

> So the things being compared to aren't the same, and it doesn't
> recognize this as a range constraint, and you get the dumb
> product-of-independent-inequality-probabilities estimate.
> 
> The "exact" estimate you got in the first case is more of a lucky guess
> than anything else, too.  It does realize that it's got a range
> constraint in that case, but it has no stats about the value of the
> expression col_b::numeric.  So you're just getting a default estimate
> that by coincidence matches the correct answer.

Aaaargh, it looked like a perfect estimate :-(

BTW I think the default estimate used to be 1000, so it was changed in
one of the 8.x releases? Can you point me to the docs? I've even tried
to find that in the sources, but unsuccessfully.

regards
Tomas

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general