Re: Setting Statistics on Functional Indexes

Tom Lane <tgl@xxxxxxxxxxxxx> · Fri, 26 Oct 2012 17:08:18 -0400

Shaun Thomas <sthomas@xxxxxxxxxxxxxxxx> writes:
> But I just noticed the lag in your response. :) It turns out, even 
> though I was substituting 2012-10-24 or 2012-10-25, what I really meant 
> was current_date. That does make all the difference, actually.

Ah.  [ pokes at that for awhile... ]  OK, this has nothing to do with
functional indexes, and everything to do with the edge-case behavior of
scalarltsel.  What you've got is a histogram whose last entry
(corresponding to the highest observed value of the date) is
current_date, and the question is what we should assume when estimating
how many rows have a value >= that.  The answer of course is "one, plus
any duplicates" ... but we don't know how many duplicates there are,
and what we do know is it's not a particularly large number because the
value isn't present in the most-common-values stats.  So the code there
assumes there aren't any dups.

Once you have enough histogram resolution for current_date to show up
as the next-to-last as well as the last histogram entry, then of course
the estimate gets a lot better, since we can now tell that there's at
least one histogram bin's worth of duplicates.

Interestingly, this is a case where the get_actual_variable_range patch
(commit 40608e7f, which appeared in 9.0) makes the results worse.
Before that, there was a (very arbitrary) lower bound on what we'd
believe as the selectivity of a >= condition, but now, when we know the
actual upper limit of the variable, we don't clamp the result that way.
I think the clamp must have been saving you in your previous version,
because it more-or-less-accidentally accounted for the end value not
being unique.

So the bottom line is that this is a case where you need a lot of
resolution in the histogram.  I'm not sure there's anything good
we can do to avoid that.   I spent a bit of time thinking about whether
we could use n_distinct to get some idea of how many duplicates there
might be for the endpoint value, but n_distinct is unreliable enough
that I can't develop a lot of faith in such a thing.  Or we could just
arbitarily assume some fraction-of-a-histogram-bin's worth of
duplicates, but that would make the results worse for some people.

			regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance