It has been noted in various places that ANALYZE does not estimate n_distinct accurately on large tables. For example, here: https://www.postgresql-archive.org/serious-under-estimation-of-n-distinct-for-clustered-distributions-tt5738290.html#a5738707

ANALYZE performs as well as it does because it works from a random page-level sample. With the default statistics target of 100, that corresponds to a sample of about 30,000 rows. If the table cardinality is very large, the extrapolated estimate can have a lot of error; if a column has many nulls, the effective sample size is smaller still; and if the values are correlated with heap order, n_distinct is under-estimated even more.

What may be surprising is just how bad things are on a table containing a few tens of millions of rows, with columns that have realistic data distributions. Such is the case on the Internet Movie Database, where with default_statistics_target = 100 I see errors as large as 190x, several columns with errors in the 50x - 75x range, a bunch that are off by roughly 10x, and a bevy that are off by 2x. If that is not surprising, it may yet be surprising that increasing default_statistics_target to 1000 doesn't reduce some of the big errors all that much. On a few columns that I noticed first, there were still errors of 50x, 30x, and 35x, and these may not be the largest. The errors are still huge, even at 1000.

n_distinct is the basis for non-frequent equality selectivity, and it also feeds into the threshold that determines how many MCVs are collected, so an accurate n_distinct is important for getting good plans. Estimating distinct values from a sample is an intrinsically hard problem: there are distributions that require sampling nearly all of the data to get an estimate with 5% error.

Well, we had an RDS hackathon last week, and I wrote a little script that estimates n_distinct accurately. It is accurate because it scans all rows, but it gains efficiency by estimating many or all columns in a single pass. It can do many columns on the same scan by using the HyperLogLog algorithm to estimate the distinct counts in a small amount of memory (1.2 kB per column), and it gains further performance by using a type-specific hash function, which avoids a dynamic dispatch at run time. If there are more columns than can be handled in a single pass, it chunks the columns into groups and handles as many per pass as it can. Since it was a hackathon, I also added a hack-around for partitioned tables: if a child table's name is a prefix of the name of its parent, the child is analyzed too.

At the moment it is implemented as a SQL script that defines a schema, a couple of tables, a view, and a couple of pl/pgsql functions. One function, currently named 'analyze', accepts a comma-separated list of schema names, a comma-separated list of table names, and a flag that says whether to apply it to foreign tables that can handle an approximate count-distinct function call (external tables not tested yet). It calculates all the n_distinct values and applies them with ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = v), and it also inflates the statistics target for any column with a high null_frac. The other function resets all the column overrides back to their defaults for the specified schemas and tables, in case you want to go back to the way it was.
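To make the mechanics concrete, here is a rough sketch of the two pieces in plain SQL. This is not the script itself: the table and column names (cast_info, person_id, movie_id, note) are just placeholders from the IMDB schema, the first query uses the community postgresql-hll extension purely to illustrate the single-pass, multi-column HyperLogLog idea, and the override value further down is made up.

    -- Illustration only (not the script): estimate several columns in one scan
    -- using the postgresql-hll extension. Each aggregate keeps a small
    -- fixed-size sketch, so adding columns costs memory, not extra heap passes.
    -- CREATE EXTENSION hll;
    SELECT hll_cardinality(hll_add_agg(hll_hash_integer(person_id))) AS nd_person_id,
           hll_cardinality(hll_add_agg(hll_hash_integer(movie_id)))  AS nd_movie_id,
           hll_cardinality(hll_add_agg(hll_hash_text(note)))         AS nd_note
    FROM cast_info;

The overrides themselves are just per-column attribute options, so applying and later undoing them boils down to something like this:

    -- Apply an override (the value here is made up) and a larger statistics
    -- target for a high-null_frac column, then re-ANALYZE so pg_statistic
    -- reflects them:
    ALTER TABLE cast_info ALTER COLUMN person_id SET (n_distinct = 4000000);
    ALTER TABLE cast_info ALTER COLUMN note SET STATISTICS 400;
    ANALYZE cast_info;

    -- To go back to the way it was:
    ALTER TABLE cast_info ALTER COLUMN person_id RESET (n_distinct);
    ALTER TABLE cast_info ALTER COLUMN note SET STATISTICS -1;
    ANALYZE cast_info;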
Since updating pg_statistic directly requires superuser privileges, you run the aux_stats.analyze function and then run an ordinary ANALYZE, which picks up the overrides. Before running ANALYZE, you can view the stats as currently reflected in pg_statistic alongside what the new stats will be.

We would like to offer this up to the community if there is interest in it, and if anyone would like to work with me to polish it up and try it on some large databases, to see what results we get, how long it takes, and so on. I envision this either as a script that we could post alongside some of the helpful scripts in our documentation, or possibly as an extension. It would be easy to package as an extension, but it's also pretty simple to use as just a script that you install with \i. I'm open to suggestions.

So, if you're interested in trying this out and making some helpful suggestions, please respond. After enough people kick the tires and are happy with it, I'll post the code.

/Jim Finnerty

-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL