It has been noted in various places that ANALYZE does not estimate n_distinct accurately on large tables. For example, here: https://www.postgresql-archive.org/serious-under-estimation-of-n-distinct-for-clustered-distributions-tt5738290.html#a5738707

ANALYZE performs as well as it does because it works from a random page-level sample. With the default statistics target of 100, that corresponds to a sample of about 30,000 rows. If the table cardinality is very large, the extrapolated estimate can have a lot of error; if a column has many nulls, the effective sample size is smaller still; and if the values are correlated with heap order, n_distinct is under-estimated even more.

What may be surprising is just how bad things are on a table containing a few tens of millions of rows, with columns that have realistic data distributions. Such is the case on the Internet Movie Database, where with default_statistics_target = 100 I see errors as large as 190x, several columns with errors in the 50x - 75x range, a bunch that are off by roughly 10x, and a bevy that are off by 2x. If that is not surprising, it may yet be surprising that increasing default_statistics_target to 1000 doesn't reduce some of the big errors all that much. On a few columns that I noticed first, there were still errors of 50x, 30x, and 35x, and these may not be the largest. The errors are still huge, even at 1000.

n_distinct is the basis for non-frequent equality selectivity, and it also feeds into the threshold that determines how many MCVs are collected, so an accurate n_distinct is important for getting good plans. Estimating distinct values from a sample is an intrinsically hard problem: there are distributions that require sampling nearly all of the data to get an estimate with 5% error.

Well, we had an RDS hackathon last week, and I wrote a little script that estimates n_distinct accurately. It is accurate because it scans all rows, but it gains efficiency by estimating many or all columns in a single pass. It can do many columns on the same scan by using the HyperLogLog algorithm to estimate the distinct counts in a small amount of memory (1.2 kB per column), and it gains further performance by using a type-specific hash function, which avoids a dynamic dispatch at run time. If there are more columns than can be handled in a single pass, it chunks the columns into groups and handles as many per pass as it can. Since it was a hackathon, I also added a hack-around for partitioned tables: if a child table's name is a prefix of the name of its parent, the child is analyzed too.

At the moment it is implemented as a SQL script that defines a schema, a couple of tables, a view, and a couple of pl/pgsql functions. One function, currently named 'analyze', accepts a comma-separated list of schema names, a comma-separated list of table names, and a flag that says whether to apply it to foreign tables that can handle an approximate count-distinct function call (external tables not tested yet). It calculates all the n_distinct values and applies them with ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = v), and it also inflates the statistics target for any column with a high null_frac. The other function resets all the column overrides back to their defaults for the specified schemas and tables, in case you want to go back to the way it was.
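To make the mechanics concrete, here is a rough sketch of the two pieces in plain SQL. This is not the script itself: the table and column names (cast_info, person_id, movie_id, note) are just placeholders from the IMDB schema, the first query uses the community postgresql-hll extension purely to illustrate the single-pass, multi-column HyperLogLog idea, and the override value further down is made up.

    -- Illustration only (not the script): estimate several columns in one scan
    -- using the postgresql-hll extension. Each aggregate keeps a small
    -- fixed-size sketch, so adding columns costs memory, not extra heap passes.
    -- CREATE EXTENSION hll;
    SELECT hll_cardinality(hll_add_agg(hll_hash_integer(person_id))) AS nd_person_id,
           hll_cardinality(hll_add_agg(hll_hash_integer(movie_id)))  AS nd_movie_id,
           hll_cardinality(hll_add_agg(hll_hash_text(note)))         AS nd_note
    FROM cast_info;

The overrides themselves are just per-column attribute options, so applying and later undoing them boils down to something like this:

    -- Apply an override (the value here is made up) and a larger statistics
    -- target for a high-null_frac column, then re-ANALYZE so pg_statistic
    -- reflects them:
    ALTER TABLE cast_info ALTER COLUMN person_id SET (n_distinct = 4000000);
    ALTER TABLE cast_info ALTER COLUMN note SET STATISTICS 400;
    ANALYZE cast_info;

    -- To go back to the way it was:
    ALTER TABLE cast_info ALTER COLUMN person_id RESET (n_distinct);
    ALTER TABLE cast_info ALTER COLUMN note SET STATISTICS -1;
    ANALYZE cast_info;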
Since updating pg_statistic directly requires superuser privileges, you run the aux_stats.analyze function and then run an ordinary ANALYZE, which picks up the overrides. Before running ANALYZE, you can view the stats as currently reflected in pg_statistic alongside what the new stats will be.

We would like to offer this up to the community if there is interest in it, and if anyone would like to work with me to polish it up and try it on some large databases, to see what results we get, how long it takes, and so on. I envision this either as a script that we could post alongside some of the helpful scripts in our documentation, or possibly as an extension. It would be easy to package as an extension, but it's also pretty simple to use as just a script that you install with \i. I'm open to suggestions.

So, if you're interested in trying this out and making some helpful suggestions, please respond. After enough people kick the tires and are happy with it, I'll post the code.

/Jim Finnerty

-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL