
Re: n_distinct off by a factor of 1000

show default_statistics_target; --> 500
ALTER TABLE public.bigtable ALTER COLUMN instrumentid_ref SET STATISTICS 5000;
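
For completeness, a quick way to confirm the per-column target was actually applied before re-analyzing is to read pg_attribute.attstattarget (just a sketch):

SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'public.bigtable'::regclass
  AND attname = 'instrumentid_ref';
-- should show 5000 after the ALTER above; the new target only takes effect at the next ANALYZE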

Here is the output of "ANALYZE VERBOSE bigtable;":
INFO:  analyzing "public.bigtable" inheritance tree
INFO:  "bigtable_y2018": scanned 622250 of 10661013 pages, containing 11994670 live rows and 5091 dead rows; 622250 rows in sample, 205504753 estimated total rows
INFO:  "bigtable_y2019": scanned 520159 of 8911886 pages, containing 10017582 live rows and 6148 dead rows; 520159 rows in sample, 171631268 estimated total rows
INFO:  "bigtable_y2020": scanned 357591 of 6126616 pages, containing 7031238 live rows and 1534 dead rows; 357591 rows in sample, 120466385 estimated total rows
INFO:  analyzing "public.bigtable_y2018"
INFO:  "bigtable_y2018": scanned 1500000 of 10661013 pages, containing 28915115 live rows and 12589 dead rows; 1500000 rows in sample, 205509611 estimated total rows
INFO:  analyzing "public.bigtable_y2019"
INFO:  "bigtable_y2019": scanned 1500000 of 8911886 pages, containing 28888514 live rows and 17778 dead rows; 1500000 rows in sample, 171634096 estimated total rows
INFO:  analyzing "public.bigtable_y2020"
INFO:  "bigtable_y2020": scanned 1500000 of 6126616 pages, containing 29488967 live rows and 6330 dead rows; 1500000 rows in sample, 120445051 estimated total rows
INFO:  analyzing "public.bigtable_y2021"
INFO:  "bigtable_y2021": scanned 1 of 1 pages, containing 8 live rows and 0 dead rows; 8 rows in sample, 8 estimated total rows
ANALYZE


On the comment from Adrian:

SELECT
  (SELECT SUM(x) FROM UNNEST(most_common_freqs) x) AS frac_MCV,
  n_distinct,
  array_length(most_common_vals, 1) AS n_mcv,
  array_length(histogram_bounds, 1) AS n_hist,
  tablename
FROM pg_stats
WHERE
  schemaname = 'public'
  AND attname LIKE 'instrumentid_ref';

 frac_MCV   | n_distinct | n_mcv | n_hist | tablename
------------+------------+-------+--------+----------------
 0.9205394  |     122160 |  2140 |   5001 | bigtable
 0.9203018  |     124312 |  1736 |   5001 | bigtable_y2018
 0.9258158  |     113846 |  2107 |   5001 | bigtable_y2020
 0.875      |     -0.375 |     2 |        | bigtable_y2021
 0.92304045 |     118267 |  2204 |   5001 | bigtable_y2019

select count(distinct instrumentid_ref) from bigtable; --> 33 385 922
Bigtable's instrumentid_ref is underestimated by roughly 300X even with a statistics target of 5000 on the column. Pretty weird.
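
For what it's worth, the same comparison can be made per partition to see how far off each one is (a sketch; these counts are expensive at this table size):

SELECT 'bigtable_y2018' AS part, count(distinct instrumentid_ref) AS actual FROM bigtable_y2018
UNION ALL
SELECT 'bigtable_y2019', count(distinct instrumentid_ref) FROM bigtable_y2019
UNION ALL
SELECT 'bigtable_y2020', count(distinct instrumentid_ref) FROM bigtable_y2020;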


K

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 23, 2020 3:07 PM, Klaudie Willis <Klaudie.Willis@xxxxxxxxxxxxxx> wrote:

I didn't run it with "verbose" but otherwise, yes, several times.  I can do it again with verbose if you are interested in the output.  Just give me some time; 500M rows and 50 columns is no small job :)


K

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 23, 2020 2:51 PM, Ron <ronljohnsonjr@xxxxxxxxx> wrote:

Maybe I missed it, but did you run "ANALYZE VERBOSE bigtable;"?


On 6/23/20 7:42 AM, Klaudie Willis wrote:
Friends,

I run PostgreSQL 12.3 on Windows. I have just discovered a pretty significant problem with PostgreSQL and my data.  I have a large table, 500M rows and 50 columns, split into 3 partitions by year.  In addition to the primary key, one of the columns is indexed, and I do lookups on it.

Select * from bigtable b where b.instrumentid_ref in (x,y,z,...)
limit 1000

It responded in sub-second time, using the index on the column.  However, when I changed it to:

Select * from bigtable b where b.instrumentid_ref in (x,y,z,...)
limit 10000 -- (notice 10K now)

The planner decided to do a full table scan of the entire 500M-row table! And that did not work very well.  At first I had no clue why it did so; when I disabled sequential scans, the query returned immediately.  But I should not have to do that.
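
(Roughly how I checked it, with placeholder ids in the IN list; disabling seqscan is session-local and only a diagnostic, not a fix:)

EXPLAIN
Select * from bigtable b where b.instrumentid_ref in (1, 2, 3)
limit 10000;

SET enable_seqscan = off;
EXPLAIN
Select * from bigtable b where b.instrumentid_ref in (1, 2, 3)
limit 10000;
RESET enable_seqscan;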

I got my first hint of why this problem occurs when I looked at the statistics.  For the column in question, "instrumentid_ref", the statistics claim:

default_statistics_target = 500, and ANALYZE has been run.
select * from pg_stats where attname like 'instr%_ref'; -- n_distinct: ~40 000
select count(distinct instrumentid_ref) from bigtable; -- Result: 33 385 922 (!!)
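
(To pull just the estimate rather than the whole pg_stats row, something like this works:)

SELECT tablename, n_distinct
FROM pg_stats
WHERE schemaname = 'public'
  AND attname like 'instr%_ref';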

That is an astonishing difference of almost 1000X.

When the planner thinks there are only 40K distinct values, it makes sense to switch to a table scan in order to fill the LIMIT 10000.  But it is wrong, very wrong, and the query takes hundreds of seconds instead of a few.

I have tried increasing the statistics target to 5000, and it helps, but it only reduces the error to about 100X.  Still crazy high.

I understand that this is a known problem.  I have read previous posts about it, but I have never seen anyone report such a large difference factor.

I have considered these fixes:
- hardcode n_distinct to a particular ratio of the total number of rows (see the sketch after this list)
- randomize the rows more, so that the sample does not suffer from page clustering.  However, this probably has other implications
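
A sketch of the first option: a negative n_distinct is interpreted as a fraction of the row count, and 33.4M distinct out of ~500M rows is roughly 0.067 (the exact value, and whether the partitions need their own override too, are things I would still have to verify):

ALTER TABLE public.bigtable
  ALTER COLUMN instrumentid_ref SET (n_distinct_inherited = -0.067);
-- each partition can be pinned the same way with the plain n_distinct option, e.g.
-- ALTER TABLE public.bigtable_y2019 ALTER COLUMN instrumentid_ref SET (n_distinct = ...);
ANALYZE bigtable;  -- the override is only picked up at the next ANALYZE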

Feel free to comment :)


K


--
Angular momentum makes the world go 'round.


