Expected accuracy of planner statistics

Casey Duncan <casey@xxxxxxxxxxx> · Thu, 28 Sep 2006 15:19:46 -0700

I have some databases that have grown significantly over time (as  
databases do). As the databases have grown, I have noticed that the  
statistics have grown less and less accurate. In particular, the  
n_distinct values have become many OOM too small for certain foreign  
key columns. Predictably this leads to poor query plans.

The databases in question were all using the default stats target  
value, so naturally the thing to do is to increase that and see what  
happens. First I'll show you one table in question:

qa_full=# \d fk
                   Table "public.fk"
     Column   |            Type             |   Modifiers
--------------+-----------------------------+---------------
fk_id         | bigint                      | not null
st_id         | bigint                      | not null
is_positive   | boolean                     | not null
mc_id         | character varying(20)       | not null
matching_seed | character varying(20)       |
ft_id         | character varying(20)       |
s_title       | text                        | not null
a_summary     | text                        | not null
date_created  | timestamp without time zone | default now()
qx_id         | bigint                      |
Indexes:
    "fk_pkey" PRIMARY KEY, btree (fk_id)
    "fk_st_mc_id_idx" UNIQUE, btree (st_id, mc_id)
    "fk_date_created_is_positive_idx" btree (is_positive, date_created)
    "fk_st_id_idx" btree (st_id)
Foreign-key constraints:
    "fk_qx_id_fkey" FOREIGN KEY (qx_id) REFERENCES st(st_id) ON  
DELETE RESTRICT
    "fk_st_id_fkey" FOREIGN KEY (st_id) REFERENCES st(st_id) ON  
DELETE RESTRICT

qa_full=# select count(*) from fk;
   count
-----------
195555889

Here are the n_distinct stats on the st_id column with stock stats  
settings:

qa_full=# select n_distinct from pg_stats where tablename='fk' and  
attname='st_id'';
  attname  | n_distinct
-----------+-------------
st_id      |      14910

here's the actual distinct count:

qa_full=# select count(distinct st_id) from fk;
count
----------
15191387
(1 row)

Here's what it looks like after turning the stats target up to 100:

qa_full=# select n_distinct from pg_stats where tablename='fk' and  
attname='st_id'';
  attname  | n_distinct
-----------+-------------
st_id      |     136977

Still way off (3 OOM), so let's pull out the stops and go for 1000:

qa_full=# select n_distinct from pg_stats where tablename='fk' and  
attname='st_id'';
  attname  | n_distinct
-----------+-------------
st_id      |      860796

Better, but still way off. Here's more of the pg_stats row for the  
curious with the stats target at 1000:

schemaname        | public
tablename         | fk
attname           | st_id
null_frac         | 0
avg_width         | 8
n_distinct        | 860796
most_common_vals  |  
{9822972459012807,81553350123749183,50260420266636724,16953859416556337, 
57992478091506908,6789385517968759,13155841310992808,4649594156182905,11 
950505984130111,19815690615418387,23232929805154508,24940819255590358,25 
304517086243633,30084673952005845,33845252828401578,36510232790970904,44 
301350711321256,47572440754042499,66302045808587415,106949745150210138,7 
948257888859857,11709841786637953,12034360925626832,17311819170902574,21 
933556169120032,31401742852411043,37178443803282644,39714175315169346,42 
699954975194688,63648700912541567,73785794393665562,...many elided..}
most_common_freqs |  
{7.33333e-05,6.66667e-05,5.33333e-05,5e-05,5e-05,4.66667e-05,4.66667e-05 
, 
4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05, 
4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05, 
4.33333e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05, 
4e-05,4e-05,4e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.6666 
7e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.6666 
7e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.33333e-05,3.33333e-05,3.3333 
3e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.3333 
3e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.3333 
3e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.3333 
3e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3e-05,3e-05,3e-05, 
3e-05,3e-05,3e-05,3e-05,3e-05,3e-05,..many elided..}
histogram_bounds  |  
{9474697855526,186642098833097,425502410065792,655064117100237,917344884 
999940,1135224280975580,1510900775316064,1919850381534192,23918286327044 
65,2773745714634569,3197981109338899,3601128214604953,3887435029566307,4 
289757501117626,4604286546172963,5030605000015434,5410915764179364,57126 
62986537560,6096452674229658,6531206443844232,6761515475182966,692428185 
0823004,7145897868348599,7357502317108796,7537560231072453,7737194605867 
515,7923617661480232,8094845122681350,8304911973154200,8504211340608556, 
8735469559703009,9008968782181381,9233161779966219,..many elided..}
correlation       | 0.770339

The correlation is likely high here because this table has been  
clustered on this column in the past. I don't know if that  
contributes to the n_distinct inaccuracy, I don't know if I have the  
patience to reorder the table to find out ;^)

Note that new st_ids are also being added all the time, at a rate  
roughly proportional to fk rows (fk rows being added more  
frequently). So actually a fractional value for the n_distinct here  
would be more ideal. The docs hint that analyze will sometimes decide  
to use a fractional (negative) value. What triggers that?

I was also trying to figure out how big the sample really is. Does a  
stats target of 1000 mean 1000 rows sampled? If the sample really is  
a fixed number of rows, it would seem to my naive eyes that sampling  
a fraction of the rows (like 0.1% or something) would be better  
(especially in cases like this), but maybe it already tries to do that.

Any insights appreciated.

-Casey