I have some databases that have grown significantly over time (as
databases do). As the databases have grown, I have noticed that the
statistics have grown less and less accurate. In particular, the
n_distinct estimates have ended up several orders of magnitude too
small for certain foreign key columns. Predictably, this leads to
poor query plans.
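To put rough numbers on that (using the figures shown below): for an
equality match on a non-MCV value, the planner's row estimate is
roughly reltuples / n_distinct. With ~195.6M rows and ~15.2M truly
distinct st_ids, a lookup on one st_id should match about 13 rows;
with n_distinct stuck near 15000 the planner instead expects about
13000 rows, a thousandfold overestimate that's plenty to push it off
an index scan.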
The databases in question were all using the default stats target
value, so naturally the thing to do is to increase that and see what
happens. First I'll show you one table in question:
qa_full=# \d fk
                      Table "public.fk"
    Column     |            Type             |   Modifiers
---------------+-----------------------------+---------------
 fk_id         | bigint                      | not null
 st_id         | bigint                      | not null
 is_positive   | boolean                     | not null
 mc_id         | character varying(20)       | not null
 matching_seed | character varying(20)       |
 ft_id         | character varying(20)       |
 s_title       | text                        | not null
 a_summary     | text                        | not null
 date_created  | timestamp without time zone | default now()
 qx_id         | bigint                      |
Indexes:
    "fk_pkey" PRIMARY KEY, btree (fk_id)
    "fk_st_mc_id_idx" UNIQUE, btree (st_id, mc_id)
    "fk_date_created_is_positive_idx" btree (is_positive, date_created)
    "fk_st_id_idx" btree (st_id)
Foreign-key constraints:
    "fk_qx_id_fkey" FOREIGN KEY (qx_id) REFERENCES st(st_id) ON DELETE RESTRICT
    "fk_st_id_fkey" FOREIGN KEY (st_id) REFERENCES st(st_id) ON DELETE RESTRICT
qa_full=# select count(*) from fk;
   count
-----------
 195555889
(1 row)
Here are the n_distinct stats on the st_id column with stock stats
settings:
qa_full=# select attname, n_distinct from pg_stats where tablename='fk'
and attname='st_id';
 attname | n_distinct
---------+------------
 st_id   |      14910
(1 row)
Here's the actual distinct count:
qa_full=# select count(distinct st_id) from fk;
  count
----------
 15191387
(1 row)
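For reference, each bump below amounts to something like this (the
per-column form; setting default_statistics_target before the ANALYZE
would work too, and the new target has no effect until ANALYZE runs
again):

    -- Raise the sampling target for just this column, then refresh
    -- the statistics.
    ALTER TABLE fk ALTER COLUMN st_id SET STATISTICS 100;  -- later, 1000
    ANALYZE fk;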
Here's what it looks like after turning the stats target up to 100:
qa_full=# select attname, n_distinct from pg_stats where tablename='fk'
and attname='st_id';
 attname | n_distinct
---------+------------
 st_id   |     136977
(1 row)
Still off by roughly two orders of magnitude, so let's pull out all
the stops and go for 1000:
qa_full=# select attname, n_distinct from pg_stats where tablename='fk'
and attname='st_id';
 attname | n_distinct
---------+------------
 st_id   |     860796
(1 row)
Better, but still way off (a factor of ~18). For the curious, here's
more of the pg_stats row with the stats target at 1000:
schemaname        | public
tablename         | fk
attname           | st_id
null_frac         | 0
avg_width         | 8
n_distinct        | 860796
most_common_vals  | {9822972459012807,81553350123749183,50260420266636724,
                    16953859416556337,57992478091506908,6789385517968759,
                    13155841310992808,4649594156182905,11950505984130111,
                    19815690615418387,23232929805154508,24940819255590358,
                    25304517086243633,30084673952005845,33845252828401578,
                    36510232790970904,44301350711321256,47572440754042499,
                    66302045808587415,106949745150210138,7948257888859857,
                    11709841786637953,12034360925626832,17311819170902574,
                    21933556169120032,31401742852411043,37178443803282644,
                    39714175315169346,42699954975194688,63648700912541567,
                    73785794393665562,...many elided..}
most_common_freqs | {7.33333e-05,6.66667e-05,5.33333e-05,5e-05,5e-05,
                    4.66667e-05,4.66667e-05,4.33333e-05,4.33333e-05,4.33333e-05,
                    4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,
                    4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,4.33333e-05,
                    4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,
                    4e-05,4e-05,4e-05,3.66667e-05,3.66667e-05,3.66667e-05,
                    3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,
                    3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,3.66667e-05,
                    3.66667e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,3.33333e-05,
                    3.33333e-05,3e-05,3e-05,3e-05,3e-05,3e-05,3e-05,3e-05,3e-05,
                    3e-05,..many elided..}
histogram_bounds  | {9474697855526,186642098833097,425502410065792,
                    655064117100237,917344884999940,1135224280975580,
                    1510900775316064,1919850381534192,2391828632704465,
                    2773745714634569,3197981109338899,3601128214604953,
                    3887435029566307,4289757501117626,4604286546172963,
                    5030605000015434,5410915764179364,5712662986537560,
                    6096452674229658,6531206443844232,6761515475182966,
                    6924281850823004,7145897868348599,7357502317108796,
                    7537560231072453,7737194605867515,7923617661480232,
                    8094845122681350,8304911973154200,8504211340608556,
                    8735469559703009,9008968782181381,9233161779966219,
                    ..many elided..}
correlation       | 0.770339
The correlation is likely high here because this table has been
clustered on this column in the past. I don't know whether that
contributes to the n_distinct inaccuracy, and I don't know if I have
the patience to reorder the table to find out ;^)
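(For the record, the experiment would presumably be something like
the sketch below; CLUSTER rewrites the whole table, hence my
reluctance at 195M rows. The CLUSTER ... USING form is the newer
syntax.)

    -- Reorder the table on a different index to break the st_id
    -- ordering, then refresh the stats (full rewrite; slow at this size).
    CLUSTER fk USING fk_pkey;
    ANALYZE fk;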
Note that new st_ids are also being added all the time, at a rate
roughly proportional to the fk rows (fk rows being added more
frequently), so a fractional value for n_distinct would actually be
ideal here: a negative n_distinct is interpreted as a fraction of the
row count, so the estimate would scale as the table grows. The docs
hint that ANALYZE will sometimes decide to use a fractional
(negative) value on its own. What triggers that?
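If ANALYZE never picks one on its own, versions with per-column
attribute options (PostgreSQL 9.0 and later, if I read the docs
right) at least allow pinning a fractional value by hand; a minimal
sketch using the ratio measured above (15191387 / 195555889 ≈ 0.078):

    -- Pin n_distinct to a fraction of the row count so the estimate
    -- scales as the table grows; takes effect at the next ANALYZE.
    ALTER TABLE fk ALTER COLUMN st_id SET (n_distinct = -0.078);
    ANALYZE fk;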
I was also trying to figure out how big the sample really is. Does a
stats target of 1000 mean 1000 rows sampled? If the sample really is
a fixed number of rows, it would seem to my naive eyes that sampling
a fraction of the rows (like 0.1% or something) would be better
(especially in cases like this), but maybe it already tries to do that.
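For scale, here's the back-of-envelope query I have in mind; the 0.1%
is just my made-up example, not anything I know PostgreSQL to use:

    -- Hypothetical: how many rows a 0.1% sample of this table would be.
    -- (Whether ANALYZE samples a fixed count or a fraction is the question.)
    SELECT count(*)                   AS total_rows,
           (count(*) * 0.001)::bigint AS hypothetical_sample_rows
    FROM fk;

On 195555889 rows that works out to roughly 195556 sampled rows.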
Any insights appreciated.
-Casey