On Mon, 10 Nov 2014 09:39:23 -0500 Tom Lane <tgl@xxxxxxxxxxxxx> wrote: > Jehan-Guillaume de Rorthais <ioguix@xxxxxxx> writes: > > While investigating about a wrong result with my btree bloat estimation > > query, I found a strange stat deviation between the average size of a value > > in its table and its average size in one index on the "lower" expression. > > ANALYZE doesn't look at the physical index at all. For an expression > index, it will recompute the index expression at each sampled row, and > then take statistics on those values. In the case you've got here, it's > re-running the lower() function and looking at the uncompressed result. > So that accounts for the discrepancy in average width stats. Ok, understood. But in my demo scenario, I used *only* md5 to populate the test table. So data length is always 32 and the average width length is exact. No possible deviation, even with lower(). To be quite dense: postgres@test=# select length(lower(md5('a'))) = length(md5('b')), length(md5('c')) = length(md5('d')); -[ RECORD 1 ] ?column? | t ?column? | t And here is another test with a static string for all rows: postgres@test=# create table test as select '1234567890123456789012'::text as t from generate_series(1,100); SELECT 100 postgres@test=# create index ON test (lower(t)); CREATE INDEX postgres@test=# analyze test; ANALYZE postgres@test=# select tablename, attname, avg_width from pg_stats where schemaname = 'public'; tablename | attname | avg_width ----------------+---------+----------- test | t | 23 test_lower_idx | lower | 26 AFAIU, we should not have a discrepancy here. > > This tiny difference is the source of a very bad estimation with the > > Btree bloat estimation query when values are around an alignement > > boundary. > > TBH, if that query is relying on ANALYZE width estimates to be accurate > to the last byte, its reliability is going to be horrid anyway. Well, I'm aware of that. I don't need an accuracy to the last byte. This query doesn't even pay attention to the data alignment padding anyway (I measured some 10% deviation in a case because of this). This request only helps guessing the bloat evolution in Btrees or quickly discover *big* deviations. In many situations, we can not afford a call to pgstattuple.avg_leaf_density(). But this statistic difference between two values with the exact same size is itching me. Sa far, I couldn't find a logical explanation and it just looks like a wrong statistic. Regards, -- Jehan-Guillaume (ioguix) de Rorthais -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general