Re: Strange avg value size for index on expression in pg_stats

Jehan-Guillaume de Rorthais <ioguix@xxxxxxx> · Mon, 10 Nov 2014 16:41:11 +0100

On Mon, 10 Nov 2014 09:39:23 -0500
Tom Lane <tgl@xxxxxxxxxxxxx> wrote:

> Jehan-Guillaume de Rorthais <ioguix@xxxxxxx> writes:
> > While investigating about a wrong result with my btree bloat estimation
> > query, I found a strange stat deviation between the average size of a value
> > in its table and its average size in one index on the "lower" expression.
> 
> ANALYZE doesn't look at the physical index at all.  For an expression
> index, it will recompute the index expression at each sampled row, and
> then take statistics on those values.  In the case you've got here, it's
> re-running the lower() function and looking at the uncompressed result.
> So that accounts for the discrepancy in average width stats.

Ok, understood.

But in my demo scenario, I used *only* md5 to populate the test table. So
data length is always 32 and the average width length is exact. No possible
deviation, even with lower(). To be quite dense:

  postgres@test=# select 
    length(lower(md5('a'))) = length(md5('b')), 
    length(md5('c')) = length(md5('d'));
  -[ RECORD 1 ]
  ?column? | t
  ?column? | t

And here is another test with a static string for all rows:

  postgres@test=# create table test as
  select '1234567890123456789012'::text as t
  from generate_series(1,100);
  SELECT 100

  postgres@test=# create index ON test (lower(t));
  CREATE INDEX

  postgres@test=# analyze test;
  ANALYZE

  postgres@test=# select tablename, attname, avg_width
  from pg_stats
  where schemaname = 'public';
     tablename    | attname | avg_width 
  ----------------+---------+-----------
   test           | t       |        23
   test_lower_idx | lower   |        26

AFAIU, we should not have a discrepancy here.

> > This tiny difference is the source of a very bad estimation with the
> > Btree bloat estimation query when values are around an alignement
> > boundary.
> 
> TBH, if that query is relying on ANALYZE width estimates to be accurate
> to the last byte, its reliability is going to be horrid anyway.

Well, I'm aware of that. I don't need an accuracy to the last byte. This query
doesn't even pay attention to the data alignment padding anyway (I measured some
10% deviation in a case because of this).

This request only helps guessing the bloat evolution in Btrees or quickly
discover *big* deviations. In many situations, we can not afford a call to
pgstattuple.avg_leaf_density().

But this statistic difference between two values with the exact same size is
itching me. Sa far, I couldn't find a logical explanation and it just looks
like a wrong statistic.

Regards,
-- 
Jehan-Guillaume (ioguix) de Rorthais

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general