Re: Index only scan sometimes switches to sequential scan for small amount of rows

Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxx> · Wed, 25 Mar 2015 13:45:06 +0100

On 25.3.2015 13:04, Feike Steenbergen wrote:
...
> When analyzing pg_stats we have sometimes have the following: (Note:
> 'NOT_YET_PRINTED' has not been found during this analyze, these are
> real values)
> 
>  attname                | status
>  inherited              | f
>  null_frac              | 0
>  avg_width              | 4
>  n_distinct             | 3
>  most_common_vals       | {PRINTED}
>  most_common_freqs      | {0.996567}
>  histogram_bounds       | {PREPARED,ERROR}
>  correlation            | 0.980644
> 
> A question about this specific entry, which some of you may be able to
> shed some light on:
> 
> most_common_vals contains only 1 entry, why is this? I would expect to
> see 3 entries, as it has n_distinct=3

To be included in the MCV list, the value has to actually appear in the
random sample at least twice, IIRC. If the values are very rare (e.g. if
you only have such 10 rows out of 3.5M), that may not happen.

You may try increasing the statistics target for this column, which
should make the sample larger and stats more detailed (max is 10000,
which should use sample ~3M rows, i.e. almost the whole table).

> When looking at
> http://www.postgresql.org/docs/current/static/row-estimation-examples.html
> we can see that an estimate > 5000 is what is to be expected for these
> statistics:
> 
> # select ( (1 - 0.996567)/2 * 3500000 )::int;
>  int4
> ------
>  6008
> (1 row)
> 
> But why does it not record the frequency of 'PREPARED' and 'ERROR'
> in most_common_*?

Can you post results for this query?

SELECT stats, COUNT(*) FROM print_list group by 1

I'd like to know how frequent the other values are.

> 
> Our current strategies in mitigating this problem is decreasing the 
> autovacuum_*_scale_factor for this specific table, therefore 
> triggering more analyses and vacuums.

I'm not sure this is a good solution. The problem is elsewhere, IMHO.

> This is helping somewhat, as if the problem occurs it often solved 
> automatically if autoanalyze analyzes this table, it is analyzed
> many times an hour currently.
> 
> We can also increase the 'Stats target' for this table, which will
> cause the statistics to contain information about 'NOT_YET_PRINTED'
> more often, but even then, it may not find any of these records, as
> they sometimes do not exist.

This is a better solution, IMHO.

> 
> Could you help us to find a strategy to troubleshoot this issue
> further?

You might also make the index scans cheaper, so that the switch to
sequential scan happens later (when more rows are estimated). Try to
decreasing random_page_cost from 4 (default) to 1.5 or something like that.

It may hurt other queries, though, depending on the dataset size etc.

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance