Re: Index only scan sometimes switches to sequential scan for small amount of rows

On Wed, Mar 25, 2015 at 9:07 AM, Feike Steenbergen <feikesteenbergen@xxxxxxxxx> wrote:
I'm posting this as I am trying to understand what has happened.
TLDR: The problem seems to be fixed now.

By bumping the statistics_target we see that most_common_vals is
having its contents filled more often, causing way better estimates:

 attname                | status
 inherited              | f
 null_frac              | 0
 avg_width              | 4
 n_distinct             | 3
 most_common_vals       | {PRINTED,PREPARED,ERROR}
 most_common_freqs      | {0.996863,0.00307333,6.33333e-05}
 histogram_bounds       | (null)
 correlation            | 0.98207
 most_common_elems      | (null)
 most_common_elem_freqs | (null)
 elem_count_histogram   | (null)

Basically 100% of the records are accounted for in these statistics,
so the planner now consistently estimates the number of rows for
other values to be very small.

Before bumping the target we didn't have information for 0.34% of the
rows, which in this case means roughly 11K rows.
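
(For reference, the per-column bump and the pg_stats query that produces
output like the above look roughly like this; the table name print_jobs
is only illustrative, the status column name is the one from the output:)

    -- raise the per-column statistics target and refresh the stats
    -- (table name print_jobs is illustrative)
    ALTER TABLE print_jobs ALTER COLUMN status SET STATISTICS 1000;
    ANALYZE print_jobs;

    -- inspect what the planner now knows about the column
    SELECT attname, n_distinct, most_common_vals, most_common_freqs
      FROM pg_stats
     WHERE tablename = 'print_jobs' AND attname = 'status';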

What is the reasoning behind requiring at least 2 hits before including
a value in the most_common_* columns?

If you sample a small portion of the table, then anything present only once is going to have a huge uncertainty on its estimate.

Consider the consequences of including things sampled only once.  Every value that got sampled appears in the sample at least once, so all of them would make it into most_common_vals, and most_common_freqs would always sum to 100%.  That amounts to declaring that anything not observed in the sample has a frequency of exactly 0%, which is clearly beyond what we have any reasonable evidence to support.

Also, I doubt that is the problem in the first place.  If you collect a sample of 30,000 rows (which the default statistics target of 100 does), and the frequency of the second most common value really is 0.00307333 at the time you sample, you would expect to find it about 92 times in the sample.  The odds against finding it only once, rather than around 92 times, purely due to sampling error are astronomical.
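
As a rough sanity check on those numbers (plain SQL, nothing here comes
from the original thread beyond the two figures quoted above):

    -- expected hits for a value with frequency 0.00307333 in a 30,000-row sample
    SELECT 30000 * 0.00307333 AS expected_hits;   -- about 92

    -- probability of seeing it at most once if that frequency were true:
    -- P(X = 0) + P(X = 1) for Binomial(n = 30000, p = 0.00307333)
    SELECT power(1 - 0.00307333::float8, 30000)
         + 30000 * 0.00307333 * power(1 - 0.00307333::float8, 29999) AS p_at_most_one;
    -- comes out around 1e-38, i.e. astronomically unlikely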

The problem seems to be rapidly changing stats, not too small a target size (unless your original target size was way below the current default value; forgive me if you already reported that, but I didn't see it anywhere).

If you analyze the table at a point when it is 100% PRINTED, there is no way of knowing, based on that analysis alone, what the distribution of != 'PRINTED' values would be, should such values ever arise.

Maybe it would work better if you built the partial index where status = 'NOT_YET_PRINTED', instead of !='PRINTED'.
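
A sketch of the two index definitions, just to make the suggestion concrete
(the table name and the choice of indexed column are illustrative; only the
status values come from the thread):

    -- partial index as described, matching everything that is not PRINTED
    CREATE INDEX print_jobs_unprinted_idx ON print_jobs (status)
      WHERE status <> 'PRINTED';

    -- suggested alternative: name the one value the queries actually look for
    CREATE INDEX print_jobs_not_yet_printed_idx ON print_jobs (status)
      WHERE status = 'NOT_YET_PRINTED';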

Cheers,

Jeff
