Re: rows selectivity overestimate for @> operator for arrays

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, May 27, 2022 at 12:19 PM Alexey Ermakov <alexey.ermakov@xxxxxxxxxxxxx> wrote:
Hello, please look into following example:

postgres=# create table test_array_selectivity as select
array[id]::int[] as a from generate_series(1, 10000000) gs(id);
SELECT 10000000
postgres=# explain analyze select * from test_array_selectivity where a
@> array[1];
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
  Seq Scan on test_array_selectivity  (cost=0.00..198531.00 rows=50000
width=32) (actual time=0.023..2639.029 rows=1 loops=1)
    Filter: (a @> '{1}'::integer[])
    Rows Removed by Filter: 9999999
  Planning Time: 0.078 ms
  Execution Time: 2639.038 ms
(5 rows)


for row estimation rows=50000=10000000*0.005 we are using constant
DEFAULT_CONTAIN_SEL if I'm not mistaken.
and we're using it unless we have something in most_common_elems (MCE)
in statistics which in this case is empty.


This was discussed before at https://www.postgresql.org/message-id/flat/CAMkU%3D1x2W1gpEP3AQsrSA30uxQk1Sau5VDOLL4LkhWLwrOY8Lw%40mail.gmail.com

My solution was to always store at least one element in the MCE, even if the sample size was too small to be reliable.  It would still be more reliable than the alternative fallback assumption.  That patch still applies and fixes your example, or improves it anyway and to an extent directly related to the stats target size. (It also still has my bogus code comments in which I confuse histogram with n_distinct). 

Then some other people proposed more elaborate patches, and I never wrapped my head around what they were doing differently or why the elaboration was important.

Since you're willing to dig into the source code and since this is directly applicable to you, maybe you would be willing to go over to pgsql-hackers to revive, test, and review these proposals with an eye of getting them applied in v16.

I'm not sure if there is a simple fix for this, maybe store and use
something like n_distinct for elements for selectivity estimation ? or
perhaps we should store something in MCE list anyway even if frequency
is low (at least one element) ?

n_distinct might be the best solution, but I don't see how it could be adapted to the general array case.  If it could only work when the vast majority or arrays had length 1, I think that would be too esoteric to be accepted. 

Cheers,

Jeff

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux