rows selectivity overestimate for @> operator for arrays

Hello, please look into the following example:

postgres=# create table test_array_selectivity as select array[id]::int[] as a from generate_series(1, 10000000) gs(id);
SELECT 10000000
postgres=# explain analyze select * from test_array_selectivity where a @> array[1];
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on test_array_selectivity  (cost=0.00..198531.00 rows=50000 width=32) (actual time=0.023..2639.029 rows=1 loops=1)
   Filter: (a @> '{1}'::integer[])
   Rows Removed by Filter: 9999999
 Planning Time: 0.078 ms
 Execution Time: 2639.038 ms
(5 rows)


For the row estimate rows=50000 = 10000000 * 0.005 we are using the constant DEFAULT_CONTAIN_SEL, if I'm not mistaken, and we fall back to it unless there is something in most_common_elems (MCE) in the column statistics, which in this case is empty.
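
The MCE list is what the standard pg_stats view exposes as most_common_elems; to check it (after an ANALYZE), something like this, using the table/column names from the example above:

select most_common_elems, most_common_elem_freqs, elem_count_histogram
from pg_stats
where tablename = 'test_array_selectivity' and attname = 'a';

Here it comes back empty.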

If we have something in the MCE list then we could use a much better estimate (https://github.com/postgres/postgres/blob/REL_14_STABLE/src/backend/utils/adt/array_selfuncs.c#L628):
elem_selec = Min(DEFAULT_CONTAIN_SEL, minfreq / 2)

For small tables we could get away with a larger stats target for the column, but in this example even stats target 10000 is not enough.
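
By "larger stats target" I mean the usual commands, nothing special:

alter table test_array_selectivity alter column a set statistics 10000;
analyze test_array_selectivity;

and then re-running the explain analyze from above; in this example the estimate stays at the 0.005 default (rows=50000).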


If I'm reading the sources correctly, an element's frequency in the sample should be more than 0.0063/stats_target for it to make it into the MCE list:
https://github.com/postgres/postgres/blob/REL_14_STABLE/src/backend/utils/adt/array_typanalyze.c#L471

So if we store mostly one element per array and the elements are almost all distinct, then in tables with more than stats_target/0.0063 rows (~1.58M for the maximum stats target 10000) we'll get the 0.005 constant for selectivity, which can be a pretty bad estimate (in a real example it was off by 6-7 orders of magnitude).
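
The ~1.58M figure is just my arithmetic on the cutoff above:

select 10000 / 0.0063;  -- roughly 1587302, i.e. ~1.58M rows for the maximum stats target 10000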

I ran into this issue twice in the last year on two different projects, so perhaps it's not a very rare situation. In the first case increasing the stats target helped; in the second (a table with 150M rows) it didn't, and I had to use hacks to fix the plan.

It was in PostgreSQL 12.x and 14.3.

I'm not sure if there is a simple fix for this. Maybe store and use something like n_distinct for elements for selectivity estimation? Or perhaps we should store something in the MCE list anyway even if the frequency is low (at least one element)?


--

Thanks,

Alexey Ermakov





