row estimate very wrong for array type

Denis de Bernardy <ddebernardy@xxxxxxxxx> · Wed, 4 May 2011 06:40:50 -0700 (PDT)

Hi,

I'm running into erroneous row estimates when using an array type, and I'm running out of ideas on how to steer postgres into the right direction... I've tried setting statistics to 1000, vacuuming and analyzing over and over, rewriting the query differently... to no avail.

The table looks like this:

create table test (
id serial primary key,
sortcol float unique,
intarr1 int[] not null default '{}',
intarr2 int[] not null default '{}'

);

It contains 40k rows of random data which, for the sake of the queries that follow, aren't too far from what they'd contain in production.

# select intarr1, intarr2, count(*) from test group by intarr1, intarr2;

 intarr1 | intarr2 | count 
--------------+-------------+-------
 {0}          | {2,3}       | 40018

The stats seem right:

# select * from pg_stats where tablename = 'test' and attname = 'intarr1';

 schemaname | tablename |   attname    | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation 
------------+-----------+--------------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 test       | test      | intarr1 | f         |         0 |        25 |          1 | {"{0}"}          | {1}               |                  |           1

A query without any condition on the array results in reasonable estimates and the proper use of the index on the sort column:

# explain analyze select * from test order by sortcol limit 10;

                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.00 rows=10 width=217) (actual time=0.098..0.109 rows=10 loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..12019.08 rows=40018 width=217) (actual time=0.096..0.105 rows=10 loops=1)
 Total runtime: 0.200 ms

After adding a condition on the array, however, the row estimate is completely off:

# explain analyze select * from test where intarr1 && '{0,1}' order by sortcol limit 10;

                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..605.96 rows=10 width=217) (actual time=0.131..0.175 rows=10 loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..12119.13 rows=200 width=217) (actual time=0.129..0.169 rows=10 loops=1)
         Filter: (intarr1 && '{0,1}'::integer[])

When there's a condition on both arrays, this then leads to a seq scan:

# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 && '{2,4}' order by sortcol limit 10;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3241.28..3241.29 rows=1 width=217) (actual time=88.260..88.265 rows=10 loops=1)
   ->  Sort  (cost=3241.28..3241.29 rows=1 width=217) (actual time=88.258..88.260 rows=10 loops=1)
         Sort Key: sortcol
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Seq Scan on test  (cost=0.00..3241.27 rows=1 width=217) (actual time=0.169..68.785 rows=40018 loops=1)
               Filter: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && '{2,4}'::integer[]))
 Total runtime: 88.339 ms

Adding a GIN index on the two arrays results in similar ugliness:

# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 && '{2,4}' order by lft limit 10;
                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.29..8.29 rows=1 width=217) (actual time=56.122..56.127 rows=10 loops=1)
   ->  Sort  (cost=8.29..8.29 rows=1 width=217) (actual time=56.120..56.122 rows=10 loops=1)
         Sort Key: sortcol
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Bitmap Heap Scan on test  (cost=4.26..8.28 rows=1 width=217) (actual time=19.635..39.824 rows=40018 loops=1)
               Recheck Cond: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && '{2,4}'::integer[]))
               ->  Bitmap Index Scan on test_intarr1_intarr2_idx  (cost=0.00..4.26 rows=1 width=0) (actual time=19.387..19.387 rows=40018 loops=1)
                     Index Cond: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && '{2,4}'::integer[]))
 Total runtime: 56.210 ms

Might this be a bug in the operator's selectivity, or am I doing something wrong?

Thanks in advance,
Denis

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance