Hello everybody!
I have found a performance issue: two equivalent queries consistently
take noticeably different (~2x) time to finish. In a few words it can be
described like this: if you have a lot of values in an IN() list, you
should put the heaviest ids (the ones matching the most rows) first.
This is mostly a bug report rather than a request for help.
So this is what I have:
- RHEL
- PostgreSQL 8.3.1
- A table ext_feeder_item with ~4.6M records.
kia=# \d+ ext_feeder_item;
                                   Table "public.ext_feeder_item"
  Column  |           Type           |                           Modifiers                           | Description
----------+--------------------------+---------------------------------------------------------------+-------------
 id       | bigint                   | not null default nextval('ext_feeder_item_id_seq'::regclass)  |
 feed_id  | bigint                   | not null                                                       |
 pub_date | timestamp with time zone |                                                                |
Indexes:
    "ext_feeder_item_pkey" PRIMARY KEY, btree (id)
    "ext_feeder_item_feed_id_pub_date_idx" btree (feed_id, pub_date)
    "ext_feeder_item_idx" btree (feed_id)
Triggers:
    ....
Has OIDs: no
- The statistics targets for the columns feed_id and pub_date are set to 1000;
- The table has just been vacuumed and analyzed.
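For reference, the statistics targets and the maintenance pass boil down
to the standard commands, roughly:

ALTER TABLE ext_feeder_item ALTER COLUMN feed_id  SET STATISTICS 1000;
ALTER TABLE ext_feeder_item ALTER COLUMN pub_date SET STATISTICS 1000;
VACUUM ANALYZE ext_feeder_item;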
- A simple query to the table:
SELECT
    id
FROM
    ext_feeder_item AS i
WHERE
    i.feed_id IN (...)
ORDER BY pub_date DESC, id DESC
LIMIT 11 OFFSET 0;
with many (~1200) ids in the IN() statement.
- The distribution of row counts for these ids (they may be thought of
as foreign keys in this table) is the following:
  id = 54461: ~180000 rows - actually the heaviest id in the whole table;
  other ids: a single id matches at most 2032 rows; 6036 rows in total.
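For what it's worth, such a distribution can be checked with a simple
aggregate along these lines (the LIMIT is only to keep the output short):

SELECT
    feed_id, count(*) AS rows_per_feed
FROM
    ext_feeder_item
GROUP BY
    feed_id
ORDER BY rows_per_feed DESC
LIMIT 20;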
- If I perform the query with
  IN(54461, ...)
it consistently (5 attempts) takes 4.5..4.7 seconds:
QUERY PLAN
 Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=4585.420..4585.452 rows=11 loops=1)
   ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=4585.415..4585.425 rows=11 loops=1)
         Sort Key: pub_date, id
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Bitmap Heap Scan on ext_feeder_item i  (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=894.622..4260.441 rows=185625 loops=1)
               Recheck Cond: (feed_id = ANY ('{54461, ...}'::bigint[]))
               ->  Bitmap Index Scan on ext_feeder_item_idx  (cost=0.00..13678.10 rows=617228 width=0) (actual time=884.686..884.686 rows=185625 loops=1)
                     Index Cond: (feed_id = ANY ('{54461, ...}'::bigint[]))
 Total runtime: 4585.852 ms
- If I perform the query with
  IN(..., 54461)
it consistently (5 attempts) takes 9.3..9.5 seconds:
QUERY PLAN
 Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=9330.267..9330.298 rows=11 loops=1)
   ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=9330.263..9330.273 rows=11 loops=1)
         Sort Key: pub_date, id
         Sort Method:  top-N heapsort  Memory: 17kB
         ->  Bitmap Heap Scan on ext_feeder_item i  (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=1018.401..8971.029 rows=185625 loops=1)
               Recheck Cond: (feed_id = ANY ('{... ,54461}'::bigint[]))
               ->  Bitmap Index Scan on ext_feeder_item_idx  (cost=0.00..13678.10 rows=617228 width=0) (actual time=1008.791..1008.791 rows=185625 loops=1)
                     Index Cond: (feed_id = ANY ('{... ,54461}'::bigint[]))
 Total runtime: 9330.729 ms
I don't know what the root of the problem is, but I think some
symptomatic treatment could be applied: PostgreSQL could sort the ids in
the IN() list according to the statistics.
So for now I tend to select the ids from another table, ordering them by
their weights; that is easy for me thanks to denormalization (a sketch
follows below).
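A minimal sketch of that workaround, assuming a denormalized feeds table
(the ext_feeder_feed name and its items_count column are only an
illustration, not my exact schema); the ids come back pre-ordered by
weight and are then pasted into the IN() list in that order:

SELECT
    id
FROM
    ext_feeder_feed       -- hypothetical denormalized table
ORDER BY
    items_count DESC;     -- heaviest feeds first, so they lead the IN() list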
Also, I would expect PostgreSQL to sort the values itself to make the
index scan more sequential, but that expectation already conflicts with
the bug described above :)