Suboptimal query plan fixed by replacing OR with UNION

Steven Schlansker <steven@xxxxxxxxxxxx> · Thu, 5 Jul 2012 14:57:49 -0700

Hi all,

I have a query that the planner optimizes very differently depending on whether it is written with an OR clause or as a UNION.

I believe the two forms should return the same results. Even if I've missed something small (e.g. NULL handling), I don't think that would justify the plan the optimizer ends up with.

First, the table:

public | account                | table    | staging | 7719 MB    | 
                   Table "public.account"
      Column       |            Type             | Modifiers 
-------------------+-----------------------------+-----------
 id                | uuid                        | not null
 source            | character varying           | not null
 source_id         | character varying           | not null
 name              | character varying           | 
 email             | character varying           | 
 photo_url         | character varying           | 
 user_id           | uuid                        | 
 creation_date     | timestamp without time zone | not null
 modification_date | timestamp without time zone | not null
 first_linked_date | timestamp without time zone | 
Indexes:
    "account_pkey" PRIMARY KEY, btree (id)
    "account_source_id_idx" UNIQUE, btree (source, source_id)
    "account_id_user_id_idx" btree (id, user_id)
    "account_user_id_idx" btree (user_id)
    "ness_user_email_idx" btree (email)

Some abbreviated statistics (all queries below were planned after the ANALYZE run whose output is shown here):

INFO:  analyzing "public.account"
INFO:  "account": scanned 30000 of 987795 pages, containing 742040 live rows and 1932 dead rows; 30000 rows in sample, 24102216 estimated total rows

      attname      | null_frac | avg_width | n_distinct | correlation 
-------------------+-----------+-----------+------------+-------------
 email             |    0.9987 |        22 |         -1 |   -0.100607
 creation_date     |         0 |         8 |         -1 |    0.679791
 first_linked_date |         1 |         8 |          0 |            
 id                |         0 |        16 |         -1 |    0.680173
 source_id         |         0 |        11 |  -0.949212 |  -0.0792623
 user_id           |    0.9956 |        16 |        129 |  -0.0797483
 source            |         0 |         8 |          6 |  -0.0118729
 modification_date |         0 |         8 |         -1 |     0.93162
 name              |  0.170433 |        14 |     135438 |   -0.005636
 photo_url         |         0 |        49 |     180319 |    0.172699

FWIW, the "estimated total rows" is within 0.01% of the true value.

Now, the problematic query:

SELECT * FROM account
  WHERE user_id in 
    (SELECT user_id FROM account 
      WHERE id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'))
  OR
    id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}');

This query gives the plan:

 Seq Scan on account  (cost=29.59..1379485.60 rows=12051109 width=160)
   Filter: ((hashed SubPlan 1) OR (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])))
   SubPlan 1
     ->  Bitmap Heap Scan on account  (cost=17.56..29.58 rows=3 width=16)
           Recheck Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
           ->  Bitmap Index Scan on account_id_user_id_idx  (cost=0.00..17.56 rows=3 width=0)
                 Index Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
(7 rows)

I can't imagine why it picks a sequential scan.  Besides the ridiculous estimate, it takes most of a minute to finish.
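
To put a number on how far off that row estimate is, here is a small Python sketch. It uses only figures copied from the ANALYZE output and the plan above; the variable names are mine.

```python
# Figures copied from the ANALYZE output and the Seq Scan plan above.
estimated_total_rows = 24102216   # "estimated total rows" reported by ANALYZE
seq_scan_row_estimate = 12051109  # rows= on the Seq Scan node of the OR plan

# Fraction of the table the planner expects the OR query to return.
fraction = seq_scan_row_estimate / estimated_total_rows
print(f"planner expects {fraction:.4%} of the table")
```

In other words, the OR form is estimated to return almost exactly half of the table, while the two halves planned separately below are estimated at 2410 rows and 3 rows.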

Running either half of the query on its own yields a very reasonable plan:

ness_user=# explain SELECT * FROM account WHERE 
ness_user-# user_id in (SELECT user_id FROM account WHERE id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'));
                                                                               QUERY PLAN                                                                                
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=66.79..3228.18 rows=2410 width=160)
   ->  HashAggregate  (cost=29.59..29.60 rows=1 width=16)
         ->  Bitmap Heap Scan on account  (cost=17.56..29.58 rows=3 width=16)
               Recheck Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
               ->  Bitmap Index Scan on account_id_user_id_idx  (cost=0.00..17.56 rows=3 width=0)
                     Index Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
   ->  Bitmap Heap Scan on account  (cost=37.20..3188.55 rows=803 width=160)
         Recheck Cond: (user_id = public.account.user_id)
         ->  Bitmap Index Scan on account_user_id_idx  (cost=0.00..37.00 rows=803 width=0)
               Index Cond: (user_id = public.account.user_id)
(10 rows)

ness_user=# explain SELECT * FROM account WHERE id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}');
                                                                         QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on account  (cost=17.56..29.58 rows=3 width=160)
   Recheck Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
   ->  Bitmap Index Scan on account_id_user_id_idx  (cost=0.00..17.56 rows=3 width=0)
         Index Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
(4 rows)

(where "reasonable" is defined as "not a sequential scan")

Upon seeing this, I had a crazy idea.  What if I just paste them together with a UNION DISTINCT?

ness_user=# explain SELECT * FROM account WHERE 
ness_user-# user_id in (SELECT user_id FROM account WHERE id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}')) UNION DISTINCT
ness_user-# SELECT * FROM account WHERE
ness_user-# id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}');
                                                                                     QUERY PLAN                                                                                      
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=3342.22..3366.35 rows=2413 width=160)
   ->  Append  (cost=66.79..3281.90 rows=2413 width=160)
         ->  Nested Loop  (cost=66.79..3228.18 rows=2410 width=160)
               ->  HashAggregate  (cost=29.59..29.60 rows=1 width=16)
                     ->  Bitmap Heap Scan on account  (cost=17.56..29.58 rows=3 width=16)
                           Recheck Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
                           ->  Bitmap Index Scan on account_id_user_id_idx  (cost=0.00..17.56 rows=3 width=0)
                                 Index Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
               ->  Bitmap Heap Scan on account  (cost=37.20..3188.55 rows=803 width=160)
                     Recheck Cond: (user_id = public.account.user_id)
                     ->  Bitmap Index Scan on account_user_id_idx  (cost=0.00..37.00 rows=803 width=0)
                           Index Cond: (user_id = public.account.user_id)
         ->  Bitmap Heap Scan on account  (cost=17.56..29.58 rows=3 width=160)
               Recheck Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
               ->  Bitmap Index Scan on account_id_user_id_idx  (cost=0.00..17.56 rows=3 width=0)
                     Index Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))
(16 rows)

Wow!  Changing the query from an OR clause to a UNION DISTINCT of two SELECTs reduced the estimated cost from 1379485.60 to 3366.35!  And the gains also show up when the query is actually executed.
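
As a sanity check on my belief that the two forms return the same rows, including when user_id is NULL, here is a toy reproduction in Python. Everything in it is an assumption made for illustration: it uses SQLite instead of PostgreSQL, integer ids instead of uuids, IN (...) instead of = ANY('{...}'), and a six-row table with only the two columns the WHERE clauses touch. It says nothing about planning, only about results.

```python
import sqlite3

# Toy stand-in for the real table (hypothetical data, not from production).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany(
    "INSERT INTO account (id, user_id) VALUES (?, ?)",
    [
        (1, 10),    # listed id, linked to user 10
        (2, 10),    # not listed, but shares user 10 with id 1
        (3, None),  # listed id with a NULL user_id
        (4, None),  # not listed, NULL user_id: must not match
        (5, 20),    # unrelated row
        (6, 30),    # listed id, only account for user 30
    ],
)

or_form = """
    SELECT * FROM account
    WHERE user_id IN (SELECT user_id FROM account WHERE id IN (1, 3, 6))
       OR id IN (1, 3, 6)
"""

# SQLite only accepts plain UNION, which removes duplicates just like
# UNION DISTINCT does in PostgreSQL.
union_form = """
    SELECT * FROM account
    WHERE user_id IN (SELECT user_id FROM account WHERE id IN (1, 3, 6))
    UNION
    SELECT * FROM account
    WHERE id IN (1, 3, 6)
"""

# Sort by id so the comparison does not depend on the order rows come back in.
or_rows = sorted(conn.execute(or_form).fetchall(), key=lambda r: r[0])
union_rows = sorted(conn.execute(union_form).fetchall(), key=lambda r: r[0])
print(or_rows == union_rows, [r[0] for r in or_rows])
```

Both forms return ids 1, 2, 3 and 6 on this data. Because id is the primary key, every SELECT * row is already unique, so the duplicate removal done by the UNION only merges rows matched by both branches and cannot drop distinct accounts.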

Why is using an OR so awful here?  Why does it pick a sequential scan?  Is this an optimizer bug or have I missed something in my queries?

Thanks much for any advice,
Steven Schlansker
