Hi all, I have a query which is being optimized very differently depending on whether it is written using an OR clause or a UNION clause. I believe that the query results should be the same, and even if I've missed something with regards to something small (e.g. NULL handling) I do not believe that it's a good excuse for the plan the optimizer ends up with. First, the table: public | account | table | staging | 7719 MB | Table "public.account" Column | Type | Modifiers -------------------+-----------------------------+----------- id | uuid | not null source | character varying | not null source_id | character varying | not null name | character varying | email | character varying | photo_url | character varying | user_id | uuid | creation_date | timestamp without time zone | not null modification_date | timestamp without time zone | not null first_linked_date | timestamp without time zone | Indexes: "account_pkey" PRIMARY KEY, btree (id) "account_source_id_idx" UNIQUE, btree (source, source_id) "account_id_user_id_idx" btree (id, user_id) "account_user_id_idx" btree (user_id) "ness_user_email_idx" btree (email) Some abbreviated statistics (all queries below were planned after running this ANALYZE statement): INFO: analyzing "public.account" INFO: "account": scanned 30000 of 987795 pages, containing 742040 live rows and 1932 dead rows; 30000 rows in sample, 24102216 estimated total rows attname | null_frac | avg_width | n_distinct | correlation -------------------+-----------+-----------+------------+------------- email | 0.9987 | 22 | -1 | -0.100607 creation_date | 0 | 8 | -1 | 0.679791 first_linked_date | 1 | 8 | 0 | id | 0 | 16 | -1 | 0.680173 source_id | 0 | 11 | -0.949212 | -0.0792623 user_id | 0.9956 | 16 | 129 | -0.0797483 source | 0 | 8 | 6 | -0.0118729 modification_date | 0 | 8 | -1 | 0.93162 name | 0.170433 | 14 | 135438 | -0.005636 photo_url | 0 | 49 | 180319 | 0.172699 FWIW, the "estimated total rows" is within 0.01% of the true value. Now, the problematic query: SELECT * FROM account WHERE user_id in (SELECT user_id FROM account WHERE id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}')) OR id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'); This query gives the plan: Seq Scan on account (cost=29.59..1379485.60 rows=12051109 width=160) Filter: ((hashed SubPlan 1) OR (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[]))) SubPlan 1 -> Bitmap Heap Scan on account (cost=17.56..29.58 rows=3 width=16) Recheck Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])) -> Bitmap Index Scan on account_id_user_id_idx (cost=0.00..17.56 rows=3 width=0) Index Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])) (7 rows) I can't imagine why it picks a sequential scan. Besides the ridiculous estimate, it takes most of a minute to finish. Running either query independently comes to a very reasonable plan: ness_user=# explain SELECT * FROM account WHERE ness_user-# user_id in (SELECT user_id FROM account WHERE id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}')); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=66.79..3228.18 rows=2410 width=160) -> HashAggregate (cost=29.59..29.60 rows=1 width=16) -> Bitmap Heap Scan on account (cost=17.56..29.58 rows=3 width=16) Recheck Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])) -> Bitmap Index Scan on account_id_user_id_idx (cost=0.00..17.56 rows=3 width=0) Index Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])) -> Bitmap Heap Scan on account (cost=37.20..3188.55 rows=803 width=160) Recheck Cond: (user_id = public.account.user_id) -> Bitmap Index Scan on account_user_id_idx (cost=0.00..37.00 rows=803 width=0) Index Cond: (user_id = public.account.user_id) (10 rows) ness_user=# explain SELECT * FROM account WHERE id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on account (cost=17.56..29.58 rows=3 width=160) Recheck Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])) -> Bitmap Index Scan on account_id_user_id_idx (cost=0.00..17.56 rows=3 width=0) Index Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])) (4 rows) (where "reasonable" is defined as "not a sequential scan") Upon seeing this -- I had a crazy idea. What if I just paste them together with a UNION DISTINCT? ness_user=# explain SELECT * FROM account WHERE ness_user-# user_id in (SELECT user_id FROM account WHERE id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}')) UNION DISTINCT ness_user-# SELECT * FROM account WHERE ness_user-# id = ANY('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=3342.22..3366.35 rows=2413 width=160) -> Append (cost=66.79..3281.90 rows=2413 width=160) -> Nested Loop (cost=66.79..3228.18 rows=2410 width=160) -> HashAggregate (cost=29.59..29.60 rows=1 width=16) -> Bitmap Heap Scan on account (cost=17.56..29.58 rows=3 width=16) Recheck Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])) -> Bitmap Index Scan on account_id_user_id_idx (cost=0.00..17.56 rows=3 width=0) Index Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])) -> Bitmap Heap Scan on account (cost=37.20..3188.55 rows=803 width=160) Recheck Cond: (user_id = public.account.user_id) -> Bitmap Index Scan on account_user_id_idx (cost=0.00..37.00 rows=803 width=0) Index Cond: (user_id = public.account.user_id) -> Bitmap Heap Scan on account (cost=17.56..29.58 rows=3 width=160) Recheck Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])) -> Bitmap Index Scan on account_id_user_id_idx (cost=0.00..17.56 rows=3 width=0) Index Cond: (id = ANY ('{00000000-02f6-379d-c000-000000026810,00000000-0320-b467-c000-000000026810,00000000-000d-cefb-c000-000000026810}'::uuid[])) (16 rows) Wow! Changing the query from using an OR clause to a UNION DISTINCT with two SELECTs reduced the cost from 1379485.60 to 3366.35! And the gains are realized when you actually execute the query. Why is using an OR so awful here? Why does it pick a sequential scan? Is this an optimizer bug or have I missed something in my queries? Thanks much for any advice, Steven Schlansker -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general