Re: Curious why planner can't handle NOT IN

pradeep singh <pradeepsingh1988@xxxxxxxxx> · Sat, 4 May 2013 00:30:12 +0900

On May 3, 2013 12:15 PM, "Yang Zhang" <yanghatespam@xxxxxxxxx> wrote:

I have an `account` table with 5.3M rows, with primary key `id` of

type `text` (and 600+ columns if that matters).

I'm trying to create a `newaccount` table with the same schema but

600k newly imported rows, then insert all the old rows for which `id`

isn't already in the set of newly updated rows.

=> create table newaccount as select * from account limit 0;

=> \copy newaccount from stdin with (format csv)

[...copy 600k rows in...]

=> analyze newaccount;

The most obvious query doesn't work so hot due to repeated execution

of the subplan:

=> explain insert into newaccount select * from account where id not

in (select id from account);

                                       QUERY PLAN

-----------------------------------------------------------------------------------------

 Insert on newaccount  (cost=0.00..458800400224.89 rows=2985976 width=5366)

   ->  Seq Scan on account  (cost=0.00..458800400224.89 rows=2985976 width=5366)

         Filter: (NOT (SubPlan 1))

         SubPlan 1

           ->  Materialize  (cost=0.00..153276.00 rows=150000 width=32)

                 ->  Seq Scan on newaccount  (cost=0.00..151500.00

rows=150000 width=32)

(6 rows)

This works fine, though:

=> explain insert into newaccount select * from account where id in

(select id from account except select id from newaccount);

                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------

 Insert on newaccount  (cost=3749026.24..14034547.77 rows=2985976 width=5366)

   ->  Hash Join  (cost=3749026.24..14034547.77 rows=2985976 width=5366)

         Hash Cond: (account.id = "ANY_subquery".id)

         ->  Seq Scan on account  (cost=0.00..2263744.51 rows=5971951

width=5366)

         ->  Hash  (cost=3633552.85..3633552.85 rows=5971951 width=32)

               ->  Subquery Scan on "ANY_subquery"

(cost=3543223.59..3633552.85 rows=5971951 width=32)

                     ->  SetOp Except  (cost=3543223.59..3573833.34

rows=5971951 width=19)

                           ->  Sort  (cost=3543223.59..3558528.47

rows=6121951 width=19)

                                 Sort Key: "*SELECT* 1".id

                                 ->  Append  (cost=0.00..2476464.02

rows=6121951 width=19)

                                       ->  Subquery Scan on "*SELECT*

1"  (cost=0.00..2323464.02 rows=5971951 width=19)

                                             ->  Seq Scan on account

(cost=0.00..2263744.51 rows=5971951 width=19)

                                       ->  Subquery Scan on "*SELECT*

2"  (cost=0.00..153000.00 rows=150000 width=32)

                                             ->  Seq Scan on

newaccount  (cost=0.00..151500.00 rows=150000 width=32)

(14 rows)

This is all in PG 9.1.

This isn't a big deal as there's a straightforward workaround, but I

am curious what happened here.  Googling turns up various mentions of

"NOT IN" with poor plans that involve subplans.  Then again I have

read mention of hash anti-join which seems appropriate here(?), but

that wasn't used here (nor was the latter join used, though for

whatever reason it looks like a more complex/deeper plan tree than I

had expected, so maybe it was out of the plan generator's reach?).

E.g. the following mentions cranking up work_mem, but I probably can't

crank up work_mem to meet the requirements of this example, and even

if it doesn't fit in memory, it'd be nice for the planner to not

degenerate to a pathological plan and still execute this join

efficiently while spilling to and from disk.

http://stackoverflow.com/questions/7125291/postgresql-not-in-versus-except-performance-difference-edited-2

Thanks!

--

Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)

To make changes to your subscription:

http://www.postgresql.org/mailpref/pgsql-general