Search Postgresql Archives

Re: How to speed up delete where not in

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 26 September 2015 at 19:53, Andrus <kobruleht2@xxxxxx> wrote:
Hi!


I'm looking for a way to delete records which do not have child rows on big tables where lot of rows needs to be deleted. Both tables have lot of other foreign key references.


Document headers are in omdok table:

   create table omdok ( dokumnr serial primary key, ... );

Document rows are in omrid table

   CREATE TABLE omrid
   (
     id serial NOT NULL,
     reanr serial NOT NULL,
     dokumnr integer NOT NULL,
     CONSTRAINT omrid_pkey PRIMARY KEY (id),
     CONSTRAINT omrid_dokumnr_fkey FOREIGN KEY (dokumnr)
         REFERENCES omdok (dokumnr) MATCH SIMPLE
         ON UPDATE CASCADE ON DELETE CASCADE DEFERRABLE INITIALLY IMMEDIATE,
     ....
   );


I tried

   delete from omdok where dokumnr not in      (select dokumnr from omrid)

Query it is running currently 15 hours and is still running.
postgres.exe is using 50% CPU  all the time (this is 2 core CPU).

   explain delete from omdok where dokumnr not in      (select dokumnr from omrid)

returns:

   "Delete  (cost=0.00..21971079433.34 rows=220815 width=6)"
   "  ->  Seq Scan on omdok  (cost=0.00..21971079433.34 rows=220815 width=6)"
   "        Filter: (NOT (SubPlan 1))"
   "        SubPlan 1"
   "          ->  Materialize  (cost=0.00..94756.92 rows=1897261 width=4)"
   "                ->  Seq Scan on omrid  (cost=0.00..77858.61 rows=1897261 width=4)"

- How to delete parents without child rows fast?

if you write the query as with a NOT EXISTS, rather than a NOT IN() it should complete much more quickly.

It's important to know that the semantics of NOT IN are likely not at all what you think:

For example, in the following query, would you expect it to return the row with a.a = 3 ?

select * from (values(1),(2),(3)) a(a) WHERE a NOT IN(SELECT b FROM (VALUES(NULL),(1),(2)) b(b));

The presence of NULL causes this to not behave the way you might think, yet it works exactly the way the SQL standard dictates.

You could think of this as equivalent to writing:

select * from (values(1),(2),(3)) a(a) WHERE a.a <> NULL AND a.a <> 1 AND a.a <> 2;

And since a.a <> NULL is 'unknown', this causes the entire WHERE clause to be false, therefore nothing matches.

For this reason PostgreSQL does not optimise NOT IN() the same way as it optimises NOT EXISTS().

If you write the query as:

delete from omdok where not exists(select 1 from omrid where omdok.dokumnr = omrid.dokumnr);

then you might see it execute in a few seconds. Perhaps you should consider cancelling the current query, perhaps perform a VACUUM on omdoc after cancelling, and then run the NOT EXISTS version instead.

I mocked up your tables locally, and populated them with the same number of records as your row estimates in the EXPLAIN you pasted and I got:

test=# delete from omdok where not exists (select 1 from omrid where omdok.dokumn = omrid.dokumnr);
DELETE 0
Time: 1698.233 ms

Whereas with the NOT IN() I cancelled it after 10 minutes.

Regards

David Rowley 

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux