
Re: Delete/update with limit


 



> How about using the following?
> 
> delete from <table>
> 	where ctid in (select ctid from <table> limit <num>);
> 

I actually checked this out before starting this thread, and the plan
looked like:

> explain delete from my_table where ctid in (select ctid from my_table limit 10);
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Merge IN Join  (cost=101.68..108.03 rows=10 width=6)
   Merge Cond: (public.my_table.ctid = "IN_subquery".ctid)
   ->  Sort  (cost=101.11..104.21 rows=1240 width=6)
         Sort Key: public.my_table.ctid
         ->  Seq Scan on my_table  (cost=0.00..37.40 rows=1240 width=6)
   ->  Sort  (cost=0.57..0.59 rows=10 width=6)
         Sort Key: "IN_subquery".ctid
         ->  Limit  (cost=0.00..0.30 rows=10 width=6)
               ->  Seq Scan on my_table  (cost=0.00..37.40 rows=1240 width=6)
(9 rows)

It looked strange to me, and I thought it wouldn't work too well on a big
table... but it's true that this was only a toy table, so let's try it on a big one:

> explain delete from big_table where ctid in (select ctid from big_table limit 10);
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Merge IN Join  (cost=11086906.66..11404636.41 rows=10 width=60)
   Merge Cond: (public.big_table.ctid = "IN_subquery".ctid)
   ->  Sort  (cost=11086906.26..11245771.06 rows=63545920 width=66)
         Sort Key: public.big_table.ctid
         ->  Seq Scan on big_table  (cost=0.00..834103.20 rows=63545920 width=66)
   ->  Sort  (cost=0.40..0.42 rows=10 width=6)
         Sort Key: "IN_subquery".ctid
         ->  Limit  (cost=0.00..0.13 rows=10 width=6)
               ->  Seq Scan on big_table  (cost=0.00..834103.20 rows=63545920 width=6)
(9 rows)

So, while the batch table is not expected to reach 60M rows, on occasion
it has grown to a few hundred thousand... and in that case the chunking
would slow things down even more.

I guess if the ctid in (...) approach did a better job here, it would be
the best solution.
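
For what it's worth, a variant I haven't benchmarked (so take it as a
sketch, not a tested fix) is to fold the subquery into an array with
ctid = any (...); depending on the version, the planner can sometimes
turn that into a TID scan on the outer table instead of sorting the
whole thing:

	delete from big_table
		where ctid = any (array(select ctid from big_table limit 10));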

Regarding all the other questions: the "other DB" does the trick well
too, without any hidden cost. The whole complicated mechanism is in
place not for cost savings, but because I didn't find any better way to
do it so that concurrent inserts are neither slowed down nor lost... the
problem is that if you want to reliably delete only processed rows, you
must mark them somehow, and that would mean an update plus a delete
later - I figured the delete + trigger + temp table approach would still
be cheaper. The processing code also has to scan the processed chunk
multiple times, so for that purpose it is better to have it in a temp
table as well. And we had to make sure that an accidental second run of
the processor won't corrupt the data either (it has happened before)...
the trigger approach helps there too...
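
For concreteness, a rough sketch of the kind of trigger I mean is below
(batch_table / processed_rows / capture_processed_row are placeholder
names; the temp table has to exist in the session doing the delete):

	-- session-local table the processor can re-scan as often as needed
	create temp table processed_rows (like batch_table);

	create or replace function capture_processed_row() returns trigger as $$
	begin
		-- copy the row being deleted into the session's temp table
		insert into processed_rows select old.*;
		return old;
	end;
	$$ language plpgsql;

	create trigger batch_capture
		before delete on batch_table
		for each row execute procedure capture_processed_row();

With this in place, a chunked "delete ... where ctid in (...)" both
removes the rows and hands them to the processor in one step.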

We have had so many data losses on this processing with different
approaches that I'm starting to get tired of it... and this delete +
trigger + temp table scheme looks to be the one that finally works
correctly, but it gives us performance problems on occasion.

Cheers,
Csaba.



