The two general solutions are the "keep the last one" proposed by Adrian and the "keep the last N" that I sent.
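For reference, a "keep the last N" delete can be sketched with a window function, along these lines (untested sketch, N = 5, using the cash_journal columns from further down the thread and assuming, as the queries below do, that (fairian_id, click, cash_journal_id) identifies a row):

DELETE FROM cash_journal AS c
USING (
    SELECT fairian_id, click, cash_journal_id,
           row_number() OVER (PARTITION BY fairian_id
                              ORDER BY click DESC, cash_journal_id DESC) AS rn
    FROM cash_journal
) AS ranked
WHERE ranked.fairian_id = c.fairian_id
  AND ranked.click = c.click
  AND ranked.cash_journal_id = c.cash_journal_id
  AND ranked.rn > 5;   -- keep the 5 most recent rows per fairian_id

Changing the condition to rn > 1 collapses it to the "keep the last one" behaviour.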
But it might be worth stepping back a bit. You said you are having performance problems that you feel would be improved by removing only a million rows, which doesn't sound like that much to me. It's less than half of what I *add* to just one of my tables every week, and my database is dwarfed by those of many of the participants on this list.
This suggests that there may be other issues at play, such as tuning, indexing, or query optimization. Depending on your requirements, partitioning might be useful. It wouldn't give you "last N", but it could easily partition by date ranges, which makes archiving and purging a low-cost operation.
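As an illustration of the pattern only (the posting_date column is hypothetical; cash_journal as shown below has no date column), with inheritance-based range partitioning a monthly purge becomes a DROP TABLE:

-- Hypothetical child partition holding one month of rows.
CREATE TABLE cash_journal_2015_12 (
    CHECK (posting_date >= DATE '2015-12-01'
       AND posting_date <  DATE '2016-01-01')
) INHERITS (cash_journal);

-- An INSERT trigger or rule on cash_journal is still needed to route new
-- rows into the right child table; omitted here.

-- Archiving or purging December 2015 later is then a cheap metadata
-- operation rather than a big DELETE:
-- DROP TABLE cash_journal_2015_12;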
You might want to expand a bit on the core issue you are trying to solve.
Cheers,
Steve
On Wed, Dec 9, 2015 at 12:43 PM, David G. Johnston <david.g.johnston@xxxxxxxxx> wrote:
Plan B:

On 12/09/2015 12:24 AM, Berend Tober wrote:
Adrian Klaver wrote:
On 12/05/2015 08:08 AM, Berend Tober wrote:
/*
Deletion Challenge
I want to delete all but the most recent transaction, per person, from a
table that records a transaction history because at some point the
transaction history grows large enough to adversely affect performance,
and also becomes less relevant for retention.
...
test=> delete from cash_journal where ARRAY[click, cash_journal_id]
NOT in (select max(ARRAY[click,cash_journal_id]) from cash_journal
group by fairian_id);
DELETE 7
test=> SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;
 click | cash_journal_id | fairian_id | debit | credit | balance |           description
-------+-----------------+------------+-------+--------+---------+----------------------------------
   412 |               1 |          7 |     5 |        |      14 | Sold food quantity 7 units.
    37 |               7 |          8 |     8 |        |       8 | Ratified contract f1abd670358e03
    37 |               9 |          9 |     7 |        |       7 | Ratified contract 1574bddb75c78a
    36 |              14 |         18 |     0 |      0 |       0 | initial cash balance
   413 |               1 |         25 |       |    995 |       0 | Redeemed bond 7719a1c782a1ba
(5 rows)
Nice.
The idea of a NOT IN query had occurred to me briefly, but I failed to
pursue it because at some point in the distant past I had gained the
impression that NOT IN queries were not computationally efficient.
During one round of testing I had about a million rows. I'll have to run
some EXPLAIN testing with a larger data sample for comparison.
Thanks!
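For comparison, the same "keep only the latest row per fairian_id" delete can also be written without NOT IN at all; a sketch (not from the original exchange) that PostgreSQL can usually plan as a straightforward join:

-- Sketch only: delete any row for which a newer row exists for the same
-- fairian_id, where "newer" means a larger (click, cash_journal_id) pair.
DELETE FROM cash_journal AS c
WHERE EXISTS (
    SELECT 1
    FROM   cash_journal AS newer
    WHERE  newer.fairian_id = c.fairian_id
    AND    (newer.click, newer.cash_journal_id) > (c.click, c.cash_journal_id)
);

Running EXPLAIN (ANALYZE) on both forms against the million-row sample should show whether the NOT IN plan is actually the problem.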
WITH d AS
(SELECT * FROM
cash_journal
LEFT JOIN
(SELECT
MAX(ARRAY[click,cash_journal_id]) AS mx
FROM
cash_journal
GROUP BY
fairian_id)
AS
mxa
ON
mxa.mx=ARRAY[click, cash_journal_id]
WHERE
mx IS NULL)
DELETE FROM
cash_journal
USING
d
WHERE
d.click = cash_journal.click
AND
d.cash_journal_id = cash_journal.cash_journal_id;

Couldn't the LEFT JOIN relation in the CTE be better written using "SELECT DISTINCT ON (click, cash_journal_id) click, cash_journal_id [...] ORDER BY click DESC, cash_journal_id" or something similar? It doesn't seem like you should need to introduce an array and an aggregate here.

It does have the negative property of only providing a single row, which excludes using it for the "last 5" part, but I suspect it will be considerably faster for the single version.

David J.
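For concreteness, one DISTINCT ON way to express "the most recent row per fairian_id" is sketched below; it is only an illustration and may not match the elided statement quoted above:

-- Sketch: one row per fairian_id, the one with the largest
-- (click, cash_journal_id) pair, without arrays or aggregates.
SELECT DISTINCT ON (fairian_id)
       fairian_id, click, cash_journal_id
FROM   cash_journal
ORDER  BY fairian_id, click DESC, cash_journal_id DESC;

Any of the delete formulations in the thread could then anti-join against that result instead of the MAX(ARRAY[...]) aggregate.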