The two general solutions are the "keep the last one" proposed by Adrian and the "keep the last N" that I sent.
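For reference, a "keep the last N" delete can be sketched with a window function, along these lines (untested sketch, N = 5, using the cash_journal columns from further down the thread and assuming, as the queries below do, that (fairian_id, click, cash_journal_id) identifies a row):

DELETE FROM cash_journal AS c
USING (
    SELECT fairian_id, click, cash_journal_id,
           row_number() OVER (PARTITION BY fairian_id
                              ORDER BY click DESC, cash_journal_id DESC) AS rn
    FROM cash_journal
) AS ranked
WHERE ranked.fairian_id = c.fairian_id
  AND ranked.click = c.click
  AND ranked.cash_journal_id = c.cash_journal_id
  AND ranked.rn > 5;   -- keep the 5 most recent rows per fairian_id

Changing the condition to rn > 1 collapses it to the "keep the last one" behaviour.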
But it might be worth stepping back a bit. You said you are having performance problems that you feel would be improved by removing only a million rows, which doesn't sound like that much to me. It's less than half of what I *add* to just one of my tables every week, and my database is dwarfed by those of many of the participants on this list.
This suggests that there may be other issues at play, such as tuning, indexing, or query optimization. Depending on your requirements, partitioning might be useful. It wouldn't give you "last N", but it could easily partition by date ranges, which makes archiving and purging a low-cost operation.
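As an illustration of the pattern only (the posting_date column is hypothetical; cash_journal as shown below has no date column), with inheritance-based range partitioning a monthly purge becomes a DROP TABLE:

-- Hypothetical child partition holding one month of rows.
CREATE TABLE cash_journal_2015_12 (
    CHECK (posting_date >= DATE '2015-12-01'
       AND posting_date <  DATE '2016-01-01')
) INHERITS (cash_journal);

-- An INSERT trigger or rule on cash_journal is still needed to route new
-- rows into the right child table; omitted here.

-- Archiving or purging December 2015 later is then a cheap metadata
-- operation rather than a big DELETE:
-- DROP TABLE cash_journal_2015_12;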
You might want to expand a bit on the core issue you are trying to solve.
Cheers,
Steve
On Wed, Dec 9, 2015 at 12:43 PM, David G. Johnston <david.g.johnston@xxxxxxxxx> wrote:
Plan B:

On 12/09/2015 12:24 AM, Berend Tober wrote:
Adrian Klaver wrote:
On 12/05/2015 08:08 AM, Berend Tober wrote:
/*
Deletion Challenge
I want to delete all but the most recent transaction, per person, from a
table that records a transaction history because at some point the
transaction history grows large enough to adversely affect performance,
and also becomes less relevant for retention.
...
test=> delete from cash_journal where ARRAY[click, cash_journal_id]
NOT in (select max(ARRAY[click,cash_journal_id]) from cash_journal
group by fairian_id);
DELETE 7
test=> SELECT * FROM cash_journal order by fairian_id, click, cash_journal_id;
 click | cash_journal_id | fairian_id | debit | credit | balance |           description
-------+-----------------+------------+-------+--------+---------+----------------------------------
   412 |               1 |          7 |     5 |        |      14 | Sold food quantity 7 units.
    37 |               7 |          8 |     8 |        |       8 | Ratified contract f1abd670358e03
    37 |               9 |          9 |     7 |        |       7 | Ratified contract 1574bddb75c78a
    36 |              14 |         18 |     0 |      0 |       0 | initial cash balance
   413 |               1 |         25 |       |    995 |       0 | Redeemed bond 7719a1c782a1ba
(5 rows)
Nice.
The idea of a NOT IN query had occurred to me briefly, but I failed to
pursue it because at some point in the distant past I had gained the
impression that NOT IN queries were not computationally efficient.
During one round of testing I had about a million rows. I'll have to run
some EXPLAIN testing with a larger data sample for comparison.
Thanks!
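For comparison, the same "keep only the latest row per fairian_id" delete can also be written without NOT IN at all; a sketch (not from the original exchange) that PostgreSQL can usually plan as a straightforward join:

-- Sketch only: delete any row for which a newer row exists for the same
-- fairian_id, where "newer" means a larger (click, cash_journal_id) pair.
DELETE FROM cash_journal AS c
WHERE EXISTS (
    SELECT 1
    FROM   cash_journal AS newer
    WHERE  newer.fairian_id = c.fairian_id
    AND    (newer.click, newer.cash_journal_id) > (c.click, c.cash_journal_id)
);

Running EXPLAIN (ANALYZE) on both forms against the million-row sample should show whether the NOT IN plan is actually the problem.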
WITH d AS
(SELECT * FROM
cash_journal
LEFT JOIN
(SELECT
MAX(ARRAY[click,cash_journal_id]) AS mx
FROM
cash_journal
GROUP BY
fairian_id)
AS
mxa
ON
mxa.mx=ARRAY[click, cash_journal_id]
WHERE
mx IS NULL)
DELETE FROM
cash_journal
USING
d
WHERE
d.click = cash_journal.click
AND
d.cash_journal_id = cash_journal.cash_journal_id;

Couldn't the LEFT JOIN relation in the CTE be better written using "SELECT DISTINCT ON (click, cash_journal_id) click, cash_journal_id [...] ORDER BY click DESC, cash_journal_id" or something similar? It doesn't seem like you should need to introduce an array and an aggregate here.

It does have the negative property of only providing a single row, which excludes using it for the "last 5" part, but I suspect it will be considerably faster for the single version.

David J.
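For concreteness, one DISTINCT ON way to express "the most recent row per fairian_id" is sketched below; it is only an illustration and may not match the elided statement quoted above:

-- Sketch: one row per fairian_id, the one with the largest
-- (click, cash_journal_id) pair, without arrays or aggregates.
SELECT DISTINCT ON (fairian_id)
       fairian_id, click, cash_journal_id
FROM   cash_journal
ORDER  BY fairian_id, click DESC, cash_journal_id DESC;

Any of the delete formulations in the thread could then anti-join against that result instead of the MAX(ARRAY[...]) aggregate.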