On 10/17/23 15:48, David Gauthier wrote:
I have a table with millions of records and there are a lot of "almost"
duplicates that I want to get rid of in an efficient way. Best to
illustrate with a simple example...
This sounds like a classic use-case for a "correlated subquery".
Assuming you want to keep the lowest twig_id, you could do this:
```
paul=# select * from twigs;
limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
1 | 1 | 1 | 100
1 | 1 | 2 | 103.7
1 | 1 | 3 | 103.7
1 | 1 | 4 | 110.4
1 | 2 | 1 | 120
1 | 2 | 2 | 123.6
1 | 2 | 3 | 128.1
1 | 2 | 4 | 128.1
1 | 2 | 5 | 128.1
2 | 1 | 1 | 100
2 | 1 | 3 | 104.4
2 | 1 | 4 | 131.9
(12 rows)
Time: 0.218 ms
paul=# delete from twigs as t1
where exists (select 1 from twigs as t2 where (t1.limb_id, t1.branch_id,
t1.surfarea) = (t2.limb_id, t2.branch_id, t2.surfarea) and t1.twig_id >
t2.twig_id);
DELETE 3
Time: 8.555 ms
paul=# select * from twigs;
limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
1 | 1 | 1 | 100
1 | 1 | 2 | 103.7
1 | 1 | 4 | 110.4
1 | 2 | 1 | 120
1 | 2 | 2 | 123.6
1 | 2 | 3 | 128.1
2 | 1 | 1 | 100
2 | 1 | 3 | 104.4
2 | 1 | 4 | 131.9
(9 rows)
Time: 0.231 ms
```
That says "delete from t1 where there is a matching t2 with a lower
twig_id."
If you want to keep the highest-numbered twig_id, the sql is easy to adjust.
Regards,
--
Paul ~{:-)
pj@xxxxxxxxxxxxxxxxxxxxxxxx