Re: Need efficient way to remove (almost) duplicate rows from a table

Paul Jungwirth <pj@xxxxxxxxxxxxxxxxxxxxxxxx> · Tue, 17 Oct 2023 16:09:17 -0700

On 10/17/23 15:48, David Gauthier wrote:
I have a table with millions of records and there are a lot of "almost" 
duplicates that I want to get rid of in an efficient way.  Best to 
illustrate with a simple example...

This sounds like a classic use-case for a "correlated subquery". 
Assuming you want to keep the lowest twig_id, you could do this:

```
paul=# select * from twigs;
 limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
       1 |         1 |       1 |      100
       1 |         1 |       2 |    103.7
       1 |         1 |       3 |    103.7
       1 |         1 |       4 |    110.4
       1 |         2 |       1 |      120
       1 |         2 |       2 |    123.6
       1 |         2 |       3 |    128.1
       1 |         2 |       4 |    128.1
       1 |         2 |       5 |    128.1
       2 |         1 |       1 |      100
       2 |         1 |       3 |    104.4
       2 |         1 |       4 |    131.9
(12 rows)

Time: 0.218 ms
paul=# delete from twigs as t1
where exists (select 1 from twigs as t2 where (t1.limb_id, t1.branch_id, 
t1.surfarea) = (t2.limb_id, t2.branch_id, t2.surfarea) and t1.twig_id > 
t2.twig_id);
DELETE 3
Time: 8.555 ms
paul=# select * from twigs;
 limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
       1 |         1 |       1 |      100
       1 |         1 |       2 |    103.7
       1 |         1 |       4 |    110.4
       1 |         2 |       1 |      120
       1 |         2 |       2 |    123.6
       1 |         2 |       3 |    128.1
       2 |         1 |       1 |      100
       2 |         1 |       3 |    104.4
       2 |         1 |       4 |    131.9
(9 rows)

Time: 0.231 ms
```

That says "delete from t1 where there is a matching t2 with a lower 
twig_id."

If you want to keep the highest-numbered twig_id, the sql is easy to adjust.

Regards,

--
Paul              ~{:-)
pj@xxxxxxxxxxxxxxxxxxxxxxxx