Search Postgresql Archives

Re: Need efficient way to remove (almost) duplicate rows from a table

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10/17/23 15:48, David Gauthier wrote:
I have a table with millions of records and there are a lot of "almost" duplicates that I want to get rid of in an efficient way.  Best to illustrate with a simple example...

This sounds like a classic use-case for a "correlated subquery". Assuming you want to keep the lowest twig_id, you could do this:

```
paul=# select * from twigs;
 limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
       1 |         1 |       1 |      100
       1 |         1 |       2 |    103.7
       1 |         1 |       3 |    103.7
       1 |         1 |       4 |    110.4
       1 |         2 |       1 |      120
       1 |         2 |       2 |    123.6
       1 |         2 |       3 |    128.1
       1 |         2 |       4 |    128.1
       1 |         2 |       5 |    128.1
       2 |         1 |       1 |      100
       2 |         1 |       3 |    104.4
       2 |         1 |       4 |    131.9
(12 rows)

Time: 0.218 ms
paul=# delete from twigs as t1
where exists (select 1 from twigs as t2 where (t1.limb_id, t1.branch_id, t1.surfarea) = (t2.limb_id, t2.branch_id, t2.surfarea) and t1.twig_id > t2.twig_id);
DELETE 3
Time: 8.555 ms
paul=# select * from twigs;
 limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
       1 |         1 |       1 |      100
       1 |         1 |       2 |    103.7
       1 |         1 |       4 |    110.4
       1 |         2 |       1 |      120
       1 |         2 |       2 |    123.6
       1 |         2 |       3 |    128.1
       2 |         1 |       1 |      100
       2 |         1 |       3 |    104.4
       2 |         1 |       4 |    131.9
(9 rows)

Time: 0.231 ms
```

That says "delete from t1 where there is a matching t2 with a lower twig_id."

If you want to keep the highest-numbered twig_id, the sql is easy to adjust.

Regards,

--
Paul              ~{:-)
pj@xxxxxxxxxxxxxxxxxxxxxxxx





[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux