On 10/17/23 17:48, David Gauthier wrote:
v11.5 on linux (I know, I know, IT is working on the upgrade to 15.2)
I have a table with millions of records and there are a lot of "almost" duplicates that I want to get rid of in an efficient way. Best to illustrate with a simple example...
We'll talk about deleting leaves on a tree where each twig can have many leaves, but never 2 or more leaves of the exact same surface area. What I have now are a lot of twigs with leaves having the same surface area (and some different ones too), and I want to get rid of the duplicates for each twig.
create table twigs (limb_id int, branch_id int, twig_id int, surfarea float);

insert into twigs (limb_id, branch_id, twig_id, surfarea) values
(1,1,1,100.0),(1,1,2,103.7),(1,1,3,103.7),(1,1,4,110.4),
(1,2,1,120.0),(1,2,2,123.6),(1,2,3,128.1),(1,2,4,128.1),(1,2,5,128.1),
(2,1,1,100.0),(2,1,3,104.4),(2,1,4,131.9);
You can see the duplicates: twigs 2 and 3 on limb 1 / branch 1 both have surfarea 103.7, and twigs 3, 4 and 5 on limb 1 / branch 2 all have 128.1. I want to get rid of all but one of the dups. Which "twig_id" is left behind doesn't matter.
This would do it...

delete from twigs where limb_id=1 and branch_id=1 and twig_id=3;
delete from twigs where limb_id=1 and branch_id=2 and twig_id in (4,5);
But there are millions of these duplicates and it'll take forever like this.
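For what it's worth, a common single-statement approach in PostgreSQL is to let the server find and delete the dups itself, using the system column ctid to tell otherwise-identical rows apart (a sketch against the example table above, not tuned for your real schema):

```sql
-- Self-join the table to itself: for each (limb_id, branch_id, surfarea)
-- group, keep the row with the lowest ctid and delete the rest.
DELETE FROM twigs t
USING twigs keep
WHERE keep.limb_id   = t.limb_id
  AND keep.branch_id = t.branch_id
  AND keep.surfarea  = t.surfarea
  AND keep.ctid      < t.ctid;   -- t is a duplicate of an earlier row
```

This does the whole cleanup in one statement, so there is no per-row round trip from a client at all; whether it beats a rebuild depends on what fraction of the table is duplicates.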
I was going to approach this with a perl/DBI script: shove the duplicate record identifiers (limb_id, branch_id, twig_id) into perl arrays, then submit the delete command in a prepared statement that accepts the arrays as values, a-la...

my $cnt = $sth->execute_array(
    { ArrayTupleStatus => \my @tuple_status },
    \@limb_id_arr, \@branch_id_arr, \@twig_id_arr
) or die "-F- Failed to execute '$sqf'\n";

But I'm not sure that'll be faster. Depends on how perl/DBI handles it I suppose.
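If you do drive it from a client, one alternative to row-at-a-time execute_array is to send the three key arrays as parameters of a single statement and let unnest() turn them back into rows server-side (a sketch; the $1..$3 array parameters are assumptions about how you'd bind them):

```sql
-- One round trip: three parallel arrays become a derived table of keys,
-- and the join deletes every matching row.
DELETE FROM twigs t
USING unnest($1::int[], $2::int[], $3::int[])
      AS d(limb_id, branch_id, twig_id)
WHERE t.limb_id   = d.limb_id
  AND t.branch_id = d.branch_id
  AND t.twig_id   = d.twig_id;
```

Multi-argument unnest zips the arrays positionally, so the i-th elements of the three arrays form one key tuple.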
Then I was thinking it might just be better to create a parallel table and insert records into it by copying from the table with the dups, taking care to avoid inserting duplicates. Once done, drop the original table and rename the copy to the original's name. If that's a good approach, then how exactly do I articulate the insert statement to avoid the duplicates? This approach might also be good in that I can do it outside a transaction. If the target table gets goofed up, the original data is still OK; no harm done, can try again.
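For the rebuild route, DISTINCT ON is one way to articulate that insert (a sketch; the twigs_dedup name is made up for illustration):

```sql
-- New table with the same column definitions.
CREATE TABLE twigs_dedup (LIKE twigs);

-- DISTINCT ON keeps exactly one row per (limb_id, branch_id, surfarea);
-- the ORDER BY decides which one survives (here, the lowest twig_id).
INSERT INTO twigs_dedup
SELECT DISTINCT ON (limb_id, branch_id, surfarea) *
FROM twigs
ORDER BY limb_id, branch_id, surfarea, twig_id;

-- After verifying the copy:
-- DROP TABLE twigs;
-- ALTER TABLE twigs_dedup RENAME TO twigs;
```

Note CREATE TABLE ... (LIKE ...) copies column definitions but not indexes or constraints unless you ask for them (e.g. LIKE twigs INCLUDING ALL), and any indexes are usually cheaper to build after the bulk insert anyway.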
Any other ideas ?
Again, millions of duplicates and I don't want to overload any PG system stuff in the process.
Thanks in Advance !
--
Born in Arizona, moved to Babylonia.