Re: Finding Duplicate Files

Alan wrote:

(if that makes sense). rsync --compare-dest and --link-dest : fantastic.

I wrote a program MANY years back that searches for duplicate files. (I
had a huge number of files from back in the BBS days that had the same
file but different names.)

Here is how I did it. (This was done using Perl 4.0 originally.)

Recurse through all the directories and build a hash of the file sizes. Go through the hash table and look for collisions. (This prevents you
from doing an MD5SUM on very large files that occur once.)  For each set
of collisions, build a hash table of MD5SUMS (the program now uses
SHA512).  Take any hash collisions and add them to a stack. Prompt the
user what to do with those entries.
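The steps above might look roughly like this in Python. This is only a sketch, not the original Perl program: the function names (file_digest, find_duplicates) are made up for illustration, and instead of prompting the user it simply returns the groups of identical files.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of a file, read in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose size and SHA-512 digest match."""
    # Pass 1: group regular files by size; no file contents are read yet.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files that share a size with at least one other file.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        duplicates.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return duplicates
```

A file whose size is unique is never opened, which is the point of building the size table first.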

There is also another optimization to the above.  The first hash should
only take the first 32k or so.  If there are collisions, then hash the
whole file and check for collisions on those.  This two-pass check speeds
things up a great deal if you have many large files of the same size
(multi-part archives, for example).  Using this method I have removed all
the duplicate files on a terabyte drive in about 3 hours or so.  (Without
the above optimization.)
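A rough Python sketch of that two-pass idea, for a list of files already known to share a size. The names here (PREFIX_BYTES, digest, group_by, duplicates_among) are hypothetical, and the 32 KiB figure is just the "32k or so" mentioned above.

```python
import hashlib
from collections import defaultdict

PREFIX_BYTES = 32 * 1024  # assumed value for "the first 32k or so"


def digest(path, limit=None, chunk_size=1 << 20):
    """SHA-512 of the first `limit` bytes of path, or the whole file if None."""
    h = hashlib.sha512()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            n = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(n)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def group_by(paths, key):
    """Group paths by key(path), keeping only groups with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def duplicates_among(same_size_paths):
    """Given paths already known to share a size, return duplicate groups."""
    result = []
    # Cheap pass: hash only the first PREFIX_BYTES of each candidate.
    for candidates in group_by(same_size_paths, lambda p: digest(p, PREFIX_BYTES)):
        # Expensive pass: hash whole files, but only where the prefixes matched.
        result.extend(group_by(candidates, digest))
    return result
```

For files no larger than the prefix, the second pass just repeats the first; that is redundant but harmless, and the saving comes from large files that differ early.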

I suppose it is a little late to mention this now, but BackupPC (http://backuppc.sourceforge.net/) does this automatically: as it copies files in, it compresses them and eliminates duplicates. If you had used it in the first place instead of an ad-hoc set of copies as backups, you'd have a web browser view of everything in its original locations at each backup interval, while taking up less space than one original copy (depending on the amount of change...).

--
  Les Mikesell
   lesmikesell@xxxxxxxxx

--
fedora-list mailing list
fedora-list@xxxxxxxxxx
To unsubscribe: https://www.redhat.com/mailman/listinfo/fedora-list