Re: diff or deduplicate two volumes with different folder structures

> On Tue, Sep 20, 2016 at 10:52:10PM +0200, Ahmad Samir wrote:
>> One last try (sometimes an issue nags):
>> $ find A -exec md5sum '{}' + > a-md5
>> $ find B -exec md5sum '{}' + > b-md5
>> $ cat a-md5 b-md5 > All
>> $ sort -u -k 1,1 All > dupes
>>
>> Now, (I hopefully got my head around it this time...), the dupes file
>> should contain a list of files that exist in _both_ A and B; but every
>> two files that have the same md5sum will have _only one_ of them
>> listed (either in A OR B). So if you delete that list of files you
>> should end up with only unique files in both locations.
>
> At the start ISTR you said the two directory trees were different.
> I took that to mean that two files with identical contents could
> be in different directories within the two trees.
>
> If I was wrong in that assumption and each pair of identical
> files would be in the same relative path I have two suggestions.
>
> 1. Sort a-md5 and b-md5, then use the comm(1) command.  It will
>    print lines only in a-md5, lines only in b-md5, and lines in
>    both files, indented with 0, 1, or 2 tabs respectively.  You
>    can also use options to get the 3 columns individually.  To do
>    this you would have to cd into A or B and run the find cmds
>    as "find .", not "find A" or "find B".
>
> 2. Get a copy of an old program called dircmp* and run it on the
>    two trees directly.  It will output files only in tree A,
>    only in tree B, then output files in both noting whether
>    they are the same or different contents.
>
> I don't have the compiled version of dircmp, but I have a ksh
> shell script version that is quite similar.
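
For what it's worth, suggestion 1 can be sketched like this.  The
tree names A and B and the sample files are made up for
illustration, and I've substituted sha256sum for md5sum:

```shell
# Toy demo of the comm(1) approach.  Running find from *inside* each
# tree makes the relative paths comparable; comm then splits the two
# sorted checksum lists three ways.
mkdir -p A/sub B/sub
echo shared > A/sub/file1; echo shared > B/sub/file1  # identical pair
echo a-only > A/file2                                 # only under A

( cd A && find . -type f -exec sha256sum '{}' + | sort ) > a-sums
( cd B && find . -type f -exec sha256sum '{}' + | sort ) > b-sums

comm -12 a-sums b-sums > in-both    # same relative path, same checksum
comm -23 a-sums b-sums > only-in-a
comm -13 a-sums b-sums > only-in-b
```

Since each line is "checksum  ./relative/path", two files only land
in the "in-both" column when both the path and the contents match,
which is exactly the identical-relative-path assumption above.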

Don't use MD5. It is cryptographically broken, so two different files
can end up with the same MD5 sum. (SHA-256 is a good choice; how
strong a hash you need depends on just how much you are comparing.)

What I use is a Perl script that takes the directories I want to
dedupe and builds a hash table of all the file sizes. I then go
through that table and ignore any size that maps to only one file.
For the files that share a size, I build a second table keyed on
their SHA-256 sums. (I plan on adding a preprocessing pass that
hashes only the first 16k or so, to cheaply weed out large files
that are actually different.) Any pair that matches on both file
size and SHA-256 hash goes into a queue to process later.

Sounds a bit complex, but it works pretty well. Depending on the number of
actual matches, you can go through a few terabytes in a short period of
time.
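
In shell, the size-first filter might look roughly like this (my
real tool is the Perl script described above; the demo directory and
its files here are invented):

```shell
# Sketch of the size-then-hash idea: only files that share a size with
# at least one other file get hashed, and only matching hashes count
# as duplicates.
mkdir -p demo
printf 'aaaa' > demo/f1; printf 'aaaa' > demo/f2  # true duplicates
printf 'bbbb' > demo/f3                           # same size, different data
printf 'cc'   > demo/f4                           # unique size, never hashed

# Pass 1: keep only paths whose size occurs more than once.
find demo -type f -printf '%s\t%p\n' \
  | awk -F'\t' '{n[$1]++; sz[NR]=$1; p[NR]=$2}
                END {for (i = 1; i <= NR; i++) if (n[sz[i]] > 1) print p[i]}' \
  > candidates

# Pass 2: SHA-256 the survivors; a hash seen twice marks a duplicate set.
xargs -d '\n' sha256sum < candidates | sort > candidate-sums
awk '{print $1}' candidate-sums | uniq -d > dup-hashes
grep -F -f dup-hashes candidate-sums > duplicates
```

The 16k pre-pass mentioned above would slot in between the two
steps (hash only the head of each candidate, e.g. via
head -c 16384), so big files that differ early never get a full
read.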

I hope that makes sense.




_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx


