> On Tue, Sep 20, 2016 at 10:52:10PM +0200, Ahmad Samir wrote:
>> One last try (sometimes an issue nags):
>> $ find A -exec md5sum '{}' + > a-md5
>> $ find B -exec md5sum '{}' + > b-md5
>> $ cat a-md5 b-md5 > All
>> $ sort -u -k 1,1 All > dupes
>>
>> Now (I hopefully got my head around it this time...), the dupes file
>> should contain a list of the files that exist in _both_ A and B; but of
>> every two files that have the same md5sum, _only one_ will be listed
>> (either the one in A OR the one in B). So if you delete that list of
>> files you should end up with only unique files in both locations.
>
> At the start ISTR you said the two directory trees were different.
> I took that to mean that two files with identical contents could
> be in different directories within the two trees.
>
> If I was wrong in that assumption, and each pair of identical
> files would be at the same relative path, I have two suggestions.
>
> 1. Sort a-md5 and b-md5 and use the comm(1) command. It will give the
>    lines only in a-md5, the lines only in b-md5, and the lines in both
>    files, indented with 0, 1, or 2 tabs respectively. You can also use
>    options to get the three columns individually. To do this you would
>    have to cd into A or B and run the find commands as "find .", not
>    "find A" or "find B".
>
> 2. Get a copy of an old program called dircmp* and run it on the two
>    trees directly. It will output the files only in tree A, then the
>    files only in tree B, then the files in both, noting whether their
>    contents are the same or different.
>
> I don't have the compiled version of dircmp, but I have a ksh
> shell script version that is quite similar.

Don't use MD5; you risk unintentional file collisions. (SHA-256 is
good. It depends on just how much you are comparing.)

What I use is a Perl script that takes the directories I want to dedupe
and builds a hash table of all the file sizes. I then go through that
table and ignore any size that has only one file. Once I have a list of
files with the same size, I build a hash table of the SHA-256 sums for
those files. (I plan on adding a preprocessing pass that hashes only the
first 16K or so, to weed out large files that are actually different
before doing the full hash.) Anywhere I find a match on both file size
and SHA-256 hash, I add the files to a queue to process later.

Sounds a bit complex, but it works pretty well. Depending on the number
of actual matches, you can go through a few terabytes in a short period
of time.

I hope that makes sense.
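
Not the actual script, but the same size-then-SHA-256 idea sketched in
plain shell (assuming GNU find, xargs, and uniq, and file names without
embedded tabs or newlines; A and B are the two trees):

    # Pass 1: record size and path for every file, then keep only the
    # paths whose size occurs more than once -- a unique size cannot
    # have a duplicate.
    find A B -type f -printf '%s\t%p\n' > sizes
    awk -F'\t' 'NR==FNR { n[$1]++; next } n[$1] > 1 { print $2 }' sizes sizes > candidates

    # Pass 2: hash only the candidates and group identical digests.
    # (The "first 16K" prefilter mentioned above would slot in here,
    # e.g. head -c 16384 piped to sha256sum as a cheap first cut.)
    xargs -d '\n' sha256sum < candidates | sort > sums

    # Any group of lines sharing the same 64-character digest is a
    # duplicate set.
    uniq -w64 --all-repeated=separate sums > dupes

The real script queues the matching sets instead of just writing them
out, but the filtering order is the same: size first, full hash only
where the sizes collide.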
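
And if the two trees really do mirror each other path-for-path, the
comm(1) route suggested above would look roughly like this (sha256sum
swapped in for md5sum; a-sums and b-sums are just placeholder names):

    # run from the common parent so the relative paths under A and B
    # line up
    ( cd A && find . -type f -exec sha256sum '{}' + ) | sort > a-sums
    ( cd B && find . -type f -exec sha256sum '{}' + ) | sort > b-sums

    comm -12 a-sums b-sums   # same path, same contents, in both trees
    comm -23 a-sums b-sums   # only under A (or contents differ)
    comm -13 a-sums b-sums   # only under B (or contents differ)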