diff or deduplicate two volumes with different folder structures

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Mon, 19 Sep 2016 17:23:39 -0600

Drives A and B have many overlapping files, but I want to find out which
files are missing from each. Complicating this, the directory structure
differs between the two drives, and I'm fairly certain some of the
file names differ as well.

Therefore I need something hash based. I started with this:

$ find /brickA -type f -exec md5sum "{}" + > brickA.txt
$ find /brickB -type f -exec md5sum "{}" + > brickB.txt

What I need next is to:

1. Make a copy of each list: brickAcopy.txt and brickBcopy.txt.
2. Loop: for each md5sum in brickA.txt, grep for it in brickAcopy.txt
   and brickBcopy.txt; if it's found in both, delete the matching
   lines from both copies.

What remains in each file are paths to files that don't exist on the
other drive. This must be a solved problem, so I'm open to alternative
approaches.
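
As a minimal sketch of the comparison described above (not a tested
recipe), the copy-and-grep loop could be replaced by one awk pass per
direction. The function name only_in is made up for illustration. It
assumes the "hash  path" line format that md5sum writes, and it compares
hashes only, so paths and file names are ignored.

```shell
# only_in LIST OTHER
# Print the lines of LIST whose hash does not appear anywhere in OTHER.
# "only_in" is a hypothetical helper name, not an existing tool.
only_in() {
    awk '
        # The hash is the first field. md5sum prefixes the line with a
        # backslash when the file name needed escaping, so strip that.
        { h = $1; sub(/^\\/, "", h) }

        # First file on the command line (OTHER): remember its hashes.
        FILENAME == ARGV[1] { seen[h] = 1; next }

        # Second file (LIST): print lines whose hash was never seen.
        !(h in seen)
    ' "$2" "$1"
}
```

With the function defined, the two result lists would come from:

$ only_in brickA.txt brickB.txt > onlyA.txt
$ only_in brickB.txt brickA.txt > onlyB.txt

This reads each list twice in total and never edits a file in place, so
the original brickA.txt and brickB.txt stay intact.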

Both drives use Btrfs, so I can create snapshots and perform a "dedup"
operation on those snapshots directly. Ideally the dedup would delete
the common files from both snapshots in one pass (it would be
considered data loss if it weren't for the snapshots), just to save
time. But if necessary I'll just do a one-way dedup twice, with the two
drives' roles reversed, and suffer the extra processing time.

Ideas?

-- 
Chris Murphy