On Tue, Sep 20, 2016 at 10:52:10PM +0200, Ahmad Samir wrote:
One last try (sometimes an issue nags):

    $ find A -exec md5sum '{}' + > a-md5
    $ find B -exec md5sum '{}' + > b-md5
    $ cat a-md5 b-md5 > All
    $ sort -u -k 1,1 All > dupes
Now (I hopefully got my head around it this time...), the dupes file should contain a list of the files that exist in _both_ A and B; but for every pair of files with the same md5sum, _only one_ of them will be listed (from either A or B). So if you delete the files on that list, you should end up with only unique files in both locations.
At the start ISTR you said the two directory trees were different. I took that to mean that two files with identical contents could be in different directories within the two trees.
If I was wrong in that assumption, and each pair of identical files is at the same relative path, I have two suggestions.
First, sort a-md5 and b-md5 and use the comm(1) command. It will print lines that appear only in a-md5 (no indent), only in b-md5 (one tab), and in both files (two tabs); the -1, -2, and -3 options suppress the corresponding columns, so you can get the three columns individually. To do this you would have to cd into A or B and run the find commands as "find .", not "find A" or "find B", so the relative paths in the two lists are comparable.
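For example (a sketch; I've added -type f to skip directories, and comm needs its inputs sorted on the whole line):

    $ (cd A && find . -type f -exec md5sum '{}' +) | sort > a-md5
    $ (cd B && find . -type f -exec md5sum '{}' +) | sort > b-md5
    $ comm -12 a-md5 b-md5    # same relative path, same contents, in both trees
    $ comm -23 a-md5 b-md5    # in A only (or the contents differ)
    $ comm -13 a-md5 b-md5    # in B only (or the contents differ)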
Second, get a copy of an old program called dircmp and run it on the two trees directly. It will list the files found only in tree A, then those found only in tree B, then the files present in both, noting whether their contents are the same or different.
I don't have the compiled version of dircmp, but I have a ksh shell script version that is quite similar.
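(If you do turn up a copy, the invocation is just "dircmp A B"; if memory serves, the old System V version also took -s to suppress the "same" messages and -d to diff the files that differ, but check whatever man page comes with it.)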
Don't use MD5; you can get false matches from hash collisions. (SHA-256 is good, though the right choice depends on just how much you are comparing.)
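sha256sum is a drop-in replacement for md5sum in the pipelines above (coreutils ships both):

    $ find A -type f -exec sha256sum '{}' + > a-sha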
What I use is a perl script that takes the directories I want to dedupe and builds a hash table of all the file sizes. I then go through that table and ignore any size that has only one file. Once I have the list of files that share a size, I build a second hash table of the SHA-256 sums for just those files. (I plan on adding a pre-pass that hashes only the first 16k or so, to weed out large files that are actually different before paying for a full hash.) Any place where I find a match on both file size and SHA-256 hash, I add the files to a queue to process later.
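A rough shell equivalent of the same idea (a sketch, not my perl script; it assumes GNU find, xargs, and uniq, trees named A and B as above, and filenames without tabs or newlines):

    # List every file as "size<TAB>path", keep only sizes that occur
    # more than once, hash just those candidates, and print groups of
    # files whose SHA-256 sums match.
    find A B -type f -printf '%s\t%p\n' |
      awk -F'\t' '{ cnt[$1]++; paths[$1] = paths[$1] $2 "\n" }
            END   { for (s in cnt) if (cnt[s] > 1) printf "%s", paths[s] }' |
      xargs -r -d '\n' sha256sum |
      sort | uniq -w64 --all-repeated=separate

The uniq -w64 trick compares only the 64-character hash at the start of each sha256sum line and prints each group of duplicates separated by a blank line.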
Sounds a bit complex, but it works pretty well. Depending on the number of actual matches, you can go through a few terabytes in a short period of time.
I hope that makes sense.