Finding Duplicate Files

Alan alan at clueserver.org
Fri Mar 14 20:48:15 UTC 2008


> "Jonathan Roberts" <jonathan.roberts.uk at googlemail.com> writes:
>
>> I have several folders each approx 10-20 Gb in size. Each has some
>> unique material and some duplicate material, and it's even possible
>> there's duplicate material in sub-folders too. How can I consolidate
>> all of this into a single folder so that I can easily move the backup
>> onto different mediums, and get back some disk space!?
>
> An rsync-y solution not yet mentioned is to copy each dir 'd' to
> 'd.tidied' while giving a --compare-dest=... flag for each of the
> _other_ dirs.  'd.tidied' will end up containing the stuff unique to
> 'd'.  You can then combine them all with 'tar' or 'cp' or whatever.
>
> You could use the 'sha1sum' tool to test that files in the
> *.tidied dirs really are unique.
>
> This technique will catch identical files with like names, e.g.
>
>    d1/foo/bar/wibble
>    d2/foo/bar/wibble
>
> but not
>
>    d1/foo/bar/wibble
>    d2/bar/wibble/foo/wobble
>
> (if that makes sense). rsync --compare-dest and --link-dest: fantastic.

I wrote a program MANY years back that searches for duplicate files.  (I
had a huge number of files from back in the BBS days with the same
contents but different names.)

Here is how I did it. (This was done using Perl 4.0 originally.)

Recurse through all the directories and build a hash table keyed on file
size.  Go through the table and look for collisions, i.e. sizes shared by
more than one file.  (This keeps you from doing an MD5SUM on very large
files whose size occurs only once.)  For each set of size collisions,
build a hash table of MD5SUMs (the program now uses SHA512).  Take any
hash collisions, add them to a stack, and prompt the user for what to do
with those entries.
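
Something along these lines, in modern Perl.  This is just an untested
sketch of the shape of it, not the actual program; the directory
arguments and the Digest::SHA wiring are mine.

#!/usr/bin/perl
# Sketch only -- not the original program.  Group files by size, then
# run SHA-512 only over the size-collision groups.
use strict;
use warnings;
use File::Find;
use Digest::SHA;

my %by_size;
find(sub {
    return unless -f $_;
    push @{ $by_size{ -s _ } }, $File::Find::name;
}, @ARGV ? @ARGV : '.');

for my $files (values %by_size) {
    next if @$files < 2;            # size occurs once, cannot be a dup
    my %by_digest;
    for my $file (@$files) {
        my $sha = Digest::SHA->new(512);
        $sha->addfile($file);
        push @{ $by_digest{ $sha->hexdigest } }, $file;
    }
    for my $dups (values %by_digest) {
        next if @$dups < 2;
        print "Possible duplicates:\n", map { "  $_\n" } @$dups;
        # the real thing pushes these onto a stack and prompts the user
    }
}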

There is also another optimization on top of the above.  The first
content hash should only cover the first 32k or so of each file.  If
there are still collisions, then hash the whole file and check for
collisions on those.  This two-pass check speeds things up by a great
deal if you have many large files of the same size (multi-part archives,
for example).  Using this method I have removed all the duplicate files
on a terabyte drive in about 3 hours or so, and that was without the
above optimization.
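
The partial-hash pass could be bolted on with a small helper along these
lines.  Again just a sketch: the 32k cutoff and the sha512_hex call are
my assumptions about how you would wire it in.

use Digest::SHA qw(sha512_hex);

# Sketch of the first-pass digest: hash only the first 32k of each file.
# Only files whose partial digests collide get a full-file hash in the
# second pass.
sub partial_digest {
    my ($path, $bytes) = @_;
    $bytes ||= 32 * 1024;
    open my $fh, '<:raw', $path or return;
    my $buf = '';
    read $fh, $buf, $bytes;
    close $fh;
    return sha512_hex($buf);
}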

BTW, for all you patent trolls...  The idea was also done independently
by Phil Karn back in the early 90s.  (I think I did this around
1996-97.)

I have not publicly released the source for a couple of reasons: 1) the
program is a giant race condition, and 2) the code was written before I
was that great of a Perl programmer and has some real fugly bits in it.

I am rewriting it with lots of added features and optimizations beyond
those mentioned above.
