Dupseek
A command-line interactive Perl program to find and remove duplicate files.
Algorithm
A few strategies are possible for finding duplicate files in a big set, such as a heavily populated directory.
One of the most widely used consists of grouping files by size (because files of different sizes can't be identical) and then computing a short digital fingerprint (such as an MD5 checksum) for each file. Files with different fingerprints are different, and files with the same fingerprint are very probably the same. Just to be sure, one can then compare the possible duplicates directly.
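The size-then-checksum strategy can be sketched as follows. This is an illustrative Python sketch, not part of dupseek; the function names `file_md5` and `duplicates_by_checksum` and the block size are assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, block_size=65536):
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def duplicates_by_checksum(paths):
    """Group paths by size, then by MD5; return groups with 2+ members."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[file_md5(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)
```

Note that every file sharing its size with another file is read in full here, even when it differs from the others in its first few bytes.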
Dupseek does something different:
- It starts by grouping files by size.
- Then it starts reading small chunks of the files of the same size and comparing them. It creates smaller groups depending on these comparisons.
- It goes on with bigger and bigger chunks (of size up to a hard-coded limit).
- It stops reading from files as soon as they form a single-element group or they are read completely (which only happens when they have a very high probability of having duplicates).
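The steps above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not dupseek's Perl code: the names `refine_by_chunks` and `find_duplicates`, the starting chunk size, the doubling schedule and the 64 KiB cap are all hypothetical. For simplicity the sketch reads each file front to back and keeps every file of a size group open at once; the order in which dupseek actually visits a file is not reproduced.

```python
import os
from collections import defaultdict


def refine_by_chunks(paths, first_chunk=512, max_chunk=65536):
    """Split same-size files into groups with identical content.

    first_chunk, max_chunk and the doubling schedule are assumptions.
    """
    size = os.path.getsize(paths[0])
    handles = {p: open(p, "rb") for p in paths}
    try:
        pending = [list(paths)]  # groups that might still hold duplicates
        chunk, offset = first_chunk, 0
        while pending and offset < size:
            next_pending = []
            for group in pending:
                buckets = defaultdict(list)
                for path in group:
                    buckets[handles[path].read(chunk)].append(path)
                # Single-element groups are dropped and never read again.
                next_pending.extend(g for g in buckets.values() if len(g) > 1)
            pending = next_pending
            offset += chunk
            chunk = min(chunk * 2, max_chunk)  # bigger chunks, up to a limit
        return [sorted(g) for g in pending]
    finally:
        for handle in handles.values():
            handle.close()


def find_duplicates(paths):
    """Group by size first, then refine each group chunk by chunk."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    result = []
    for group in by_size.values():
        if len(group) > 1:
            result.extend(refine_by_chunks(group))
    return sorted(result)
```

Only files that survive every round are read completely, so a file that differs from the others early on costs a single small read.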
Partial execution
Dupseek (and destroy) can be interrupted at any moment. The user is then presented with partial results and can either intervene manually or go on with the reading and computation, on a group-by-group basis. Since subsequent reads happen sparsely in the file, if some files are still in the same group after many iterations, they are most probably identical, unless the differences are very small.
Platforms
Dupseek was reported to run on the following platforms:
- Debian GNU/Linux "Woody" and "Sarge"
- Mac OS X v10.2.6
- FreeBSD 4.7
Dependencies
Dupseek was developed with perl 5.6.1 and was also tested with perl 5.8.4. It relies on the following modules:
- File::Find: directory recursion;
- IO::File: object-oriented file handles;
- Getopt::Std: option parsing.
License
Dupseek (and destroy) is Copyright Antonio Bellezza 2003-2005. It is released under the GPL v2. Here is the license notice:
This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation; This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
Beware
The program destroys files. Starting from version 1.1, it can also do so automatically, and mistakes can happen on the user's or the programmer's part. So, be warned!
Usage
dupseek -h outputs a help page.
Hit Ctrl-C to interrupt interactive execution and be presented with partial results.
Credits
I would like to thank Henry Laxen for sending me his patch implementing batch processing and option parsing (see credits.txt).
My thanks also go to Glenn Powers for extensive testing on Mac OS X and pointing out the problem with changing files/directories.
Download
The latest version is
You can also download the older releases
Bugs
- (Corrected in 1.3) If a directory is entered twice, or is contained in another one, then all its files are found twice and identified as duplicates. This can be a VERY DANGEROUS SITUATION.
- Dupseek gets confused if files are modified/moved while it's working. Starting from version 1.1, you should avoid making any changes to the folders you are checking while dupseek is running.
- Testing under other platforms was not carried out. Please send me some feedback if you are brave enough to use dupseek on a different OS.
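The first bug in the list (a directory entered twice, or contained in another one) comes down to the same files being visited through two starting points. One possible guard, shown here as a Python sketch and not as the fix actually applied in dupseek 1.3, is to resolve the starting directories and drop any that repeat or lie inside another before walking them. The function name `prune_nested_roots` is hypothetical, and POSIX-style paths are assumed.

```python
import os


def prune_nested_roots(roots):
    """Drop directories that repeat, or lie inside, another listed one."""
    # realpath resolves symlinks, "..", and trailing slashes, so two
    # spellings of the same directory collapse into a single entry.
    resolved = sorted({os.path.realpath(r) for r in roots})
    kept = []
    for root in resolved:
        # A parent always sorts before its children, so it is already
        # in `kept` by the time a nested directory is examined.
        if not any(os.path.commonpath([root, k]) == k for k in kept):
            kept.append(root)
    return kept
```

This only covers the starting directories. Symbolic links followed inside a tree, or hard links to the same file, would need a separate check, for example on device and inode numbers.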