Many of us end up, inevitably, with so many files and folders that it is impossible to keep them under control without some specialized help. Luckily, as I’ll show you in a moment, Linux offers several very efficient solutions to this problem.
Multiple copies of many files, scattered all over the computer, waste space, create confusion, and slow down desktop indexers like DocFetcher. I have already explained elsewhere how to find and remove the unwanted extra copies.
When it comes time to clean up your folders and files, a common problem crops up: how can I find where duplicate files and folders exist between multiple directories? The problem is both more complex and much more common than it may appear at first sight. A directory may contain many, many levels of sub-directories, each with thousands of files of all sorts. Trying to figure out manually the differences between two directory trees like those could take days.
One reason why you need to know the differences between directories is so you can ensure that all your backups are working as expected! What if the automated backup procedure you run every day has a bug? What if a sector of the external drive(s), DVDs, or remote computer to which you continuously copy all your precious folders suddenly (and silently) broke? Would you notice it before actually needing those backups? This is the main reason to be able to quickly find out if the contents of two folders differ. Let’s see how to make this easy.
Automatic comparison
It is important to be able to run certain checks automatically from a shell script, especially if all you want is a quick yes or no answer and automatic notifications. Here are a few command line utilities that you may use as a basis for scripts that perform such checks. You may then run those scripts either as automatic cron jobs, or whenever you feel like checking if that DVD or external drive is still free from errors.
find
This pipe of commands:
find "$FOLDER" -type f | cut -d/ -f2- | sort > "/tmp/file_list_$FOLDER"
will save in /tmp/file_list_$FOLDER an alphabetically ordered list of all the files inside $FOLDER, complete with the corresponding sub-folders, e.g. something like this:
family/health_insurance.pdf
family/holiday_quote.pdf
pictures/2012/graduation.jpg
work/linux-review.odt
Running the pipe on two or more directories and comparing the resulting file lists will not find all the differences between them. You will only spot files or folders that are missing from one tree or that have different names. Files with the same names and in the same subfolders, but with different content, will look identical in the lists. Still, this may be a very quick way to spot certain mismatches.
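As a rough illustration of the comparison step, here is a minimal sketch that builds the two lists and shows the entries present in only one of them. The function names list_files and compare_lists are invented for this example, and the sketch assumes that both folders are local and that no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helper: print the sorted, relative paths of all regular
# files below the directory given as first argument.
list_files() (
    cd "$1" || exit 1
    find . -type f | sed 's|^\./||' | LC_ALL=C sort
)

# Hypothetical helper: print the paths that exist in only one of two trees.
# "<" marks files found only in the first tree, ">" only in the second.
compare_lists() (
    list_1=$(mktemp) || exit 1
    list_2=$(mktemp) || exit 1
    list_files "$1" > "$list_1"
    list_files "$2" > "$list_2"
    # comm needs both inputs sorted with the same collation rules as sort
    LC_ALL=C comm -23 "$list_1" "$list_2" | sed 's/^/< /'
    LC_ALL=C comm -13 "$list_1" "$list_2" | sed 's/^/> /'
    rm -f "$list_1" "$list_2"
)
```

Calling compare_lists with two folder names prints nothing when both trees contain the same file names. As explained above, that says nothing about the content of those files.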
diff
Diff is normally used to compare two files, but can do much more than that. The options -r and -q make it work recursively and briefly, that is, reporting only which files differ instead of how they differ, which is just what we are looking for:
marco #> diff -rq todo_orig/ todo_sync/
Only in todo_orig/essays: Digital-Citizenship-tech4engage-summit-report.pdf
Files todo_orig/copyright/copyright_licensing.t2t and todo_sync/copyright/copyright_licensing.t2t differ
diff: todo_orig/embedded_linux/init.d/led_driver: No such file or directory
diff: todo_sync/embedded_linux/init.d/led_driver: No such file or directory
Files todo_orig/strider/food/backpacking_food.t2t and todo_sync/strider/food/backpacking_food.t2t differ
…
As you can see, all the differences between two directory trees appear, be they files only present in one of them, or files that are different. Even files like “led_driver”, which are present in both folders but don’t really exist because they are links to other files that were deleted, are listed. Counting the number of lines generated by such an invocation of diff shows immediately if the two trees differ, as in this Bash sketch:
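# count the lines of the report; 2>&1 also counts error messages, e.g. for broken links
DIFF_NUM=$(diff -rq "$DIR_1" "$DIR_2" 2>&1 | wc -l)
if [ "$DIFF_NUM" -gt 0 ]
then
    # replace this line with a command that emails you all the differences
    echo "$DIFF_NUM differences found between $DIR_1 and $DIR_2"
fi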
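A possible variation, sketched here under the assumption that a plain yes or no answer is enough, relies on the exit status of diff instead of counting lines: diff returns 0 when it finds no differences, 1 when it finds some, and 2 when it runs into trouble. The function name check_trees is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: compare two directory trees with diff -rq.
# Prints the report only when something is wrong, and returns the
# exit status of diff: 0 = identical, 1 = different, 2 = trouble.
check_trees() (
    # keep error messages (e.g. about broken links) in the report
    report=$(diff -rq "$1" "$2" 2>&1) && status=0 || status=$?
    if [ "$status" -ne 0 ]; then
        printf '%s\n' "$report"
    fi
    exit "$status"
)
```

Because the function stays silent when the trees match, it suits cron jobs well: cron normally mails whatever a job prints to the owner of the crontab, provided the system is set up to deliver local mail.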
rsync
Rsync can produce a difference report that you may parse and use in the same way as the one from diff:
marco #> rsync -rvnc --delete todo_sync/ todo_orig/
sending incremental file list
deleting essays/Digital-Citizenship-tech4engage-summit-report.pdf
copyright/copyright_licensing.t2t
skipping non-regular file "embedded_linux/init.d/led_driver"
strider/food/backpacking_food.t2t
sent 148763 bytes received 473 bytes 27133.82 bytes/sec
total size is 854518613 speedup is 5725.95 (DRY RUN)
The four command line switches r, v, c and n tell rsync (check the man page for details) to perform a verbose, recursive, checksum-based synchronization of the two directories, but only for show: -n, in fact, displays what rsync would do if you let it make the second folder a perfect copy of the first one. The --delete option adds to the report the files that exist only in the second folder. The huge advantage of rsync over diff is that the former can compare local directories with remote ones.
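To use such a report in a script, one option is to drop -v and ask rsync for an itemized list of changes with -i, which prints one line per file or folder it would touch. The sketch below counts those lines. The function name count_rsync_changes is invented for this example, and skipping the lines that start with a dot (which rsync uses for items whose content would not be transferred) is my own choice, not something the report above requires.

```shell
#!/bin/sh
# Hypothetical helper: count the changes rsync would make to turn the
# tree in $2 into a copy of the tree in $1. Nothing is modified.
count_rsync_changes() {
    # -r recursive, -c compare by checksum, -n dry run, -i itemize changes
    # lines starting with "." describe attribute-only changes, so skip them
    rsync -rcni --delete "$1/" "$2/" | grep -c -v '^\.' || true
}
```

A result of 0 means that rsync would neither copy nor delete anything. Note that this sketch does not tell an rsync failure apart from an empty report, so a real script should also check that both folders are reachable.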
Author info:
Marco Fioretti is a freelance writer and teacher whose work focuses on open digital technologies.