How to compare the content of two folders automatically

Many of us inevitably end up with so many files and folders that it is impossible to keep them under control without some specialized help. Luckily, as I’ll show you in a moment, Linux offers several very efficient solutions to this problem.

Multiple copies of many files, scattered all over the computer, waste space, create confusion, and slow down desktop indexers like DocFetcher. I have already explained in a separate article how to find and remove the unwanted extra copies.

When the time comes to clean up your folders and files, a common problem crops up: how do you find which files and folders are duplicated, or differ, between multiple directories? The problem is both more complex and much more common than it may appear at first sight. A directory may contain many levels of sub-directories, each with thousands of files of all sorts. Working out by hand the differences between two directory trees like those could take days.

One reason to know the differences between directories is to make sure that all your backups are working as expected. What if the automated backup procedure you run every day has a bug? What if a sector of the external drive(s), the DVDs, or the remote computer to which you continuously copy all your precious folders suddenly (and silently) fails? Would you notice it before actually needing those backups? This is the main reason to be able to quickly find out if the contents of two folders differ. Let’s see how to make this easy.
Automatic comparison

It is important to be able to run certain checks automatically from a shell script, especially if all you want is a quick yes or no answer and automatic notifications. Here are a few command line utilities that you may use as a basis for scripts that perform such checks. You may then run those scripts either as automatic cron jobs, or whenever you feel like checking if that DVD or external drive is still free from errors.
find

This pipe of commands:

find "$FOLDER" -type f | cut -d/ -f2- | sort > "/tmp/file_list_$FOLDER"

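will save in /tmp/file_list_$FOLDER an alphabetically ordered list of all the files inside $FOLDER, complete with the corresponding sub-folders. This assumes that $FOLDER is a plain directory name in the current directory, without slashes: cut removes only the first component of each path, and the name is reused for the list file. The result is something like this: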

family/health_insurance.pdf

family/holiday_quote.pdf

pictures/2012/graduation.jpg

work/linux-review.odt

Running the pipe on two or more directories and comparing the resulting file lists will not find all the differences between them. You will only spot missing files, or folders containing sets of files with different names. Files with the same names and in the same subfolders, but with different content, will look identical in the lists. Still, this may be a very quick way to spot certain mismatches.
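
The comparison of two such lists can itself be scripted. What follows is only a minimal sketch, assuming that find, sort, comm, sed and mktemp are available; the function names list_files and compare_file_lists are made up for this example. It builds the two lists in temporary files and prints the paths that appear in only one of the two trees:

```shell
#!/bin/sh
# Sketch: list the files that exist in only one of two directory trees.
# list_files and compare_file_lists are hypothetical helper names.

# Print every regular file under $1 as a path relative to $1, sorted.
list_files() {
    ( cd "$1" && find . -type f | sed 's|^\./||' | LC_ALL=C sort )
}

# Print paths found only in $1 (prefixed "< ") or only in $2 (prefixed "> ").
compare_file_lists() {
    list_a=$(mktemp) || return 2
    list_b=$(mktemp) || return 2
    list_files "$1" > "$list_a"
    list_files "$2" > "$list_b"
    # comm -23 keeps lines unique to the first list, comm -13 to the second
    LC_ALL=C comm -23 "$list_a" "$list_b" | sed 's/^/< /'
    LC_ALL=C comm -13 "$list_a" "$list_b" | sed 's/^/> /'
    rm -f "$list_a" "$list_b"
}
```

Calling compare_file_lists with two directories as arguments prints nothing when both trees contain the same file names. Like the lists themselves, it says nothing about files whose names match but whose content differs.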
diff

Diff is normally used to compare two files, but can do much more than that. The options -r and -q make it work recursively and briefly, that is, reporting only which files differ and not how, which is just what we are looking for:

marco #> diff -rq todo_orig/ todo_sync/

Only in todo_orig/essays: Digital-Citizenship-tech4engage-summit-report.pdf

Files todo_orig/copyright/copyright_licensing.t2t and todo_sync/copyright/copyright_licensing.t2t differ

diff: todo_orig/embedded_linux/init.d/led_driver: No such file or directory

diff: todo_sync/embedded_linux/init.d/led_driver: No such file or directory

Files todo_orig/strider/food/backpacking_food.t2t and todo_sync/strider/food/backpacking_food.t2t differ

…

As you can see, all the differences between the two directory trees appear, be they files only present in one of them, or files that are different. Even entries like “led_driver”, which are present in both folders but don’t really exist because they are symbolic links to files that were deleted, are listed. Counting the number of lines generated by such an invocation of diff shows immediately if the two trees differ, as in this Bash sketch:

DIFF_NUM=$(diff -rq "$DIR_1" "$DIR_2" | wc -l)

if [ "$DIFF_NUM" -gt 0 ]; then

    # placeholder: replace with a command that emails me all the differences

    echo "$DIFF_NUM differences between $DIR_1 and $DIR_2"

fi
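
Instead of counting lines, a script may also look at the exit status of diff, which is 0 when nothing differs, 1 when differences were found, and 2 when diff ran into trouble (a broken link like led_driver above, for example). The following is a sketch under those assumptions, not a finished backup checker; check_backup is a name invented for this example:

```shell
#!/bin/sh
# Sketch: turn the exit status of "diff -rq" into a yes/no answer.
# check_backup is a hypothetical helper name.
check_backup() {
    # Capture both the report and the error messages of diff
    report=$(diff -rq "$1" "$2" 2>&1)
    status=$?
    case $status in
        0) echo "OK: $1 and $2 are identical" ;;
        1) echo "DIFFERENT: $1 and $2"
           printf '%s\n' "$report" ;;
        *) echo "ERROR while comparing $1 and $2"
           printf '%s\n' "$report" ;;
    esac
    # Hand the status of diff back to the caller
    return "$status"
}
```

When a function like this runs from a cron job, whatever it prints is normally mailed to the owner of the job, provided that local mail delivery is configured on that computer.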

rsync

Rsync can produce a difference report that you may parse and use in the same way as the one from diff:

marco #> rsync -rvnc --delete todo_sync/ todo_orig/

sending incremental file list

deleting essays/Digital-Citizenship-tech4engage-summit-report.pdf

skipping non-regular file "embedded_linux/init.d/led_driver"

strider/food/backpacking_food.t2t

sent 148763 bytes received 473 bytes 27133.82 bytes/sec

total size is 854518613 speedup is 5725.95 (DRY RUN)

The four command line switches r, v, c and n tell rsync (check the man page for details) to perform a verbose, recursive, checksum-based synchronization of the two directories, but only for show: -n, in fact, makes it a dry run that displays what rsync would do if you let it make the second folder a perfect copy of the first one. The --delete option adds to the report the files that exist only in the second folder, which a real run would delete. The huge advantage of rsync over diff is that it can compare local directories with remote ones.
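
For scripting, the verbose report can be swapped for the more regular output of the -i (--itemize-changes) option, which prints one line per change. The sketch below rests on that assumption; rsync_diff_count is a made-up name. It counts the files that a dry run would transfer, plus the entries it would delete:

```shell
#!/bin/sh
# Sketch: count the entries a checksum-based dry run would change.
# rsync_diff_count is a hypothetical helper name.
rsync_diff_count() {
    # -r recursive, -n dry run, -c compare checksums, -i itemize each change;
    # --delete also reports entries that exist only in the destination ($2).
    # The trailing slashes make rsync compare the contents of the directories.
    # grep -c exits with status 1 when the count is 0, hence the "|| true".
    rsync -rnci --delete "$1/" "$2/" |
        grep -Ec '^(\*deleting|[<>ch.]f)' || true
}
```

A result of 0 means that rsync found nothing to transfer or delete. Since the same options also work when one side is remote, the second argument could be something like user@host:/path, at the price of an SSH connection for every check.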

Author info:

Marco Fioretti is a freelance writer and teacher whose work focuses on open digital technologies.
