John Ratcliff's Code Suppository

A place where I insert my code into the anus of the Internet.

 

Thursday, November 03, 2011

GatherPictures : A console app to gather all JPEG files into one location without duplicates


Today I'm providing a little console application that I wrote for my own personal needs.  On the theory that if I found it useful, probably some other people might find it useful too, I'm making it available here.


The tool is provided as a ZIP file.  It includes both the source code and the Windows console application.  It is designed to be run from the command line, so if you are a Windows user who never opens a command prompt, this might not be the tool for you.

So, here is the problem I was trying to solve.  I think many other people may have this problem too.  Over the years you transfer files from various memory cards, digital cameras or iPhones.  You have them in a bunch of different places on your hard drive.  You have them on your computer, you have them on your wife's computer, and you have them on old CD-ROMs or external USB drives.

Then, one day, you would like to simply get a copy of all the photographs you have scattered around all these locations into one spot, without any binary duplicates.  This tool does not do inexact matching.  It won't detect images which are the same picture but at a different resolution, or which have been cropped or rotated.  There are commercial tools that can perform this operation fairly well.  Personally I use Visual Similarity Duplicate Image Finder.

This tool also does not do image resizing.  There are many commercial tools that can do that as well.

What the tool is designed to do is collect all of my pictures, from many different sources, into one single flat directory so that I can then, easily, run these commercial tools on them to create a nice clean set of non-duplicate images ready to be copied onto a memory card to put into a digital picture frame.

There is nothing particularly special or magical about this source code.  It's nothing I'm proud of in any particular way; it's just something I hacked up real quick that seems to work.

Now, here is how you use the tool.  Let's say you have a bunch of pictures in a directory called 'MyPhotographs' and under that directory are many sub-directories.  Now, let's say you have another directory called 'WifesPhotographs' which may, or may not, have some of the same pictures duplicated.

The way to use this tool is to first create a folder that you want to 'gather' these images into.

Next, go to that directory at the command prompt and run GatherPictures.exe (it should be placed somewhere that your default search path can find):

GatherPictures c:\MyPhotographs

This will find all JPG files in that directory and all sub-directories and copy them into your current directory giving each image a unique flat file name.

For example, if you had an image 'c:\MyPhotographs\vacation\img001.jpg'

It would get copied as the file name 'MyPhotographs_vacation_img001.jpg'
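For anyone curious, here is a minimal Python sketch of that naming rule.  It is not the tool's actual source; 'flat_name' is a made-up helper, and dropping the drive letter while joining the remaining path parts with underscores is simply my reading of the example above.

```python
from pathlib import PureWindowsPath


def flat_name(source_path: str) -> str:
    """Turn a Windows path into a single flat file name (hypothetical helper)."""
    path = PureWindowsPath(source_path)
    # Drop the drive/root part ("c:\") if there is one; keep everything else.
    parts = path.parts[1:] if path.anchor else path.parts
    # Join the directory names and the file name with underscores.
    return "_".join(parts)


print(flat_name(r"c:\MyPhotographs\vacation\img001.jpg"))
# prints: MyPhotographs_vacation_img001.jpg
```

Because the directory names become part of the file name, two files called 'img001.jpg' in different folders end up with different flat names.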

If after running 'GatherPictures' against 'c:\MyPhotographs' you run it a second time, then it will detect that all the files are the same and it won't copy anything.  On the other hand, if new photographs get added to that directory, it will detect the new ones and pick them up.

You can continue to run 'GatherPictures' against any directory and it will avoid copying any binary duplicates.

Once you have 'gathered' all of your pictures you may find that you have gigs and gigs of photographs.  For example, I had about 50 GB of photographs in 38,000 files.  One of the issues I ran into is that when I tried to run some image processing tools on this directory, they crashed because they had never been designed to handle that many images.  Also, there is the use case where you might want to break these up into, say, 8 GB or 16 GB chunks suitable for copying onto a memory card.

I added a new feature to 'GatherPictures' which will take all of the pictures in the source directory and copy them into sub-directories of a specified size limit.  The original pictures are left alone.

The usage for this is:

GatherPictures -split

So, if you wanted to split all files into 1 GB sub-folders, you would use:

GatherPictures -split 1024

(The size limit is given in megabytes, and a gigabyte is 1,024 megabytes.)
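To give a rough idea of what splitting by a size limit means, here is a small Python sketch.  It is only an illustration under my own assumptions: 'plan_split' is a hypothetical function, the files are filled into sub-folders in the order given, and a file larger than the limit simply gets a sub-folder to itself.  The actual tool may group files differently.

```python
MB = 1024 * 1024  # bytes in one megabyte


def plan_split(files, limit_mb):
    """Group (name, size_in_bytes) pairs into lists that each fit in limit_mb.

    Hypothetical sketch: fills one group at a time, in the order given.
    """
    limit = limit_mb * MB
    groups = []
    current, used = [], 0
    for name, size in files:
        # Start a new group when this file would push the current one over.
        if current and used + size > limit:
            groups.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        groups.append(current)
    return groups


photos = [("a.jpg", 600 * MB), ("b.jpg", 500 * MB), ("c.jpg", 100 * MB)]
print(plan_split(photos, 1024))
# prints: [['a.jpg'], ['b.jpg', 'c.jpg']]
```

Each inner list would become one sub-folder; the originals would be copied, not moved.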

For those who are interested, here is how the code works.

At start up the first thing it does is scan all JPG files in the current directory.  It examines the size of each file and places it into a hash map.  It also builds a hash map of all file names.
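Here is a short Python sketch of that start-up scan.  The names are my own invention ('index_directory' is hypothetical), and treating both '.jpg' and '.jpeg' as JPEG files, as well as comparing names case-insensitively, are assumptions rather than something taken from the tool's source.

```python
import os
from collections import defaultdict


def index_directory(folder="."):
    """Index the JPEG files in 'folder' by name and by size (a sketch)."""
    names = set()                # every file name already gathered
    by_size = defaultdict(list)  # file size in bytes -> paths with that size
    with os.scandir(folder) as entries:
        for entry in entries:
            # Assumption: .jpg and .jpeg both count, case-insensitively.
            if entry.is_file() and entry.name.lower().endswith((".jpg", ".jpeg")):
                names.add(entry.name.lower())
                by_size[entry.stat().st_size].append(entry.path)
    return names, by_size
```

Keying the second map on file size means that a new file whose size has never been seen before can be accepted without reading a single byte of any gathered file.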

Next, it scans all of the files in the source directory and, for each one, does the following.

First, it makes sure the file name doesn't match one that has already been gathered.  It uses the previously built hash map to speed this up.  If it matches, it skips it.

If the file name doesn't match, it then looks up the file's size in the hash map of files by size.  If there are already one or more files of that exact size, it then tests whether the new file is a duplicate of any of them.

First, it reads the source file into memory and computes a 32-bit CRC.  Next, it compares that CRC to the CRC of each previously registered file of the same size.  The first time a registered file is checked, it has to be read into memory to compute its CRC.  However, the CRC is cached, so this is only done once.  The next time the tool encounters a file of this same size, it can just use the cached CRC for an early reject.

If the CRCs match, only then does it do an exact binary memory compare to see if the two files are, in fact, exactly identical.
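The size, then CRC, then binary compare sequence can be sketched in a few lines of Python.  This is not the tool's source; the 'Gathered' class and 'is_duplicate' function are hypothetical names, and I'm using zlib.crc32 as a stand-in for whatever 32-bit CRC the tool computes.

```python
import zlib


class Gathered:
    """A file already in the gather folder, with a lazily cached CRC."""

    def __init__(self, path):
        self.path = path
        self._crc = None  # computed the first time it is needed

    def crc(self):
        if self._crc is None:
            with open(self.path, "rb") as f:
                self._crc = zlib.crc32(f.read())
        return self._crc


def is_duplicate(source_path, same_size_files):
    """Check a source file against gathered files of the same size.

    'same_size_files' is a list of Gathered entries whose size matches.
    """
    with open(source_path, "rb") as f:
        data = f.read()
    source_crc = zlib.crc32(data)
    for entry in same_size_files:
        if entry.crc() != source_crc:
            continue  # early reject: different CRC means different contents
        # CRCs match, so confirm with an exact byte-for-byte compare.
        with open(entry.path, "rb") as f:
            if f.read() == data:
                return True
    return False
```

The final compare matters because two different files can share the same 32-bit CRC; the CRC only rules files out cheaply, it never proves they are identical.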

Let me know if you find the tool useful or can think of improvements.  As I said, my goal was not to replace the functionality of commercial duplicate image finder tools.  Rather, I just wanted a way to collate all my image files into one single location, giving each file a unique file name based on its original source directory location.