# Identical File Finder



## xbonez (Feb 12, 2012)

I needed to sort through a directory containing 500 or so files (many of which were identical in content), and pick out only the unique files. I ended up writing a short program in C# to do the task for me. Posting the executable and code below, should anyone ever require it.

*Summary*

The program iterates over all the files in the input directory and checks for uniqueness by calculating the MD5 hash of the file contents. All unique files are copied over to the output directory.

*Executable:* http://dl.dropbox.com/u/1276196/DND/IdenticalFileFinder.exe

*Usage*

The program expects two command line arguments. 
The first argument is the  input directory. 
The second argument is the output directory where the unique files will get copied.

A trailing slash is optional.

*Source*


```
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

namespace IdenticalFileFinder
{
    class Program
    {
        static string baseDir = String.Empty;
        static string outputDir = String.Empty;

        static void Main(string[] args)
        {
            getArgs(args);
            normalizeDirectories();

            List<string> hashes = new List<string>();
            string[] files = Directory.GetFiles(baseDir);

            if (!Directory.Exists(outputDir))
            {
                try
                {
                    Directory.CreateDirectory(outputDir);
                }
                catch(Exception e)
                {
                    Console.WriteLine(e.ToString());
                    Environment.Exit((int)ExitScenarios.READWRITEERROR);
                }
            }

            foreach(string fileName in files)
            {
                string hash = GetMD5HashFromFile(fileName);

                if (!hashes.Contains(hash))
                {
                    string dest = outputDir + Path.GetFileName(fileName);
                    hashes.Add(hash);
                    try
                    {
                        File.Copy(fileName, dest);
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine(e.ToString());
                        Environment.Exit((int)ExitScenarios.READWRITEERROR);
                    }
                }
            }
            Console.WriteLine(String.Format("{0} unique files found in {1} total files.",hashes.Count,files.Length));
            Environment.Exit((int)ExitScenarios.SUCCESS);
        }

        public static void normalizeDirectories()
        {            
            if (baseDir[baseDir.Length - 1] != '\\')
                baseDir += "\\";
            if (outputDir[outputDir.Length - 1] != '\\')
                outputDir += "\\";
        }       

        public static void getArgs(string[] args)
        {
            if (args.Length == 2)
            {
                baseDir = args[0];
                outputDir = args[1];
            }
            else
            {
                Console.WriteLine("Incorrect arguments passed in. First argument should be input directory and second should be output directory.");
                Environment.Exit((int)ExitScenarios.ARGFAIL);
            }

            if (!Directory.Exists(baseDir)) 
            {
                Console.WriteLine("Input directory not found.");
                Environment.Exit((int)ExitScenarios.ARGFAIL);
            }
        }

        static public string GetMD5HashFromFile(string fileName)
        {
            // using blocks ensure the stream and hash object are disposed
            // even if ComputeHash throws; FileAccess.Read avoids failing
            // on read-only files.
            using (MD5 md5 = new MD5CryptoServiceProvider())
            using (FileStream file = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            {
                byte[] retVal = md5.ComputeHash(file);

                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < retVal.Length; i++)
                {
                    sb.Append(retVal[i].ToString("x2"));
                }
                return sb.ToString();
            }
        }
    }

    public enum ExitScenarios
    {
        SUCCESS,
        ARGFAIL,
        READWRITEERROR
    }
}
```


----------



## Kreij (Feb 17, 2012)

What if two different files generate the same hash? (I know it's unlikely, but it is possible.)


----------



## FordGT90Concept (Feb 17, 2012)

Neat idea but...I would recommend doing a 1:1 compare...

1. If filesize is equal, move on to step two; else, not identical.
2. Read one chunk (4096 bytes) from file A and file B into separate buffers.
3. Use a for loop to make sure all bytes are equal: fileAbuffer[i] == fileBbuffer[i]
4. If an inequality is ever discovered, break the loop, close the file streams, and output non-match.  If they are equal, rinse and repeat until End of File is reached.

Using this method, you could even have a percentage match option where you count the differences and use the filesize as the denominator in creating the percent.  If not above a user specified percentage (100% if not given), reject it.


The method described above would likely be faster than md5.ComputeHash(file) on large files because it could stop on the first byte.
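The percentage-match option could be sketched roughly like this (a minimal sketch; `PercentMatch` is a hypothetical name, and per step 1 it assumes both files already passed the equal-filesize check):

```csharp
using System;
using System.IO;

static class FuzzyCompare
{
    // Returns the fraction of matching bytes (0.0 to 1.0), assuming
    // both files have the same length (checked in step 1).
    public static double PercentMatch(string pathA, string pathB)
    {
        const int ChunkSize = 4096;     // one chunk per step 2
        byte[] bufA = new byte[ChunkSize];
        byte[] bufB = new byte[ChunkSize];
        long differences = 0;
        long total = 0;

        using (FileStream fsA = File.OpenRead(pathA))
        using (FileStream fsB = File.OpenRead(pathB))
        {
            int read;
            while ((read = fsA.Read(bufA, 0, ChunkSize)) > 0)
            {
                // Same-length files on disk fill the same number of bytes.
                fsB.Read(bufB, 0, read);
                for (int i = 0; i < read; i++)
                    if (bufA[i] != bufB[i])
                        differences++;
                total += read;
            }
        }
        return total == 0 ? 1.0 : 1.0 - (double)differences / total;
    }
}
```

A caller would then accept the pair as "identical enough" when `PercentMatch(a, b)` is at or above the user-specified threshold (1.0 if not given).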


----------



## xbonez (Feb 20, 2012)

@Kreij: It'll probably crash and burn in case of a hash collision.. lol. I didn't cater for that.

@Ford: Those are some good suggestions. When I first wrote the program, I just needed to filter through the 500 files I had, and threw together the first thing that came to my mind, but I might decide to re-do it (taking some of your suggestions) and package it into a WPF app.


----------



## Dolph (Mar 9, 2012)

I know this thread is a few weeks old, and correct me if I'm wrong, but


```
public static void normalizeDirectories()
        {            
            if (baseDir[baseDir.Length - 1] != '\\')
                baseDir += "\\";
            if (outputDir[outputDir.Length - 1] != '\\')
                outputDir += "\\";
        }
```
 
Shouldn't you be using a substring for that? Considering you're taking the last character of the string and comparing it to what looks like a 2-character string, i.e. '\\', then that will always be true.   '\' != '\\' = true

In the case of the user entering a path with '\\' or even '\', won't it add another '\\'?

EX:  program.exe c:\Projects\  d:\

Wouldn't that see it, and "correct" it to c:\Projects\\\ d:\\\   ?


----------



## FordGT90Concept (Mar 9, 2012)

It is correct.  That's how C# escapes backslashes.  It makes sure that both directories always end with a backslash.  .NET's directory methods like to omit it.
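A quick illustration: in C# source code, '\\' is a character literal containing exactly one backslash, so the comparison in normalizeDirectories() looks at a single character, not two:

```csharp
using System;

class EscapeDemo
{
    static void Main()
    {
        char c = '\\';                 // one backslash character
        string dir = "c:\\Projects";   // the string c:\Projects
        Console.WriteLine(c);          // prints: \
        Console.WriteLine(dir.Length); // prints: 11 (the backslash is one char)

        // Same logic as normalizeDirectories(): append only if missing.
        if (dir[dir.Length - 1] != '\\')
            dir += "\\";
        Console.WriteLine(dir);        // prints: c:\Projects\
    }
}
```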


----------



## Dolph (Mar 10, 2012)

Ahh, my apologies.


----------



## Aquinus (Mar 19, 2012)

Where is the rest of the source? I might be blind, but I'm not seeing where the hash is generated. You could use a different hashing algorithm or a combination of two to mitigate collisions. Maybe an MD5+SHA1 or a SHA256. SHA256 will run fast on newer Sandy Bridge and practically any AMD chip (AMD seems to do encryption pretty well, even against SB.)

Also, if the directory gets too big, the List interface may be inefficient to use. There are a lot of things you can do to speed this up, but I don't know how fast it is already. For *really* large directories, using set-sized arrays stored in a 2D array along with threading to process the even and odd items from one array could speed it up, since not a whole lot of locking is required (just on the output "list").

Good work though, it's been some time since I've written some C#.

Have you stored any numbers to see how many files your application can do per second? You could use that to judge improvements or slowdowns in your application.


----------



## Kreij (Mar 19, 2012)

The hash is being generated in the GetMD5HashFromFile() method.

The odds of a hash collision are so infinitesimally small it's not really worth testing for it.

Lists (the List class) are extremely fast for sorting and searching. What would you change, Aquinus?


----------



## Aquinus (Mar 19, 2012)

Kreij said:


> The hash is being generated in the GetMD5HashFromFile() method.
> 
> The odds of a hash collision are so infinitesimally small it's not really worth testing for it.
> 
> Lists (the List class) are extremely fast for sorting and searching. What would you change, Aquinus?



I would use a 2D array and define a new "row" of static size every time the previous row is filled. That way you're saving space over the List class and you're not resizing the array every time you exceed the maximum size. The overhead is small and you stick with just what you need. The List class has a lot of overhead to expand beyond a standard array, where a regular array should do just as well.

If the OP were to post the full source up, I wouldn't mind trying to alter it. I would have to install Visual Studio again, and I wouldn't be able to get to it until I get back from work later today, which is also dependent on how fussy the daughter is when I get home.

It is my understanding that the List is being used just to store values, not sort them, in this instance. I'm just saying it can be optimized if it takes a long time for large numbers of files.


----------



## Kreij (Mar 19, 2012)

You are right, Aquinus, but it is implementation dependent. You know this, but I'll add the details for anyone else who might read this ...

2D arrays are faster if the upper limit of the array is known beforehand (count the number of files prior to creating the array), so no array resizing has to occur.
String[] myArray = new String[500];

A generic List<> is also faster if the bounds are known:
List<string> myList = new List<string>(500) as opposed to just initializing an empty List.
The .NET implementation of List actually uses an array behind the scenes.

Using for loops on both Lists and arrays is faster than using foreach, but again the bounds need to be known beforehand.

Lists are a bit slower as they do bounds checking on every access, while with a plain array in a simple for loop the JIT can often eliminate the bounds check.
Lists also have a slightly larger memory footprint than arrays.

The question then becomes: is the performance increase enough to warrant the additional coding to implement arrays over generic Lists? That is completely dependent upon the implementation and what is occurring within the for loop. Complex calculations in the for loop may make the performance gain from arrays irrelevant (when each iteration is analyzed).

Personally, I usually use Lists as the code is cleaner, but then I am rarely using them for thousands or millions of elements, and the difference would be some number of milliseconds.
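Concretely, pre-sizing as described above would be a one-line change in the OP's program (a sketch; the capacity is only a hint that avoids internal array resizes, it doesn't cap the list):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class PresizeDemo
{
    static void Main()
    {
        string baseDir = ".";  // example directory; the OP reads this from args
        string[] files = Directory.GetFiles(baseDir);

        // Capacity hint: the list will never hold more hashes than there
        // are files, so no internal array resizing occurs while adding.
        List<string> hashes = new List<string>(files.Length);

        Console.WriteLine(hashes.Capacity >= files.Length); // True
        Console.WriteLine(hashes.Count);                    // 0
    }
}
```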

On a side note, it looks to me like all the code is in the OP. Am I overlooking something?


----------



## Aquinus (Mar 20, 2012)

Kreij said:


> You are right, Aquinus, but it is implementation dependent. You know this, but I'll add the details for anyone else who might read this ...
> 
> 2D arrays are faster if the upper limit of the array is known beforehand (count the number of files prior to creating the array), so no array resizing has to occur.
> String[] myArray = new String[500];
> ...



Great post. You're right, my stupid mac just likes hiding the scroll bars on textareas.


----------



## FordGT90Concept (Mar 20, 2012)

Kreij said:


> Personally, I usually use Lists as the code is cleaner, but then I am rarely using them for thousands or millions of elements, and the difference would be some number of milliseconds.


I used lists for animation tables having 32,000+ ushort values.  It was quick.  I used the default start size too (I think it's 64).  Performance is getting less and less important when memory is so fast and plentiful. XD




Kreij said:


> On a side note, it looks to me like all the code is in the OP. Am I overlooking something?


Yeah, the haystack (folders/files)!


----------



## xbonez (Mar 21, 2012)

I tend to stay away from micro-optimizations. They are almost always pointless unless you work in an extremely time sensitive context.

As for using another hashing algorithm: The SHA2 family is a better hashing algorithm for security purposes, primarily because it is not as fast as MD5. This makes brute forcing or generating rainbow tables for SHA2 hashes much slower. SHA2, however, also has the possibility of hash collisions. 

So far, there does not exist a perfect hashing algorithm that guarantees no collisions over an infinitely large sample set.
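For reference, swapping the algorithm is a one-line change in .NET. A sketch of an SHA-256 variant of the OP's helper (the method name is ours, not from the OP's source):

```csharp
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class Hashing
{
    public static string GetSHA256HashFromFile(string fileName)
    {
        // SHA256.Create() returns the default SHA-256 implementation;
        // everything else mirrors the OP's GetMD5HashFromFile.
        using (SHA256 sha = SHA256.Create())
        using (FileStream file = File.OpenRead(fileName))
        {
            byte[] hash = sha.ComputeHash(file);
            StringBuilder sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
                sb.Append(b.ToString("x2"));
            return sb.ToString();  // 64 hex characters instead of MD5's 32
        }
    }
}
```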



> "More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason — including blind stupidity." — W.A. Wulf





> "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." — Donald Knuth


----------



## Aquinus (Mar 21, 2012)

xbonez said:


> I tend to stay away from micro-optimizations. They are almost always pointless unless you work in an extremely time sensitive context.
> 
> As for using another hashing algorithm: The SHA2 family is a better hashing algorithm for security purposes, primarily because it is not as fast as MD5. This makes brute forcing or generating rainbow tables for SHA2 hashes much slower. SHA2, however, also has the possibility of hash collisions.
> 
> So far, there does not exist a perfect hashing algorithm that guarantees no collisions over an infinitely large sample set.



It depends on what kind of data you're working with and what language you're using. I mostly write PHP and a lot of SQL. The PHP usually runs fast enough unless you're going through a large array like a huge SQL result set, but on the other hand I could have SQL that takes a long time to run because I might be joining 6 or 8 very large tables to each other. So optimizations will depend, which is why I asked how many files he was comparing, because if you're doing 100,000+ files the linear list search may become inefficient for what you're trying to do.

But you're right, premature optimizations are usually a bad idea.

Also, nice quote by Donald Knuth.


----------



## djxinator (Mar 21, 2012)

I know it's a bit off topic, and I have pretty much NO IDEA what's going on in here, but it's sparked up an old interest of mine.

Is it feasible to self-teach C#, and if so, how long to get a basic understanding and foundation for creating simple yet useful applications?


----------



## Kreij (Mar 21, 2012)

Yes, I taught myself C# (and all the rest of languages I know for that matter).
If you are familiar with OOP concepts and programming I would estimate you would be reasonably comfortable with it in a couple of weeks.


----------



## FordGT90Concept (Mar 21, 2012)

Aquinus said:


> So optimizations will depend, which is why I asked how many files he was comparing, because if you're doing 100,000+ files the linear list search may become inefficient for what you're trying to do.


The hard drive/solid state drive itself is the bottleneck, not the code.  You could spend days improving the code but performance gains will be virtually nonexistent when 99.9% of the time is spent reading the files.


The most efficient method of comparison (100% match) would be a hybrid:
1) check filesize (file system lookups are fast and simple)
2) hash (hashes spot check the file and take a snapshot of it)
3) if there is a hash collision (match), do 1:1 compare to be 100% certain it is a match.

This is slower than performing a single check but it is faster than always reading the entire file too.


----------



## Kreij (Mar 21, 2012)

@Ford
I don't think that step #3 is even needed. The probability of two MD5 hashes from differing files colliding is what, 2^-128 ?
Even if you were hashing a million files, that's about 5x10^11 pairs, so the chance of any collision among differing files is still on the order of 10^-27 — astronomically small.


----------



## FordGT90Concept (Mar 21, 2012)

Actually, I just did a test using points 1-3 above.  Times for each method:
1) 0.0010000 seconds
2) *9.4845425 seconds*
3) *8.9955145 seconds*
Result = Match

Second run, just to make sure it was no fluke:
1) 0.0000000 seconds
2) *10.6946117 seconds*
3) *8.1694673 seconds*
Result = Match

Both files are 594,778,936 bytes (567 MiB).  Original filename: C:\Program Files (x86)\Mass Effect 3\BIOGame\DLC\DLC_HEN_PR\CookedPCConsole\Default.sfar

I'm going to change 1 byte and see if that impacts performance.  I will have to test 2 and 3 separately because 2 won't fall through to 3 if the hash isn't a match.  In order to give 2 the best possible chance, the byte I modify will be at the very end so 3 must process the entire file to find it (when in actuality, it could stop on the first byte)...

File: b
Offset: 594778935
Old Value: 0x7B
New Value: 0x00

1 & 2 test:
1) 0.0000000 seconds
2) *26.9975442 seconds*
Result = FailedFileHash

1 & 3 test:
1) 0.0010000 seconds
3) *12.2507007 seconds*
Result = FailedFile

That's no joke, it's 220% faster and 100% accurate every time!  


To prove that method 3 can return no match in an instant, I'm undoing the change above and adding the following:
File: b
Offset: 1048576 (1 MiB in)
Old Value: 0xB0
New Value: 0x00

1 & 2 test:
1) 0.0010001 seconds
2) *24.3733941 seconds*
Result = FailedFileHash

1 & 3 test:
1) 0.0010001 seconds
3) *0.0610035 seconds*
Result = FailedFile

Hash. Got. Owned.  1:1 compare with break was 39,954% faster!



```
using System;
using System.IO;
using System.Security.Cryptography;

namespace ConsoleApplication1
{
    public enum CompareResult : sbyte
    {
        Match = 0,
        FailedFileSize = -1,
        FailedFileHash = -2,
        FailedFile = -4
    }
    class Program
    {
        static void Main(string[] args)
        {
            string a = @"C:\Users\Admin\Desktop\a";
            string b = @"C:\Users\Admin\Desktop\b";

            DateTime start = DateTime.Now;
            Console.WriteLine("       Result: " + DoCompare(a, b).ToString());
            Console.WriteLine("   Total time: " + (DateTime.Now - start).ToString());
            Console.ReadKey();
        }

        public static CompareResult DoCompare(string a, string b)
        {
            DateTime start = DateTime.Now;
            if (CompareFileSize(a, b))
            {
                Console.WriteLine("FileSize time: " + (DateTime.Now - start).ToString());

                start = DateTime.Now;
                if (CompareFileHash(a, b))
                {
                    Console.WriteLine("FileHash time: " + (DateTime.Now - start).ToString());

                    start = DateTime.Now;
                    if (CompareFile(a, b))
                    {
                        Console.WriteLine("    File time: " + (DateTime.Now - start).ToString());
                        return CompareResult.Match;
                    }
                    else
                    {
                        Console.WriteLine("    File time: " + (DateTime.Now - start).ToString());
                        return CompareResult.FailedFile;
                    }
                }
                else
                {
                    Console.WriteLine("FileHash time: " + (DateTime.Now - start).ToString());
                    return CompareResult.FailedFileHash;
                }
            }
            else
            {
                Console.WriteLine("FileSize time: " + (DateTime.Now - start).ToString());
                return CompareResult.FailedFileSize;
            }
        }

        public static bool CompareFileSize(string a, string b)
        {
            // Equal length is a cheap prerequisite for identical content.
            return new FileInfo(a).Length == new FileInfo(b).Length;
        }
        public static bool CompareFileHash(string a, string b)
        {
            MD5 md5 = new MD5CryptoServiceProvider();

            FileStream fs = new FileStream(a, FileMode.Open, FileAccess.Read);
            byte[] ah = md5.ComputeHash(fs);
            fs.Close();

            fs = new FileStream(b, FileMode.Open, FileAccess.Read);
            byte[] bh = md5.ComputeHash(fs);
            fs.Close();

            for (byte i = 0; i < ah.Length; i++)
            {
                if (ah[i] != bh[i])
                    return false;
            }
            return true;
        }
        public static bool CompareFile(string a, string b)
        {
            const int BUFFER_SIZE = 4096;
            byte[] buffera = new byte[BUFFER_SIZE];
            byte[] bufferb = new byte[BUFFER_SIZE];

            FileStream fsa = new FileStream(a, FileMode.Open, FileAccess.Read);
            FileStream fsb = new FileStream(b, FileMode.Open, FileAccess.Read);

            int reada, readb;

            bool match = true;
            do
            {
                reada = fsa.Read(buffera, 0, BUFFER_SIZE);
                readb = fsb.Read(bufferb, 0, BUFFER_SIZE);

                // Unequal read counts mean the files differ in length.
                if (reada != readb)
                {
                    match = false;
                    break;
                }

                // Only compare the bytes actually read; a short final
                // chunk would otherwise compare stale buffer contents.
                for (int i = 0; i < reada; i++)
                {
                    if (buffera[i] != bufferb[i])
                    {
                        match = false;
                        break;
                    }
                }

                if (!match)
                    break;
            } while ((reada != 0) && (readb != 0));

            fsa.Close();
            fsb.Close();

            return match;
        }
    }
}
```
Moral of the story: don't waste your time on hashes.  FileSize -> 1:1 compare is best in every scenario.

#2 above should be replaced with a custom function that spot-checks the file (e.g. every 1024 bytes) instead of using hashes.  If the spot-check doesn't find a difference, move on to the 1:1.
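A spot-check along those lines might look like the sketch below (`SpotCheckMatch` is a hypothetical name; the 1024-byte stride is the example figure from above, and a `true` result still needs the full 1:1 compare to be certain):

```csharp
using System.IO;

static class SpotCheck
{
    // Samples one byte every `stride` bytes from each file.
    // Returns false as soon as a sampled byte differs; a true
    // result is only a candidate match and still needs CompareFile.
    public static bool SpotCheckMatch(string a, string b, int stride = 1024)
    {
        using (FileStream fsA = File.OpenRead(a))
        using (FileStream fsB = File.OpenRead(b))
        {
            if (fsA.Length != fsB.Length)
                return false;
            for (long offset = 0; offset < fsA.Length; offset += stride)
            {
                fsA.Seek(offset, SeekOrigin.Begin);
                fsB.Seek(offset, SeekOrigin.Begin);
                if (fsA.ReadByte() != fsB.ReadByte())
                    return false;
            }
            return true;
        }
    }
}
```

When comparing many files against many files, the sampled bytes could also be cached per file so each file is only spot-read once.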


----------



## Kreij (Mar 21, 2012)

Makes sense to me. Sequential reading of a pair of arrays is fast (and of course, it doesn't have to run a hash algorithm even if the whole file is read).

In the last scenario, it would probably be even faster still, as the odds of two differing files being identical for the first 1 MiB are also low, so it would usually find a difference sooner.

Now get in there and optimize it more to make it even faster. 

In a worst-case scenario (say, 5 files with a differing last byte) you would have to do full file reads 10 times for the array method, but only 5 hash runs and then a hash List compare.
Seems in that case they would be a lot closer.


----------



## FordGT90Concept (Mar 21, 2012)

Yeah, 90% of files would be different at offset 0, 9% at offsets 3-31, and 1% in the rest of the file.


I dunno what I could do to make it faster except replace my 2x250 GB HDDs with an SSD.   <10 seconds for a >500 MiB file is pretty fast.




Kreij said:


> In a worse case scenario (let's say 5 files with a differing last byte) you would have to do full file reads 10 times for the array method, but only 5 hash runs and then do a hash List compare.
> Seems in this case they would be a lot closer.


True, which is why the spot-check function would be important.  It could store the spot-check for every file making it much faster when comparing many files to many files.


----------



## Kreij (Mar 21, 2012)

Ford said:
			
		

> I dunno what I could do to make it faster



Multi-threading FTW !!!
(Since disk access is the bottleneck, that would not really improve anything unless the files were on separate disks/controllers.)

BTW ... I still use Nettool all the time. What a spectacular little utility. 
I'm using it today to help troubleshoot a workstation connection problem (logged into the work network from home through TeamViewer).
It's my go-to utility for tier 1 network troubles.
Tier 2 is a big hammer.


----------



## FordGT90Concept (Mar 21, 2012)

Multithreading only benefits memory/CPU-bound tasks.  Like I said before, the bottleneck is the HDD/SSD/RAID.  Multithreading wouldn't improve performance.


Heh, I'm glad someone is using it. XD


----------



## TheMailMan78 (Apr 18, 2012)

Kreij said:


> Yes, I taught myself C# (and all the rest of languages I know for that matter).
> If you are familiar with OOP concepts and programming I would estimate you would be reasonably comfortable with it in a couple of weeks.



I tried to teach myself but I had no basic understanding of any of the concepts. It was like trying to paint without knowing what a paint brush was.


----------



## Kreij (Apr 18, 2012)

TheMailMan78 said:


> I tried to teach myself but I had no basic understanding of any of the concepts. It was like trying to paint without knowing what a paint brush was.



Simple solution. Let us teach you.
It's free but you have to put up with our crap. But hey !! It's free.


----------



## driver66 (Apr 18, 2012)

It's a TRAP !!!!


----------



## Mindweaver (Apr 19, 2012)

Kreij said:


> Simple solution. Let us teach you.
> It's free but you have to put up with our crap. But hey !! It's free.





TheMailMan78 said:


> I tried to teach myself but I had no basic understanding of any of the concepts. It was like trying to paint without knowing what a paint brush was.



Yepper! Ask away!  With new versions coming out soon.. Like VS2011 we all can learn something new.


----------

