Saturday, January 2, 2010

Learning F# (via sample application) — Looking for file duplicates in your filesystem

Intro — why I did write it

Time goes by, hard drives become larger and larger and it is no more strange to have more than 500 Gib of information you have downloaded or copied. After even a year spent working with PC, you are likely to have too much data to remember where you put it.

Surely, the right way to handle this is to organize your data properly, so that same (or similar) files are stored in the shared location and then duplicates are easily noticed.

The lazy way is to run a program to find duplicates for you.


Any programming language will do to solve the problem. I did it with F#. This post will only be useful for people learning F#, as I do not provide a user-friendly program (yet). And running the code (luckily) requires programming skills.

The cool feature in F# is F# interactive, which is great for rapid development and doing research. Unlike typical write-it, compile-it, run-it scenario you can go with write-it and run-it-when-you-want approach.

Heuristic algorithm to find file duplicates

Straight-forward: equal files are those with equal content, names and properties.

Let's lower this bound: properties, such as last modification date, does not matter. Name as well: you might have renamed that downloaded file half a year ago. Content ... matters.

Let's go further: if content equals, their hash values are equal too. But as reading those 500 Gib of data alone will take tremendous amount of time, let's consider first N bytes only. What we will end up with is a list of estimated duplicates. (Afterall, for those candidates we can perform a full content check).


open System
open System.IO
open System.Security.Cryptography
open System.Diagnostics

// List all files in given directory and all the subdirectories
// Surelly, there might be some files you are not given access to
// So, ignore all the exceptions
let rec allFiles dir =
            // return the list of files in the given directory
            dir |> Directory.GetFiles           
         with e -> [||])
            // dive into subfolders
            dir |> Directory.GetDirectories |> allFiles |> Seq.concat
         with e -> Seq.empty)

// As we will work with I/O, that operations can take lots of time
// Unfortunately, at this moment you cannot tell just by looking at F# Interactive window whether it is working or not
// As F# sequences are lazy, we can monitor when their elements are accessed by providing a 'proxy' sequence
// Visualize function takes a report function and a sequence to watch
// it calls the report function periodically (for not to bother you too often)
// report function has 3 arguments: current index of the sequence, current element, time span since start of enumeration
// tsReportInterval is minimal interval between subsequent calls of report function
let tsReportInterval = TimeSpan.FromMilliseconds(500.0)
let Visualize reportfunc sq =
    seq {
        let start = DateTime.Now                            // moment of start
        let dt = ref DateTime.Now                           // moment of start or last reporting
        let num = ref 0                                     // index of the current element
        for i in sq ->
            let cur = DateTime.Now
            if (cur.Subtract(!dt) > tsReportInterval) then  // if it's time to report
                reportfunc !num i (cur.Subtract(start))     
                dt := cur                                   // update time of last reporting
            num := !num + 1
            i                                               // yield element

// specify folders for consideration
// surely, yours will be different
let folders = [@"C:\Share\"; @"d:\"; @"h:\"; @"v:\"; @"w:\"; @"x:\"; @"y:\"; @"z:\"];

// list of all the files to consider
let files =
        |> allFiles    // map each path to sequence files in that hive
        |> (Visualize (fun _ i ts -> printfn "%A %A" ts i)) // put displaying proxy in between
        |> Seq.concat           // concatenate that sequences into one
        |> Seq.toArray          // now transform that lazy sequence into the final array of files

// The hashing algorithm class:
let sha = new SHA1CryptoServiceProvider()
// Number of bytes to consider
let maxLength = 10000
// buffer for reading
let (buf : byte[]) = Array.create maxLength 0uy

// retrieves a signature for a file
// return value is (exists : bool, length : ulong, signature : byte[])
let fileSign file = 
    let mutable res = (false, 0L, null) // by default, it doesn't exist
        use fs = File.OpenRead(file)
        let flLen = fs.Length
        let len = int (min (int64 maxLength) flLen)

        // read 'len' bytes        
        let mutable pos = 0
        while (pos < len) do
            let chunk = fs.Read(buf, pos, len - pos)
            pos <- pos + chunk

        // get signature            
        let sign = sha.ComputeHash(buf, 0, len)
        // store new result
        res <- (true, flLen, sign)        
    with e -> e |> ignore
    res                                 // return

// how many files are there ?
let total = files.Length

// gets percentage in float
let get_perc cur total =
    100.0 * float cur / float total
// data for all the files
let data =
        |> Visualize (fun x i ts -> printfn "%6.2f %A %A" (get_perc x total) ts i)
        |> (fun f -> (f, fileSign f))
        |> Seq.toArray
// now for each file we have (name, (exist, length, signature)) in an array

// let's value our time: small duplicates are not worth our attention
// Therefore, let's invent a threshold on size:
let considerableLength = int64 (2 * 1024 * 1024) // 2 Mib
// (Set to 0 if you do not want such)

// Let's make the data presentable:
// filter our unexisting and small,
// represent it as (key, value) pairs,
// to be more specificic, of ((len, signature), name)
// because (len, signature) tuple represent file's uniqueness
let characteristics = 
        |> Seq.filter (fun (name, (exist, len, sign)) -> exist)
        |> Seq.filter (fun (name, (exist, len, sign)) -> len >= considerableLength)
        |> (fun (name, (exist, len, sign)) -> ((len, sign), name))
        |> Seq.toArray

// Let's build an array of keys
let keys =
        |> (fun (k, v) -> k)
        |> Set.ofSeq
        |> Seq.toArray

// here is a funcation that gives array of names for the given key
let names key =
        |> Seq.filter (fun (k, v) -> k = key)
        |> (fun (prop, name) -> name)
        |> Seq.toArray

// total number of keys
let keyLength = keys.Length

// duplicates
let dups = 
        |> Visualize (fun x _ _ -> printfn "%6.2f" (get_perc x keyLength))
        |> (fun (len, sign) -> (len, names (len, sign)))
        |> Seq.filter (fun (len, names) -> names.Length > 1)
        |> Seq.toArray

// Sort dups by size descending
    |> Array.sortInPlaceWith (fun (len1, _) (len2, _) -> Math.Sign(int64 (len2 - len1)))
// path to desktop
let desktop = Environment.GetFolderPath(Environment.SpecialFolder.Desktop)
// path to report
let report = Path.Combine(desktop, DateTime.Now.ToString("D") + " - duplicates.txt");

// write dups to files
let res = new StreamWriter(report)
for el in dups do
    let (size, names) = el
    fprintfn res "size = %A" size
    for n in names do
        fprintfn res "%s" n
    fprintfn res "-----------------------------------------"

Please criticize and comment (English grammar included). Your feedback is very welcome.

No comments:

Post a Comment