Problem: You have two large sets of ANSI text files. You suspect that some of the files in one set are duplicates, or near duplicates, of files in the other set. You want to remove duplicate files so that there are no files common to both sets. But the number of files in these sets is so large (say, ten or more in each set) that it is impractical to compare the two sets by inspecting each file in one set and comparing it with all the files in the other set. A program to find exact duplicates might give some matches, but that depends on the files being exactly the same. What if some files in one set are almost the same as files in the other set, or are the same except for some header text at the start?
Have a problem such as this that requires a custom program? Then please contact us to discuss the problem.
Solution: You put each set of files into a separate folder. You run Duplicate Text Finder and specify the two folders. You click on the Start button and the program begins comparing all the files in the 2nd folder with the files in the 1st folder. If it finds any duplicates, it tells you.
When the program starts up the first time it looks like this:
How to Test the Trial Version
You can test the trial version as follows: In some temporary folder create two subfolders, say, 1st folder and 2nd folder. Download the zip file dtf_test_files_1st_folder.zip and extract the files into the subfolder 1st folder. Then download the zip file dtf_test_files_2nd_folder.zip and extract the files into the subfolder 2nd folder. Download and install the trial version (follow the link below). Run the program and specify 1st folder as the 1st folder and 2nd folder as the 2nd folder. Then click on the 'Start' button and the result should be as shown below:
What the Program Does
When the program is started (after the 1st and 2nd folders are specified) it first reads an initial part of each file in the 1st folder, extracts the words and stores them. Then for each file in the 2nd folder it reads an initial part of that file, extracts the words, and compares them with the words extracted from each of the files in the 1st folder. The words do not have to match exactly, but should be close. If it finds a match then it displays the file names and (optionally) the words from each file which justify the match. If matches are found then you can copy the results to the clipboard or save them to a file, then eliminate the duplicates if you wish.
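The comparison step described above can be illustrated with a minimal Python sketch. This is not the program's actual code: the function names extract_words and similarity are hypothetical, and the standard-library difflib.SequenceMatcher is used here only as a stand-in, since the page does not say how Duplicate Text Finder measures closeness.

```python
import re
from difflib import SequenceMatcher

def extract_words(text, limit=50):
    """Return the first `limit` words of `text`, lowercased (hypothetical helper)."""
    return re.findall(r"[\w']+", text.lower())[:limit]

def similarity(words_a, words_b):
    """Score two word lists from 0 to 100 (assumed scale, not the program's own)."""
    return 100.0 * SequenceMatcher(None, words_a, words_b).ratio()

a = extract_words("The quick brown fox jumps over the lazy dog")
b = extract_words("The quick brown fox jumped over the lazy dog")

# One word differs out of nine, so the score is high but below 100.
print(round(similarity(a, b), 1))
```

In this sketch a single changed word lowers the score only slightly, which is the kind of near match that an exact-duplicate finder would miss.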
There are a few ways to adjust the operation of the program.
- If the files in the 1st folder have some header text which should be passed over before comparison with the files in the 2nd folder, then you can specify the number of words in the files in the 1st folder which should be skipped. Similarly for the 2nd folder. Normally these values will be zero, so that the comparison starts at the beginning of each file.
- The program compares segments of text in each file. The default value for the number of words in the test segment is 50. Normally this should not be changed, but you might find that changing it gives better results for particular sets of files.
- The program looks for matches which are either exact or nearly exact. The threshold for similarity determines how close a match must be in order to be reported. The default value is 90. Lowering this might find more matches, but it might also return some false positives.
- If and when matches are found you can specify how many words in the test segment should be displayed. The default value is 12.
- After specifying the 1st and 2nd folders, you can swap them. To minimize the time required for the comparison, it is recommended that the set with fewer files be placed in the 1st folder.
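To show how the settings listed above might fit together, here is a rough Python sketch of a two-folder comparison. It is an illustration only, under several assumptions: the function and parameter names (find_duplicates, skip_first, skip_second, segment_words, threshold, display_words) are hypothetical, the threshold is treated as a score from 0 to 100, difflib.SequenceMatcher stands in for whatever measure the program really uses, and each file is read whole rather than just its initial part.

```python
import re
from difflib import SequenceMatcher
from pathlib import Path

def segment(path, skip, length):
    """Return `length` words from a text file after skipping `skip` header words."""
    # Assumption: "ANSI" is read as Windows-1252; undecodable bytes are replaced.
    text = Path(path).read_text(encoding="cp1252", errors="replace")
    words = re.findall(r"\S+", text.lower())
    return words[skip:skip + length]

def find_duplicates(first_dir, second_dir, skip_first=0, skip_second=0,
                    segment_words=50, threshold=90, display_words=12):
    """Compare every file in second_dir with every file in first_dir."""
    # Extract and store the test segment of each file in the 1st folder once.
    stored = {p: segment(p, skip_first, segment_words)
              for p in sorted(Path(first_dir).iterdir()) if p.is_file()}
    matches = []
    for q in sorted(Path(second_dir).iterdir()):
        if not q.is_file():
            continue
        words_q = segment(q, skip_second, segment_words)
        for p, words_p in stored.items():
            if not words_p or not words_q:
                continue  # an empty segment gives nothing to compare
            score = 100.0 * SequenceMatcher(None, words_p, words_q).ratio()
            if score >= threshold:
                # Report both names plus the leading words that justify the match.
                matches.append((p.name, q.name, round(score, 1),
                                " ".join(words_p[:display_words]),
                                " ".join(words_q[:display_words])))
    return matches
```

A call such as find_duplicates("1st folder", "2nd folder", skip_second=10) would skip ten header words in each file of the 2nd folder before comparing. Because the stored segments come from the 1st folder, this sketch holds fewer segments in memory when the smaller set is placed there.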
How to obtain the software: A fully-functional version of the Duplicate Text Finder software is available for free download from this website. Click on the following link to go to a web page with further information:
Download Duplicate Text Finder ...
Hermetic Systems Home Page