Hermetic Word Frequency Counter Advanced Version
Ignoring Common Words in a Particular Language

If there are only a few words that you want the program to ignore, you can put these in the 'Extra words to ignore' textbox. But when counting words in a natural language (e.g. English) you will normally want it to ignore common words (e.g. the, with, my, etc.). These words are read from a words-to-ignore file. Seven such files are provided with the software, containing common words in English, German, Spanish, Italian, French and Portuguese, plus a file with common words in all six languages.
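The effect of ignoring common words can be illustrated with a minimal Python sketch. This is not the program's own code: the function name count_words, its parameters and the tokenizing rule (runs of letters, lower-cased) are assumptions made for illustration only.

```python
import re
from collections import Counter


def count_words(text, common_words=(), extra_words=()):
    """Count words in text, skipping common words and extra words to ignore.

    Hypothetical helper for illustration; the program's own tokenizing
    rules are not described on this page.
    """
    # Merge both sources of words to ignore, lower-cased.
    ignore = {w.lower() for w in common_words} | {w.lower() for w in extra_words}
    # Assumption: a word is a run of letters; digits and punctuation split words.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(t for t in tokens if t not in ignore)


text = "The cat sat with my cat on the mat"
counts = count_words(
    text,
    common_words=["the", "with", "my", "on"],  # stands in for a common words file
    extra_words=["mat"],                       # stands in for 'Extra words to ignore'
)
print(counts.most_common())  # [('cat', 2), ('sat', 1)]
```

In this sketch the common words and the extra words are simply merged into one set, so a word is skipped if it appears in either.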

When the program is first run, it asks whether you wish to load the file of common words in English. If you answer 'Yes' then the file is loaded, and the 'Ignore common words' part of the 'Settings' window looks like the image below:

If you are working only with English text and you wish to ignore common words in English then you can skip the rest of this page.

If on startup you answer 'No' then the file of common words in English is not loaded, and the 'Ignore common words' part of the 'Settings' window looks like this:

If you are scanning files in, for example, German, and you wish to ignore common words in German, then select that language from the drop-down list. After you confirm your choice, the file is loaded and the 'Ignore common words' part of the 'Settings' window looks like this:

The common-words-to-ignore files in each of the six languages can be modified by adding or removing words, but the file name and location should not be changed. Words do not have to be in alphabetical order or on separate lines. Words within a single line must be separated by a space (not a comma or a comma+space). No distinction is made between upper and lower case in words to be ignored. For example, if 'and' is to be ignored then 'AND' will also be ignored. The common words file must consist only of text (that is, binary files such as MS Word files cannot be used).
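The file format rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's actual loader: the names parse_ignore_words and is_ignored are hypothetical, and the way a comma-separated entry is handled here (it stays one unusable token) is a property of this sketch, not a documented behavior of the program.

```python
def parse_ignore_words(text):
    """Return the set of words to ignore, lower-cased for case-insensitive matching."""
    # str.split() with no argument splits on any run of whitespace, so words
    # may share a line (separated by spaces) or sit on separate lines,
    # in any order.
    return {word.lower() for word in text.split()}


def is_ignored(word, ignore_words):
    """Case-insensitive membership test against the parsed set."""
    return word.lower() in ignore_words


sample = "the AND with\nmy of\n"
ignore_words = parse_ignore_words(sample)
print(is_ignored("AND", ignore_words))   # True
print(is_ignored("and", ignore_words))   # True
print(is_ignored("Jack", ignore_words))  # False

# In this sketch a comma is not a separator, so "his,her" stays one token
# and neither "his" nor "her" would be ignored.
print(parse_ignore_words("his,her"))     # {'his,her'}
```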

It is possible to use a custom words-to-ignore file. This can be a modification of one of the common-words-to-ignore files, a file of words to ignore in some other language, or simply a file of any words you wish to ignore, whether or not they are common words in any language. To specify such a file as the words-to-ignore file click on the 'Words-to-ignore files' button and select the file in the usual way, to obtain, for example:

A words-to-ignore file contains just words to ignore, not phrases to ignore. But if you are counting all phrases in a file or set of files then there is a way to eliminate phrases, as follows: In a words-to-ignore file put all the words in the phrases you wish to exclude, and in the 'Count all phrases' panel check the 'Exclude phrases consisting only of words to ignore' checkbox. This may have an unintended side effect. For example, if you wish to ignore the phrase 'Jack loves Jill', and you include 'Jack', 'loves' and 'Jill' in a words-to-ignore file, then (when counting phrases) the program will also ignore the phrase 'Jill loves Jack'.
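The side effect described above follows from the rule itself, as this minimal Python sketch shows. The functions phrases and keep_phrase are hypothetical names, and the sketch assumes a phrase is a fixed-length run of consecutive words; how the program actually forms phrases is not described on this page.

```python
def phrases(words, n):
    """Return all runs of n consecutive words as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def keep_phrase(phrase, ignore_words):
    """Drop a phrase only if every one of its words is a word to ignore."""
    return not all(w.lower() in ignore_words for w in phrase)


# Words taken from the phrase we want to exclude: 'Jack loves Jill'.
ignore_words = {"jack", "loves", "jill"}

words = "Jack loves Jill but Jill loves Jack and Jill loves tea".split()
kept = [" ".join(p) for p in phrases(words, 3) if keep_phrase(p, ignore_words)]

print("Jack loves Jill" in kept)  # False: excluded, as intended
print("Jill loves Jack" in kept)  # False: excluded as a side effect
print("Jill loves tea" in kept)   # True: 'tea' is not a word to ignore
```

Because the test looks only at which words a phrase contains, not at their order, any phrase built entirely from the listed words is excluded.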

As stated above, seven files are provided, containing common words in English (cwds_en.txt, 353 words), German (cwds_de.txt, 373), French (cwds_fr.txt, 325), Italian (cwds_it.txt, 273), Spanish (cwds_es.txt, 277), Portuguese (cwds_pt.txt, 276), and a file of common words in all six languages (cwds_en_de_es_pt_it_fr.txt). These files are in the folder containing the program files (created during program installation), and there is a download link in the Windows Explorer program menu after installation to a ZIP file containing all seven files.

These files are all ANSI files, and the output file is also an ANSI file. Words-to-ignore files which are UTF-8 encoded can also be used.
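If you prepare or check words-to-ignore files with your own scripts, the two encodings can be handled as in the Python sketch below. It rests on two assumptions that this page does not confirm: that 'ANSI' means the Windows-1252 code page (common on Western European Windows systems), and that trying UTF-8 first is an acceptable way to tell the encodings apart. The function name decode_ignore_file is hypothetical, and this is not a description of how the program itself detects the encoding.

```python
def decode_ignore_file(raw):
    """Decode the raw bytes of a words-to-ignore file to text."""
    try:
        # "utf-8-sig" decodes UTF-8 and strips a leading byte-order mark if present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: an 'ANSI' file is Windows-1252 encoded.
        return raw.decode("cp1252")


word = "f\u00fcr"  # the German word 'für', written with an escape for portability
ansi_bytes = word.encode("cp1252")
utf8_bytes = word.encode("utf-8")

# Both encodings of the same word decode to the same text.
print(decode_ignore_file(ansi_bytes) == decode_ignore_file(utf8_bytes))  # True
```

A file containing only unaccented ASCII letters decodes identically under both encodings, so the distinction matters only for words with accented or other non-ASCII characters.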

When a file of words-to-ignore is selected, it is added to a drop-down list just below the language selection drop-down list. This allows you to switch between files of words-to-ignore (e.g., from words-to-ignore in English to words-to-ignore in Spanish, and back again). A change of selection in one of these two drop-down lists may change the selection in the other. Coordination between the two of them depends on the files for common words in a particular language having the same file names as those stated above.
