A Customizable Word Count Program for Windows

This software scans a text or text-like file (including HTML and XML files encoded in ANSI or UTF-8, but not PDF or MS Word .doc files) and counts the number of occurrences of each distinct word (optionally ignoring common words such as "the" and "this"). You can specify exactly what counts as a word (e.g., whether words may contain hyphens or numerals). The words found can be listed alphabetically or by frequency, with the rank and frequency count displayed for each word.

There are two versions of this word count software: basic (WFC) and advanced (WFCA, which does everything that WFC does, except that at present it does not read text files encoded in UTF-8). The main differences are that WFC counts words in only one text or text-like file at a time, whereas WFCA counts words in multiple files in a single operation and also counts specified phrases. If you need to count words in only one file at a time, then WFC may be what you need. If you have many files, or need more options and greater functionality, then you need WFCA. Click on this link for the WFCA page.

Here is a typical screenshot, showing the results of ascertaining word counts in a 2.63 MB file, with common words ignored and with the words sorted by frequency:

[Screenshot: Hermetic Word Frequency Counter]

There is theoretically no limit on the size of an input file or the number of words in a file.

ANSI is the default single-byte text encoding on your PC. UTF-8 is a variable-byte-length encoding of Unicode characters, often used in HTML and XML files.

This software counts words in text and text-like files (including HTML and XML files) in which the text is encoded in ANSI or UTF-8. It does not act directly on MS Word .doc files, which are binary files; such files can be scanned if they are first saved as "Plain Text" files (see Scannable Files).
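
The difference between the two encodings can be illustrated with a short Python sketch. This is not the program's own code; the function name decode_text is hypothetical, and it assumes that "ANSI" means the Windows-1252 code page (the actual default depends on the Windows locale). The sketch tries UTF-8 first and falls back to the single-byte encoding only if the bytes are not valid UTF-8.

```python
def decode_text(raw: bytes, ansi_codec: str = "cp1252") -> str:
    """Decode raw file bytes as UTF-8 if they are valid UTF-8,
    otherwise fall back to a single-byte 'ANSI' code page."""
    try:
        # "utf-8-sig" behaves like UTF-8 but also drops a leading
        # byte-order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: ANSI is Windows-1252. Bytes that are undefined
        # in that code page are replaced rather than raising an error.
        return raw.decode(ansi_codec, errors="replace")


if __name__ == "__main__":
    print(decode_text("café".encode("utf-8")))
    print(decode_text("café".encode("cp1252")))
```

Trying UTF-8 first is a common heuristic because text written in a single-byte encoding and containing accented characters is rarely valid UTF-8 by accident.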

The program counts the frequencies of all words in the file (or optionally all words other than common words). If you just want to count the occurrences of a single word (or of each word in a set of words, or of any word matching a given pattern) then you can do this with the Advanced Version of this program.

The 'rank' and 'frequency' values may each be included in, or excluded from, the displayed results.

If the output file consists only of words, with no rank or frequency count values, then you can get these either as a list (one word per line) or as a comma-separated list. This is done by making the appropriate selection in the Display format drop-down menu.
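
The counting and display options described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's implementation: the names COMMON_WORDS, count_words, ranked_rows and render are hypothetical, the common-word list is a tiny stand-in, words are lower-cased and split on whitespace, and tied frequencies are ranked in alphabetical order.

```python
from collections import Counter

# Assumption: a tiny stand-in for the program's list of common words.
COMMON_WORDS = {"the", "this", "a", "an", "and", "of"}


def count_words(text, ignore_common=True):
    """Count whitespace-separated words, case-insensitively."""
    words = text.lower().split()
    if ignore_common:
        words = [w for w in words if w not in COMMON_WORDS]
    return Counter(words)


def ranked_rows(counts, sort_by="frequency"):
    """Return (rank, word, frequency) rows. Rank is always by frequency;
    sort_by only controls the order in which the rows are listed."""
    by_freq = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = [(rank, word, freq)
            for rank, (word, freq) in enumerate(by_freq, start=1)]
    if sort_by == "alphabetical":
        rows.sort(key=lambda row: row[1])
    return rows


def render(rows, show_rank=True, show_frequency=True, comma_separated=False):
    """Format rows as text. The comma-separated form applies only when
    neither rank nor frequency is shown (words only)."""
    if comma_separated and not show_rank and not show_frequency:
        return ",".join(word for _, word, _ in rows)
    lines = []
    for rank, word, freq in rows:
        fields = []
        if show_rank:
            fields.append(str(rank))
        fields.append(word)
        if show_frequency:
            fields.append(str(freq))
        lines.append("\t".join(fields))
    return "\n".join(lines)


if __name__ == "__main__":
    counts = count_words("zebra zebra apple the")
    print(render(ranked_rows(counts)))
    print(render(ranked_rows(counts), False, False, comma_separated=True))
```

Computing the rank before any alphabetical sort keeps each word's rank tied to its frequency, whichever listing order is chosen.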

What Counts as a Word?

Hermetic Word Frequency Counter is compatible with Windows 7.

This program is intended mainly for counting words in natural language text and in documents containing natural language text plus markup such as found in HTML and XML files. Thus there are some restrictions on which characters are admissible in words, and (for some characters) whether they may occur at the start or end of a word.

The word 'word' usually means a word in a natural language such as English or German, but for this software it has an extended meaning. A word is a sequence of characters bounded by spaces, but it is necessary to specify which characters exactly are admissible in words.

The following characters are not admissible in words: plus signs (+), semicolons (;), double quotes (") and left and right angle-brackets (< and >). In the Advanced Version the tilde (~) is also not permitted.

A word may begin or end with any alphabetic character, or with any admissible non-alphabetic character that is allowed in the Settings window, except for an apostrophe or a period. In the Advanced Version a word also may not begin or end with a comma or a parenthesis.

Periods and @-signs may (if allowed in the Settings window) occur within a word, which lets you count email addresses. Similarly, permitting colons, forward slashes, hyphens, underscores and periods in a word lets you count URLs.
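
The rules above can be sketched as a small Python tokenizer. This is a hedged illustration rather than the program's actual algorithm: the function tokenize and its extra_chars parameter (a stand-in for the choices made in the Settings window) are hypothetical, and the sketch assumes that any character which is neither alphabetic nor explicitly allowed ends the current word.

```python
# Characters the text says are never admissible in words.
NEVER_ALLOWED = set('+;"<>')
# Characters that may not begin or end a word (basic version).
EDGE_EXCLUDED = "'."


def tokenize(text, extra_chars=""):
    """Split text into words made of alphabetic characters plus any
    characters listed in extra_chars (hypothetical stand-in for the
    Settings window)."""
    forbidden = set(extra_chars) & NEVER_ALLOWED
    if forbidden:
        raise ValueError(f"never admissible in words: {sorted(forbidden)}")
    allowed = set(extra_chars)
    words, current = [], []
    for ch in text + " ":  # the trailing space flushes the last word
        if ch.isalpha() or ch in allowed:
            current.append(ch)
            continue
        # Any other character is treated as a word boundary (assumption).
        word = "".join(current).strip(EDGE_EXCLUDED)
        if word:
            words.append(word)
        current = []
    return words


if __name__ == "__main__":
    print(tokenize("Mail bob.smith@example.com today.", extra_chars=".@"))
    print(tokenize("See http://example.com/a-b_c now", extra_chars=":/-_."))
```

With periods and @-signs allowed, the email address survives as one word while the sentence-final period is stripped from "today"; with colons, slashes, hyphens, underscores and periods allowed, the URL is kept whole.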

Because the Advanced Version allows commas and parentheses within words, the names of chemical compounds can be treated as single words, e.g., 2,5-dimethoxy-4-(N-propyl-thio)benzaldehyde. (For more details on this possibility see here.)
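
As a rough illustration of the chemical-name case, the following Python sketch uses a regular expression that admits letters, digits, hyphens, commas and parentheses inside a word, and then strips the characters that may not begin or end a word. The pattern CHEM_WORD and the function chem_words are hypothetical and are not taken from the Advanced Version itself; the restriction to ASCII letters is a simplifying assumption.

```python
import re

# Assumption: letters, digits, hyphens, commas and parentheses are the
# characters enabled for words in this example.
CHEM_WORD = re.compile(r"[A-Za-z0-9,()\-]+")

# Characters that may not begin or end a word in the Advanced Version,
# according to the text: apostrophe, period, comma, parentheses.
EDGE_EXCLUDED = "'.,()"


def chem_words(text):
    """Return words, keeping internal commas and parentheses intact."""
    words = []
    for match in CHEM_WORD.finditer(text):
        word = match.group().strip(EDGE_EXCLUDED)
        if word:
            words.append(word)
    return words


if __name__ == "__main__":
    sample = "Add 2,5-dimethoxy-4-(N-propyl-thio)benzaldehyde, then stir."
    print(chem_words(sample))
```

The comma that follows the compound name in the sample sentence is removed because it sits at the end of the word, while the comma and parentheses inside the name are preserved.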

