Hermetic Word Frequency Counter
Scannable Files

A file upon which the program acts can have any filename extension, but it must consist entirely of text, encoded in ASCII/ANSI format or in Unicode UCS-2 (= UTF-16) format. The input file typically consists of natural language text (English, German, Spanish, etc.), but need not; it can consist of program code (e.g., a C++ source file) or can be an HTML file, an XML file, or more generally any non-binary file.

Files containing non-displayable characters, such as documents created with MS-Word or Adobe Acrobat, cannot be processed by this software by reading the file directly. For files such as these, either (a) save the file as a standard ASCII/ANSI text file (called a "Plain Text" file in MS-Word) or as a Unicode UCS-2 text file, and apply this software to that file, or (b) open the document in Word (or whatever is the appropriate application), select all the text and copy it to the clipboard, then Count word frequencies with clipboard selected as the source. The text on the clipboard is limited to 100,000 characters, so for larger files method (a) must be used, if possible.

More technically, a file that this program can act on must consist only of characters with single-byte values in the range 32 through 255, plus the whitespace characters: linefeeds (byte value 10), carriage returns (13), tab characters (9), backspaces (8) and page breaks (12). There are two exceptions: (i) Unicode UCS-2 encoding contains zero bytes, which are permitted, and (ii) up to 1% of the bytes (other than zero bytes in Unicode text files) are allowed to be "anomalous bytes", that is, bytes with values less than 32 that are not whitespace characters. The second exception covers the rare cases where a large text file contains, for some reason or another, a number of anomalous bytes; these should not prevent the program from treating the file as a text file.
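
The rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not the program's actual code: the function name looks_like_text and the is_unicode flag are hypothetical, and the sketch assumes that the 1% limit is measured against the bytes that remain once the zero bytes of a Unicode file are set aside, and that an empty file counts as text.

```python
# Byte values below 32 that the manual treats as whitespace:
# backspace (8), tab (9), linefeed (10), page break (12), carriage return (13).
ALLOWED_CONTROL_BYTES = {8, 9, 10, 12, 13}


def looks_like_text(data: bytes, is_unicode: bool = False) -> bool:
    """Return True if data passes the text-file rule described above.

    is_unicode is a hypothetical flag meaning "treat this as a UCS-2 file",
    in which case zero bytes are ignored rather than counted as anomalous.
    """
    if is_unicode:
        data = bytes(b for b in data if b != 0)
    if not data:
        return True  # assumption: an empty file is accepted as text
    anomalous = sum(1 for b in data if b < 32 and b not in ALLOWED_CONTROL_BYTES)
    # "Up to 1%" checked with integer arithmetic to avoid rounding surprises.
    return anomalous * 100 <= len(data)
```

For example, looks_like_text(b"Hello, world!\r\n") is accepted, while the same check applied to UCS-2 text without is_unicode=True fails, because the zero bytes are then counted as anomalous.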

There is another Unicode encoding called UTF-8. This program does not work correctly with UTF-8 text files that contain non-English letters. Such files should be opened in a text editor such as WordPad and saved as Unicode UTF-16 text files before being processed.
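
For those who prefer a script to a text editor, the same conversion can be sketched in Python. This is a minimal illustration, not part of the software: the function name utf8_to_utf16 and the file paths are hypothetical, and it assumes that the program accepts little-endian UTF-16 with a leading byte order mark, which is what Windows editors normally write when saving as "Unicode".

```python
def utf8_to_utf16(src_path: str, dst_path: str) -> None:
    """Re-encode a UTF-8 text file as little-endian UTF-16 with a byte order mark."""
    with open(src_path, "rb") as src:
        raw = src.read()
    # "utf-8-sig" also accepts files that begin with a UTF-8 byte order mark.
    text = raw.decode("utf-8-sig")
    with open(dst_path, "wb") as dst:
        dst.write(b"\xff\xfe")              # UTF-16 little-endian byte order mark
        dst.write(text.encode("utf-16-le"))
```

A call such as utf8_to_utf16("essay_utf8.txt", "essay_utf16.txt") writes a new file and leaves the original untouched.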

When the clipboard is selected as the source, the program counts the words in the text on the clipboard, not the words in the text in the textbox. In other words, the program never counts words in the textbox, only words in a specified input file or in text on the clipboard. You may compose text in the textbox, but to do a word count on it you must first copy it to the clipboard. That is one reason there is a Copy to clipboard button (which is available only after the software has been activated).
