Hermetic Word Frequency Counter
Scannable Files and Languages Supported

A file upon which the program acts can have any filename extension, but it must be either a docx file or consist entirely of text encoded via the 8-bit Windows-1252 encoding (this is a superset of ISO 8859-1) or via the Unicode encoding UTF-8. The input file typically consists of natural language text, but need not; it can consist of program code (e.g., a C++ source file) or can be an HTML file, an XML file, or more generally any non-binary file.
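As an illustration of the two supported text encodings, the following Python sketch decodes a byte string as UTF-8 and falls back to Windows-1252 if that fails. It is not the program's actual detection method, which this manual does not describe; the function name `decode_text` is hypothetical.

```python
from typing import Tuple


def decode_text(data: bytes) -> Tuple[str, str]:
    """Return (text, codec_name) for bytes assumed to be UTF-8 or Windows-1252.

    Hypothetical helper for illustration only.
    """
    try:
        # Strict UTF-8 decoding fails on byte sequences that are not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fall back to Windows-1252 (Python codec name "cp1252").
        # The few byte values Windows-1252 leaves undefined become U+FFFD.
        return data.decode("cp1252", errors="replace"), "cp1252"
```

Pure ASCII text is valid in both encodings, so this sketch reports it as UTF-8.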

Languages that can be fully encoded using Windows-1252 (as well as UTF-8) include English, Danish, Swedish, Finnish, German, French, Italian, Spanish, Portuguese and Norwegian. Languages whose characters are mostly encodable using Windows-1252 include Dutch and Hungarian. This is not the case for Turkish, Polish, Czech, Russian, Greek or any non-European language.
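To see which characters of a given text fall outside Windows-1252, a check along the following lines can help. This is a minimal sketch using Python's built-in `cp1252` codec; the function name `unencodable_chars` is hypothetical and is not part of this software.

```python
from typing import List


def unencodable_chars(text: str, encoding: str = "cp1252") -> List[str]:
    """List the distinct characters of `text` that `encoding` cannot represent."""
    missing: List[str] = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record each problem character once, in order of first appearance.
            if ch not in missing:
                missing.append(ch)
    return missing
```

For example, Swedish text such as "Smörgåsbord" yields an empty list, while the Hungarian word "erdő" yields the single character "ő".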

Unicode encodings other than UTF-8 are not supported, nor are files in which the text is double-byte encoded (often used to encode Chinese and Japanese text).
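One way to spot some unsupported Unicode files before processing is to look for a UTF-16 or UTF-32 byte order mark at the start of the file. The sketch below is an assumption about how such a pre-check could be written, not a description of what the program does; `unsupported_bom` is a hypothetical name. Files without a byte order mark are not detected by it.

```python
import codecs
from typing import Optional

# UTF-32 marks are tested first because the UTF-32 LE mark begins with
# the same two bytes as the UTF-16 LE mark.
_UNSUPPORTED_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def unsupported_bom(data: bytes) -> Optional[str]:
    """Return the name of a UTF-16/UTF-32 byte order mark found at the start, else None."""
    for bom, name in _UNSUPPORTED_BOMS:
        if data.startswith(bom):
            return name
    return None
```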

Files (other than docx files) containing non-displayable characters, such as documents written with Adobe Acrobat, cannot be processed by this software by reading the file directly. For such files, either (a) convert the file to a Windows-1252 encoded text file and apply this software to that file, or (b) open the document, select all the text, copy it to the clipboard, and then run Count words with clipboard selected as the source. The text on the clipboard is limited to 100,000 characters, so for large files method (a) must be used.
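The choice between the two methods can be sketched in Python as follows, assuming the text has already been extracted from the document by some other tool. The names `fits_on_clipboard` and `to_windows_1252` are hypothetical, and treating the limit as "at most 100,000 characters" is an assumption.

```python
CLIPBOARD_LIMIT = 100_000  # character limit stated in this manual


def fits_on_clipboard(text: str) -> bool:
    """True if the text is short enough for the clipboard method (b)."""
    # Assumption: exactly 100,000 characters is still acceptable.
    return len(text) <= CLIPBOARD_LIMIT


def to_windows_1252(text: str) -> bytes:
    """Encode text for method (a); unencodable characters become '?'."""
    return text.encode("cp1252", errors="replace")
```

Writing the bytes returned by `to_windows_1252` to a file opened in binary mode gives a text file in a supported encoding.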

When clipboard is selected as the source, the program counts the words in the text on the clipboard, not the words in the text in the textbox. In other words, the program does not count words in the textbox, only words in a specified input file or in text on the clipboard. You may compose text in the textbox, but to do a word count on it you must first copy it to the clipboard. That is one reason there is a Copy to clipboard button (which is available only after the software has been activated).

A file that this program can act on (other than a docx file) must consist only of characters encoded as single-byte values in the range 32 through 255, plus the following whitespace characters: linefeeds (byte value 10), carriage returns (13), tab characters (9), backspaces (8) and page breaks (12). As an exception, up to 1% of the bytes (other than zero bytes) are allowed to be "anomalous bytes", that is, bytes with values less than 32 that are not whitespace characters. This exception covers the rare cases where a large text file, for some reason or another, contains a number of anomalous bytes; these should not prevent the program from treating the file as a text file.
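The rule above can be expressed as a short Python check. This is a sketch of one reading of the rule, not the program's actual code: it assumes that any zero byte disqualifies the file, that the 1% allowance is measured against the total number of bytes, and that an empty file passes. The name `looks_like_text` is hypothetical.

```python
# Byte values below 32 that the manual treats as whitespace:
# backspace, tab, linefeed, page break, carriage return.
ALLOWED_CONTROL = {8, 9, 10, 12, 13}


def looks_like_text(data: bytes) -> bool:
    """Apply the 1% anomalous-byte rule to the raw bytes of a file."""
    if not data:
        return True  # assumption: an empty file is trivially acceptable
    if 0 in data:
        return False  # assumption: zero bytes are never tolerated
    anomalous = sum(1 for b in data if b < 32 and b not in ALLOWED_CONTROL)
    # Integer arithmetic avoids rounding issues: anomalous / len(data) <= 1%.
    return anomalous * 100 <= len(data)
```

Under these assumptions, a 100-byte file with one anomalous byte is accepted, while the same file with two anomalous bytes is rejected.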
