Scannable Files and Languages Supported (Phrase Frequency Counter Advanced)

A file that the program acts on can have any filename extension, but it must either be a docx file or consist entirely of text encoded in 8-bit Windows-1252 (a superset of ISO 8859-1) or in the Unicode encoding UTF-8. The input is typically natural-language text in an ANSI text file, but it can also be an HTML file, an XML file or, more generally, any non-binary file.

Languages whose text can be encoded in either Windows-1252 or UTF-8 include English, Danish, Swedish, German, French, Italian, Spanish, Portuguese and Norwegian. Languages whose characters are mostly, but not entirely, encodable in Windows-1252 include Dutch, Hungarian, Estonian, Finnish and Turkish. Polish, Czech, Russian, Greek and non-European languages cannot be encoded in Windows-1252.
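
As a rough way to check whether a particular text fits into Windows-1252, the sketch below lists the characters that have no Windows-1252 byte value. It is an illustration in Python rather than part of the program; the function name unencodable_in_windows_1252 is hypothetical, and "cp1252" is Python's name for the Windows-1252 codec.

```python
def unencodable_in_windows_1252(text: str) -> list[str]:
    """Return the distinct characters of `text` that Windows-1252 cannot
    represent, in order of first appearance (hypothetical helper)."""
    missing: list[str] = []
    for ch in text:
        try:
            ch.encode("cp1252")  # Python's codec name for Windows-1252
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    # Swedish text is fully encodable, so nothing is reported.
    print(unencodable_in_windows_1252("smörgåsbord"))
    # Polish text contains letters that Windows-1252 lacks.
    print(unencodable_in_windows_1252("Łódź"))
```

An empty result means the text can be saved as a Windows-1252 file without loss; any characters listed would need UTF-8 instead.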

Unicode encodings other than UTF-8 are not supported, nor are files in which the text is double-byte encoded (often used to encode Chinese and Japanese text).
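
The encoding rules above can be approximated with a short Python sketch. This is not how the program itself decides; it is a common heuristic, assumed here for illustration: reject data that starts with a UTF-16 or UTF-32 byte-order mark, accept data that decodes as strict UTF-8, and otherwise treat it as Windows-1252. The function name guess_encoding and its return labels are hypothetical, and the heuristic cannot recognise double-byte encodings that lack a byte-order mark.

```python
import codecs

# Byte-order marks of the Unicode encodings that are not supported.
_UNSUPPORTED_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def guess_encoding(data: bytes) -> str:
    """Classify raw bytes as 'utf-8', 'windows-1252' or 'unsupported'
    (hypothetical labels; a heuristic, not a guarantee)."""
    if data.startswith(_UNSUPPORTED_BOMS):
        return "unsupported"
    try:
        data.decode("utf-8")  # strict decoding; raises on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is single-byte text.
        return "windows-1252"


if __name__ == "__main__":
    print(guess_encoding("café".encode("utf-8")))
    print(guess_encoding("café".encode("cp1252")))
    print(guess_encoding("café".encode("utf-16")))
```

Pure ASCII text is valid in both supported encodings, so the sketch reports it as UTF-8.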

Files (other than docx files) that contain non-displayable characters, such as documents written with Adobe Acrobat, cannot be processed by reading the file directly. For such files, either (a) convert the file to a Windows-1252 encoded text file and apply this software to that file, or (b) open the document, select all the text, copy it to the clipboard, and then use Count words with the clipboard selected as the source. The text on the clipboard is limited to 100,000 characters, so for large files method (a) must be used.
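
The choice between the two methods can be sketched in Python as follows. This assumes the text has already been extracted from the document by some other tool, that the clipboard limit is inclusive (exactly 100,000 characters is accepted), and that the conversion target is Windows-1252, the single-byte encoding named elsewhere on this page. The names CLIPBOARD_LIMIT, fits_on_clipboard and to_windows_1252 are hypothetical.

```python
CLIPBOARD_LIMIT = 100_000  # characters, as stated above


def fits_on_clipboard(text: str) -> bool:
    """True if method (b) is possible; assumes the limit is inclusive."""
    return len(text) <= CLIPBOARD_LIMIT


def to_windows_1252(text: str) -> bytes:
    """Encode extracted text for method (a). Characters that Windows-1252
    lacks are replaced with '?', which is an assumption of this sketch."""
    return text.encode("cp1252", errors="replace")


if __name__ == "__main__":
    extracted = "naïve café " * 20_000  # 220,000 characters
    if fits_on_clipboard(extracted):
        print("Small enough for the clipboard method (b).")
    else:
        # Too large for the clipboard, so write a file for method (a).
        with open("extracted.txt", "wb") as handle:
            handle.write(to_windows_1252(extracted))
        print("Wrote extracted.txt for method (a).")
```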

A file (other than a docx file) that this program can act on must consist only of characters encoded as single-byte values in the range 32 through 255, plus the whitespace characters: linefeeds (byte value 10), carriage returns (13), tab characters (9), backspaces (8) and page breaks (12). As an exception, up to 1% of the bytes (other than zero bytes) are allowed to be "anomalous bytes", that is, bytes with values less than 32 that are not whitespace characters. This exception covers the rare cases in which a large text file contains, for whatever reason, a number of anomalous bytes, which should not prevent the program from treating the file as a text file.
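
The byte rule above can be sketched as a small Python check. It reflects one reading of the rule and is not the program's own code: a zero byte is assumed to disqualify the file outright, the 1% allowance is measured against the total number of bytes, and an empty file is accepted. The names WHITESPACE_BYTES and looks_like_text are hypothetical.

```python
# Byte values the page counts as whitespace:
# backspace, tab, linefeed, page break, carriage return.
WHITESPACE_BYTES = frozenset({8, 9, 10, 12, 13})


def looks_like_text(data: bytes, max_anomalous_percent: int = 1) -> bool:
    """Apply the byte rule to raw file contents (hypothetical helper)."""
    if 0 in data:
        # Assumption: zero bytes are never tolerated.
        return False
    anomalous = sum(
        1 for value in data if value < 32 and value not in WHITESPACE_BYTES
    )
    # Integer arithmetic avoids floating-point rounding at the 1% boundary.
    return anomalous * 100 <= max_anomalous_percent * len(data)


if __name__ == "__main__":
    print(looks_like_text(b"Hello,\r\n\tworld"))       # whitespace only
    print(looks_like_text(b"\x01" + b"a" * 99))        # 1 anomalous byte in 100
    print(looks_like_text(b"\x01\x02" + b"a" * 98))    # 2 anomalous bytes in 100
    print(looks_like_text(b"abc\x00def"))              # contains a zero byte
```

Bytes in the range 128 through 255 always pass, which is why both Windows-1252 and UTF-8 text satisfy the check.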
