Hermetic Word Frequency Counter Advanced Version
Count All Words or Count Specified Words and Phrases
This section of the user manual explains what the "Count words/phrases" button does. The next section (Count All Phrases) explains what the "Count all phrases" button does. If you only ever wish to count all phrases, and never to count all words or to count specified words and phrases, then you can skip this section.
A phrase is a sequence of two or more words (separated by spaces). If all phrases are counted then (in any moderately-sized section of text) there are a huge number of them. Usually the interesting ones are those which occur more than once. To count these see the next section.
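To illustrate why counting all phrases produces so many results, and why the repeated ones are usually the interesting ones, here is a minimal Python sketch. It is not the program's own code: the function names, the splitting on whitespace, and the small max_words value are assumptions made for the example.

```python
from collections import Counter

def count_phrases(text, max_words=3):
    """Count every phrase of 2..max_words consecutive words (hypothetical helper)."""
    words = text.split()  # assumption: words are separated by whitespace only
    counts = Counter()
    for n in range(2, max_words + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

def repeated_phrases(text, max_words=3):
    """Keep only the phrases that occur more than once."""
    return {p: c for p, c in count_phrases(text, max_words).items() if c > 1}

sample = "the cat sat on the mat and the cat sat down"
print(len(count_phrases(sample)))   # 16 distinct phrases from only 11 words
print(repeated_phrases(sample))     # {'the cat': 2, 'cat sat': 2, 'the cat sat': 2}
```

Even this eleven-word sample yields 16 distinct phrases, of which only three occur more than once.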
If you are interested in counting only particular words or phrases then they must be specified. These are called count-only words/phrases. If there are just a few then they can be added to the Extra count-only words/phrases textbox in the Settings window. If there are many words/phrases to be counted then they should be placed in a text file, and that file specified by means of the Count-only words/phrases files button in the Settings window.
In a count-only words/phrases file the words and phrases must either occur on separate lines or — if within a line — be separated by comma+space (as shown below). The words/phrases do not have to be in alphabetical order. For example, the first three lines of such a file might be:
When the count-only file is loaded (from the Settings panel) the words/phrases are displayed in the textbox at the main panel as follows:
The main screen has two buttons for counting words and phrases. If you wish to count all phrases in one or more files then click on the Count all phrases button; what happens then is explained in the next section. If you wish to (a) count all words (not phrases) in one or more files or (b) count only specified words or phrases then click on the Count words/phrases button. This brings up a panel whose content depends on the settings, specifically, on (a) whether there is a specification of words to ignore and (b) whether there is a specification of words and phrases to count. The usual case is where you wish to count all words (not just specified words/phrases) except for those given in a words-to-ignore file (typically containing common words such as 'the'). In this case clicking on the Count words/phrases button brings up this panel:
Clicking on the Count words button will then count all words (except for words to ignore) in the file or files.
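The usual case (count all words except those to be ignored) can be sketched in Python as follows. This only illustrates the idea. The tokenizing rule (letters and apostrophes, lower-cased) and the function name are assumptions, and the program's actual rules depend on its settings.

```python
from collections import Counter
import re

def count_words(text, ignore=()):
    """Count all words in text except those listed in ignore (hypothetical helper)."""
    ignore_set = {w.lower() for w in ignore}
    # Assumption: a word is a run of letters and apostrophes, compared in lower case.
    words = re.findall(r"[A-Za-z']+", text.lower())
    return Counter(w for w in words if w not in ignore_set)

counts = count_words("The cat and the dog. The end.", ignore=["the", "and"])
print(counts)  # each of 'cat', 'dog' and 'end' is counted once
```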
If you have specified a list of words or phrases to be counted (either in a count-only words/phrases file or in the Extra count-only words/phrases textbox in the Settings window) then clicking on the Count words/phrases button brings up this panel:
To count the specified words/phrases click on the left button. If you decide you really want to count all words (not phrases) then click on the right button (this will remove the specification of words/phrases to count in the Settings window).
The results can be displayed in various formats by selecting one from the Format drop-down list. If the file source is a folder then the most detailed is the option word file-list (+freq). A less detailed format which gives the number of files in which a word or phrase occurs is word freq. no.files. If you have not selected this then at this panel you are given an opportunity to do so.
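The information behind a format such as word freq. no.files can be pictured with the following Python sketch, which computes for each word its total frequency and the number of files in which it occurs. The function name, the in-memory dictionary standing in for a folder of files, and the lower-casing are assumptions; the program's actual output layout is not reproduced here.

```python
from collections import Counter

def freq_and_file_count(texts):
    """Map each word to (total frequency, number of files containing it).

    texts is a hypothetical dict of file name -> file content.
    """
    total = Counter()
    files = Counter()
    for name, text in texts.items():
        words = text.lower().split()
        total.update(words)        # every occurrence counts towards the frequency
        files.update(set(words))   # each file counts at most once per word
    return {w: (total[w], files[w]) for w in total}

result = freq_and_file_count({
    "a.txt": "apple pear apple",
    "b.txt": "pear plum",
})
print(result["apple"])  # (2, 1): two occurrences, in one file
print(result["pear"])   # (2, 2): two occurrences, in two files
```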
Other display formats which show the files in which words/phrases occur are described at Display Formats.
If there are count-only words specified then any specification of words to be ignored will be inoperative. Thus words-to-ignore are not ignored if they occur in count-only phrases. For example, if "is" is to be ignored then it will not be ignored in a phrase to be counted such as "service is available". But (as noted previously) when counting phrases, the presence of some words to ignore may make a difference if the Exclude phrases consisting only of words to ignore checkbox in the Count all phrases window is checked.
Whether upper and lower case are distinguished in the results depends on the setting of Upper/lower case significant in the Settings window. For example, if a text file contains "Jack Smith" three times and "Jack smith" twice, and the phrase "Jack Smith" is to be counted, then the result will be as follows: If upper/lower case is not significant then "5 jack smith". If significant then "3 Jack Smith, 2 Jack smith".
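The "Jack Smith" example can be reproduced with a short Python sketch. It assumes, as the example suggests, that the phrase is matched without regard to case and that the setting only decides whether the case variants are reported separately. The function name and the use of a regular expression are illustrative and are not the program's implementation.

```python
from collections import Counter
import re

def count_phrase(text, phrase, case_significant):
    """Count occurrences of phrase, grouping case variants as the setting requires."""
    # Assumption: matching ignores case; only the reporting differs.
    matches = re.findall(re.escape(phrase), text, flags=re.IGNORECASE)
    if case_significant:
        return Counter(matches)                  # keep each variant as written
    return Counter(m.lower() for m in matches)   # merge variants into lower case

text = "Jack Smith, Jack smith, Jack Smith; Jack smith and Jack Smith."
print(count_phrase(text, "Jack Smith", case_significant=False))
# Counter({'jack smith': 5})
print(count_phrase(text, "Jack Smith", case_significant=True))
# Counter({'Jack Smith': 3, 'Jack smith': 2})
```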
If words and phrases to be counted are placed in a count-only words/phrases file then the following conventions should be observed:
- A phrase must be contained within one line in the file (that is, a phrase cannot extend over more than one line).
- A line may contain more than one word or phrase, but if so then they must be separated by a comma plus a space (not just a comma), e.g., calendar, common era, common era calendar.
- A word or phrase to be counted may include a comma (if this is allowed in the Settings window), but may not include a comma+space.
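The conventions above can be sketched as a small Python parser. The function name is hypothetical, and the code only shows how the comma+space rule separates entries while leaving a bare comma inside an entry. It is not the program's own file reader.

```python
def parse_count_only(content):
    """Split the text of a count-only file into its words and phrases."""
    entries = []
    for line in content.splitlines():      # a phrase never spans two lines
        for item in line.split(", "):      # comma+space separates entries
            item = item.strip()
            if item:                       # skip blank lines and empty items
                entries.append(item)
    return entries

content = "calendar, common era, common era calendar\n1,000 years\nepoch"
print(parse_count_only(content))
# ['calendar', 'common era', 'common era calendar', '1,000 years', 'epoch']
```

Because the split is on comma+space rather than on a bare comma, the entry "1,000 years" survives as a single phrase.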
There is a limit of 30 on the maximum number of words in a phrase. A larger maximum number increases processing time, so the maximum should not be set to a significantly greater number than needed.
A count-only words/phrases file may contain text in languages other than English, but the file must be ANSI encoded. At present UTF-8 encoded files cannot be used as count-only files.
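If a list of words and phrases exists only as UTF-8 text, it can be converted before use. The Python sketch below assumes that "ANSI" means the Windows-1252 code page, which is common on Western-language Windows systems but is not stated in this manual. The function name and the default code page are likewise assumptions, and another code page may be needed for other languages.

```python
def utf8_to_ansi(data, codepage="cp1252"):
    """Re-encode UTF-8 bytes in a single-byte Windows code page.

    Raises UnicodeEncodeError if a character has no equivalent in the code page.
    """
    text = data.decode("utf-8-sig")   # also drops a UTF-8 byte-order mark, if any
    return text.encode(codepage)

utf8_bytes = "café, naïve".encode("utf-8")
ansi_bytes = utf8_to_ansi(utf8_bytes)
print(len(utf8_bytes), len(ansi_bytes))  # 13 11: accented letters shrink to one byte each
```

To convert a file, read it in binary mode, pass the bytes to the function, and write the result to a new file in binary mode.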
To count only words and phrases which match a certain pattern see Counting Words and Phrases with Pattern-Matching.
It is possible to switch between multiple files containing words/phrases to be counted, as explained in the section Multiple Words-to-Ignore and Count-Only-Words/Phrases Files.
As an example, here is the result of counting the occurrences of over 1000 words and phrases (in a file of size 23 Kb) in a text file of size 508 Kb, with the default settings at the Settings window, no extra count-only words and no patterns included in the count-only words/phrases file.
The operation took 32 seconds. If there were several extra count-only words then it would take about twice as long. If pattern matching were enabled for the count-only words file (whether or not it included patterns) then it would take very much longer, due to the number of phrases in the count-only words/phrases file.