Hermetic Word Frequency Counter Advanced Version
Counting Words and Phrases with Pattern-Matching

This software allows you to count words and phrases which match specified patterns. There are two ways to specify a pattern: the simple way (using the wildcards ~ and ?) and the less simple way (using regular expressions).


Simple Patterns

Those old enough to remember using the MS-DOS operating system will recall that a set of files could be specified by using * and ? in a filename, e.g., *.txt. A similar pattern-matching convention is implemented in this software for word and phrase counting, the only difference being that ~ rather than * is used to mean any string of characters.

So a~ means any word beginning with a, and ~th means any word ending in th. ? means any single character, so b?t matches bat, bet, bit and but, but it does not match bt or bite. b~t? matches bite, boats, booty and behemoth, but not baton. So if you specify they b?t ~ b~t? as a pattern for phrases to be counted then the count will include all occurrences of they bet their booty, they bit the bullets and so on, but not they bought four bats.

~ matches an empty string only if combined with other characters, as in a~ and a~t. When used alone it means any non-empty string, as in a ~ person. ? must always match some character. So t~ ??? ~ would count all sequences of three words whose first word begins with t and whose second word has exactly three letters.
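The rules above can be sketched in Python by translating a simple pattern into an anchored regular expression. This is only an illustration of the stated semantics, not the program's actual implementation; the function names simple_to_regex and matches are hypothetical, and matching is assumed to be case-sensitive.

```python
import re

def simple_to_regex(pattern: str) -> str:
    """Translate one simple word pattern into an anchored regular expression."""
    if pattern == "~":
        # A lone ~ stands for any non-empty string.
        return "^.+$"
    parts = []
    for ch in pattern:
        if ch == "~":
            parts.append(".*")           # combined with other characters, ~ may be empty
        elif ch == "?":
            parts.append(".")            # ? always matches exactly one character
        else:
            parts.append(re.escape(ch))  # any other character matches itself
    return "^" + "".join(parts) + "$"

def matches(pattern: str, word: str) -> bool:
    """Return True if the whole word matches the simple pattern."""
    return re.fullmatch(simple_to_regex(pattern), word) is not None

# Examples taken from the text above.
for word in ["bat", "bt", "bite"]:
    print("b?t", word, matches("b?t", word))
for word in ["bite", "boats", "booty", "behemoth", "baton"]:
    print("b~t?", word, matches("b~t?", word))
```

Under these assumptions b?t accepts bat but rejects bt and bite, and b~t? rejects only baton from the second list.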

Multiple patterns for phrases to be counted may be included in the Extra count-only words/phrases textbox in the Set Parameters window. They may not be included in files containing count-only words/phrases, but such a file may be used together with patterns for phrases to be counted.

Here are sample results from an English web page and from a German web page for the count-only words with patterns lun~, sol~ (with upper/lower case significant for the German but not for the English):

Pattern-matching can also be used when counting words in multiple files. For example, suppose we wish to find all pairs of words, of which the second is either "lattice" or "lattices", which occur in the 23 HTML files composing the online version of the author's M.Phil. thesis. After specifying the folder which contains these files, we can use the pattern ~ lattice~ to obtain the result at right.

If we wish to find all pairs of words (allowing a word to contain a hyphen) of which the first has exactly eighteen letters and the second has at least four then we can count using the pattern ?????????????????? ????~, to obtain the result at right.


Regular Expressions as Patterns

The previous pattern can also be specified using a regular expression, namely, ^.{18} .{4}.*$. In regular expressions (in contrast to the simple pattern matching method given above) a '.' means any character and a '*' means zero or more repetitions of the preceding element, so (ignoring the '^' and '$') this regular expression means: eighteen characters followed by a space followed by four characters followed by any number of characters.
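As a quick check of this reading, the regular expression can be tried out with Python's re module. This is a sketch only; the helper name phrase_matches and the sample phrases are made up for illustration, and the program's own regular expression engine may differ in details.

```python
import re

# Eighteen characters, a space, four characters, then any number of characters.
PHRASE_RE = re.compile(r"^.{18} .{4}.*$")

def phrase_matches(phrase: str) -> bool:
    """Return True if the two-word phrase matches the regular expression."""
    return PHRASE_RE.match(phrase) is not None

first = "characteristically"              # a word of eighteen letters
print(len(first))
print(phrase_matches(first + " simple"))  # second word has six letters
print(phrase_matches(first + " odd"))     # second word has only three letters
print(phrase_matches("short word"))       # first word is far too short
```

Only the first phrase should match, since the second word must supply at least four characters after the space.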

If regular expressions are not for you then use the simple method of pattern matching given above. The remainder of this section explains how to use regular expressions with this program, which is easy if you know what they are.

A regular expression can be used (in this software) just like a simple pattern such as a~ provided that it is placed between angle brackets (and '^' and '$' are used to ensure counting whole words, not parts of words). So instead of a~ you could use <^a.*$>, and instead of b?t you could use <^b.t$>. Spaces are not permitted within the angle brackets, so if you are counting a phrase then each word in the phrase must be placed within angle brackets. So instead of ?????????????????? ????~ you could use <^.{18}$> <^.{4}.*$>.

Patterns specified via regular expressions are checked for validity and an error message is displayed if one is found to be invalid.

Simple patterns may be mixed with regular expression patterns. For example, if you are counting phrases in this article, and you specify

<^jul.*$> ca~r <.*> ?o

as the pattern for the phrase to be counted, then the result is as shown at right. Of course, the same result could be obtained by using the simple pattern jul~ ca~r ~ ?o, but not all phrase patterns using regular expressions can be written using the simple pattern matching method.
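One way to picture how a mixed phrase pattern is applied is to compile each word of the pattern separately and slide the resulting list along the text. The following Python sketch assumes words are separated by whitespace and that case is ignored; the names token_to_regex and count_phrases and the sample sentence are hypothetical, and the program itself may work differently.

```python
import re

def token_to_regex(token: str) -> re.Pattern:
    """Compile one pattern word: <...> is a regular expression, anything else a simple pattern."""
    if token.startswith("<") and token.endswith(">"):
        return re.compile(token[1:-1], re.IGNORECASE)
    if token == "~":
        return re.compile("^.+$", re.IGNORECASE)   # a lone ~ means any non-empty word
    body = "".join(
        ".*" if c == "~" else "." if c == "?" else re.escape(c) for c in token
    )
    return re.compile("^" + body + "$", re.IGNORECASE)

def count_phrases(pattern: str, text: str) -> int:
    """Count the word sequences in text whose words match the pattern word by word."""
    regexes = [token_to_regex(t) for t in pattern.split()]
    words = text.split()
    n = len(regexes)
    return sum(
        all(rx.search(w) for rx, w in zip(regexes, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )

sample = "the Julian calendar was to be replaced and the Julian calendar had no such rule"
print(count_phrases("<^jul.*$> ca~r <.*> ?o", sample))
```

In the sample sentence both "Julian calendar was to" and "Julian calendar had no" satisfy the four-word pattern.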

An example of this is a regular expression for a valid email address. There are variations on this, but the following regular expression will match almost all email addresses in actual use:

^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$

So using this pattern specification with a file containing, e.g., messages sent to a mailing list, produces a result such as that shown below (with parts of the email addresses redacted).


To obtain email addresses in this way you must (in the 'Set parameters' window) allow 'words' with @-signs, numerals, underscores, hyphens and periods.
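The same idea can be tried outside the program. The Python sketch below uses the regular expression given above and, as an assumption mirroring the parameter setting just described, treats letters, numerals, @-signs, underscores, hyphens, periods, % and + as word characters. The names count_emails and WORD_RE and the sample addresses are made up for illustration.

```python
import re
from collections import Counter

# The regular expression for email addresses given above.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$")

# Assumed definition of a 'word': a run of the characters an address may contain.
WORD_RE = re.compile(r"[A-Za-z0-9._%+@-]+")

def count_emails(text: str) -> Counter:
    """Count each email address found in text, ignoring upper/lower case."""
    counts = Counter()
    for token in WORD_RE.findall(text):
        token = token.strip(".")          # drop sentence-ending periods
        if EMAIL_RE.match(token):
            counts[token.lower()] += 1
    return counts

sample = "Write to alice@example.org or bob.smith@lists.example.com. Again: Alice@example.org"
for address, count in count_emails(sample).most_common():
    print(count, address)
```

Without the @-sign and period among the word characters, each address would be split into several separate words and the regular expression would never match.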
