Hermetic Word Frequency Counter Advanced Version
Creation of an Excel File with a Table of Words/Phrases vs Files
When the source of text to be scanned is a folder rather than a file then this button appears:
After specification of the folder and an output file, clicking on that button displays this:
The relative frequency of a word or phrase is the frequency count of that word or phrase divided by the total number of occurrences of all words or phrases being counted, expressed as a percentage. (For an example see Report Formats).
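The definition above can be illustrated with a small Python sketch. This is not the program's own code; the function name `relative_frequencies` and the sample counts are made up for illustration.

```python
from collections import Counter

def relative_frequencies(counts):
    """Return each count as a percentage of the total of all counts.

    `counts` maps a word or phrase to its frequency count.
    (Hypothetical helper, not part of the Hermetic software.)
    """
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing was counted, so there is nothing to report
    return {word: 100.0 * n / total for word, n in counts.items()}

# Made-up counts for three words (10 occurrences in total).
counts = Counter({"spin": 6, "model": 3, "lattice": 1})
rel = relative_frequencies(counts)
for word, pct in rel.items():
    print(f"{word}: {pct:.1f}%")
```

Because every percentage is taken against the same total, the relative frequencies of all counted words or phrases sum to 100.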
This function will count either (a) all words/phrases in the files in the specified folder (and optionally in all subfolders) or (b) all count-only words/phrases (if specified). When all words are to be counted (not just specified words/phrases) then what is considered to be a countable word should be specified in the 'Settings' window in the usual way (e.g., below).
Suppose that on startup we have chosen to load the supplied words-to-ignore file, and that all the HTML files making up the section of this website on Spin Models are in a folder \spin_models. In the software we set the folder to this and specify an output file. Then (using the default settings in the 'Settings' window) we click on the button above, select 'One table, alphabetical order', and leave the 'Leave cell blank ...' checkbox unchecked. After we click on 'Create Excel file' the program starts to count the words (check 'Don't display words as found' if you don't wish to see the words as they are found). Once the program reports that the output file was created successfully we can run Excel and open the file. Make sure to select 'Text Files' (not 'All Excel Files'), then locate the output file in Windows Explorer and click on it. Excel then displays this panel (the process is very similar in LibreOffice Calc):
Important: If 'File origin:' is not showing 'Windows (ANSI)' then select it. There is no need to click on 'Next'; you can just click on 'Finish'.
When the table appears in Excel you must widen column A to see the words found. The files (in this case 23 of them) are listed, each with an index number. The table then shows the number of occurrences of each word in the file whose index is at the top of the table, as shown below:
If you prefer that cells with a zero count be blank then check the corresponding checkbox before creating the file, in which case you obtain:
Total counts are shown in the right-most column for each word and in the bottom row for each file (in this image columns B-Q have been hidden):
The Excel file can be saved as a normal Excel file by selecting 'Save As'.
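The layout of the table described above (one row per word, one column per indexed file, totals in the right-most column and bottom row, and optionally blank cells for zero counts) can be sketched in Python as follows. This is only an illustration under simplifying assumptions: a 'word' is taken to be a run of letters, case is ignored, and the names `count_words` and `build_table` are hypothetical, not part of the program.

```python
import re
from collections import Counter

def count_words(text):
    # Simplifying assumption: a word is a run of the letters a-z, case ignored.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def build_table(files, blank_zero=False):
    """Build a words-vs-files table as a list of rows.

    `files` maps a file name to its text. Files are numbered 1, 2, ...
    in sorted name order; words are listed alphabetically.
    """
    names = sorted(files)
    per_file = [count_words(files[name]) for name in names]
    words = sorted(set().union(*per_file))

    rows = [["word"] + list(range(1, len(names) + 1)) + ["total"]]
    for word in words:
        cells = [c[word] for c in per_file]  # a Counter returns 0 if absent
        shown = ["" if (blank_zero and n == 0) else n for n in cells]
        rows.append([word] + shown + [sum(cells)])

    file_totals = [sum(c.values()) for c in per_file]
    rows.append(["total"] + file_totals + [sum(file_totals)])
    return rows

# Two tiny made-up 'files' standing in for the contents of a folder.
files = {"a.html": "spin model spin", "b.html": "model lattice"}
for row in build_table(files, blank_zero=True):
    print("\t".join(str(cell) for cell in row))
```

With `blank_zero=True` the zero cells are emitted as empty strings, which corresponds to checking the 'Leave cell blank ...' checkbox; the totals are unaffected.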
This program also works with European languages other than English, e.g., German:
We can now recreate the table for the 'Spin models' files with words sorted by frequency. (The Excel file should first be closed.) We then obtain (with columns F-U hidden):
If we don't wish to count all words, but just specified phrases, then we can do that too. For example, suppose we just want to know which files contain any of the phrases 'spin model', 'critical temperature' or 'magnetization'. We can specify these in the 'Settings' window:
Then when we create the Excel file we obtain (with columns F-L hidden):
In this case there are only 17 files shown because six of the files which were scanned do not contain any of the specified phrases. Although 'spin' occurs 370 times in the 23 files, and 'model' occurs 307 times, they occur together as 'spin model' only 47 times.
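Two points made above are that a phrase is counted only where its words occur together, and that files containing none of the specified phrases are left out of the table. A minimal Python sketch of both follows. It assumes case-insensitive matching and whitespace between the words of a phrase; the function names and sample texts are hypothetical and do not reproduce the counts reported above.

```python
import re

def count_phrase(text, phrase):
    """Count occurrences of `phrase` (one or more words) in `text`.

    Assumptions for this sketch: matching ignores case, and the words
    of a phrase are separated by whitespace only.
    """
    words = phrase.lower().split()
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return len(re.findall(pattern, text.lower()))

def files_with_any(files, phrases):
    """Return the names of files containing at least one of `phrases`."""
    return [
        name for name in sorted(files)
        if any(count_phrase(files[name], p) > 0 for p in phrases)
    ]

text = ("The spin model is a model of spin. "
        "Each spin model has a critical temperature.")
files = {"a.html": text, "b.html": "Nothing relevant here."}
phrases = ["spin model", "critical temperature", "magnetization"]

print(count_phrase(text, "spin"), count_phrase(text, "model"))
print(count_phrase(text, "spin model"))
print(files_with_any(files, phrases))
```

In the sample text 'spin' and 'model' each occur more often than the phrase 'spin model', and only the first file would appear in the resulting table.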
The results shown above were obtained with the checkboxes (in the 'Settings' window) shown at left unchecked (for text in German the 'Upper/lowercase' checkbox should be checked). If either of these is checked before creating the Excel file then the results will be different.
Count-only words and phrases can also be specified using pattern-matching (either simple patterns or regular expressions). For example, if we wish to find all 5-word phrases whose first word is 'the', whose third word begins with 'c' or 't', and whose fourth word is 'is' (common words are not ignored) then (allowing 'words' to contain numerals, periods and forward slashes) we obtain the following result:
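The kind of pattern described above can be approximated with a Python regular expression. This is a sketch using Python's `re` syntax, not the pattern syntax accepted by the program, and it assumes that a 'word' is a run of letters, numerals, periods and forward slashes and that case is ignored.

```python
import re

# One 'word' character: letter, numeral, period or forward slash
# (an assumption made for this sketch).
CH = r"[A-Za-z0-9./]"

# Five words: 'the', any word, a word starting with c or t, 'is', any word.
# The lookbehind stops 'the' from matching at the end of a longer word.
PATTERN = re.compile(
    rf"(?<!{CH})the\s+{CH}+\s+[ct]{CH}*\s+is\s+{CH}+",
    re.IGNORECASE,
)

# Made-up sample text.
sample = ("We find that the critical temperature is 2.269 in units "
          "where the coupling constant is one but the spin model is simple")

matches = PATTERN.findall(sample)
for phrase in matches:
    print(phrase)
```

The phrase 'the spin model is simple' is not reported because its third word begins with 'm'. Since periods count as word characters here, a sentence-ending period directly after the fifth word would be included in the match.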