Hermetic MultiFile Search
The Index Module

The index module of Hermetic MultiFile Search is for indexing files on CD, hard disk, etc. (not on an online website), so as to provide the data needed by the search module.


Which files can be indexed (and searched)?

ANSI is the single-byte text encoding which is the default encoding on your PC. UTF-8 is a variable-byte-length encoding of Unicode characters, often used in HTML and XML files.
Indexable (and thus searchable) files are MS Word DOCX files (but not Word DOC files) and text or text-like files: ordinary text files, HTML files, XML files and, in general, any non-binary file, that is, any file which consists only of ANSI text characters plus whitespace. This allows text which includes non-English letters such as ä, é and ñ. The text-like files must consist entirely of text encoded via the 8-bit Windows-1252 encoding (a superset of ISO 8859-1). Languages which can be 8-bit encoded using Windows-1252 include English, German, French, Danish, Italian, Norwegian, Portuguese, Spanish, Swedish and Finnish.
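
If you are unsure whether some text fits within Windows-1252, a quick check outside the program may help. The following is only a sketch in Python, not part of Hermetic MultiFile Search; the function name fits_windows_1252 is hypothetical, and "cp1252" is Python's name for the Windows-1252 codec.

```python
def fits_windows_1252(text: str) -> bool:
    """Return True if every character in text has a Windows-1252 byte."""
    try:
        text.encode("cp1252")  # Python's codec name for Windows-1252
        return True
    except UnicodeEncodeError:
        # At least one character has no single-byte Windows-1252 form.
        return False


print(fits_windows_1252("ä, é and ñ"))  # Western European letters
print(fits_windows_1252("Привет"))      # Cyrillic text
```

Text for which the check fails would need a different encoding and so falls outside what this section describes as indexable.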

Some text files are automatically excluded if they have file extensions such as css and js (a complete list of excluded file extensions is given below).

This program regards a word as any consecutive sequence of letters. Thus a word may not include an apostrophe, a hyphen or a numeral. Non-English characters such as ü and é are allowed in words, so this software can be used with text in most European languages such as German, Spanish, etc.
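
The effect of this definition can be approximated as follows. This is a sketch in Python under the assumption that "letter" corresponds roughly to the regular-expression class used here; it is not the program's own code, and the names LETTER_RUN and split_words are hypothetical.

```python
import re

# [^\W\d_] matches a "word character" that is neither a digit nor an
# underscore, which approximates "a letter" (including ü, é and so on).
LETTER_RUN = re.compile(r"[^\W\d_]+")


def split_words(text: str) -> list[str]:
    """Return every maximal run of consecutive letters in text."""
    return LETTER_RUN.findall(text)


print(split_words("don't re-use file2 für café"))
# ['don', 't', 're', 'use', 'file', 'für', 'café']
```

Note how the apostrophe, the hyphen and the numeral each end a word, so "don't" yields two words and "file2" yields only "file".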


Inclusion of subfolders

On the first run the index module appears thus:

A top folder must be specified, which is the folder containing the files to be indexed. If you wish to include files in subfolders of the top folder in the indexing then click on the 'Subfolders' button and mark the subfolders you wish to include. If you wish to index files in all subfolders (and sub-subfolders, etc.) of the included folders then check the 'Include subfolders of subfolders' checkbox.
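
The folder selection described above can be pictured with the following sketch in Python. It reflects one reading of the options (files directly in the top folder, files in the marked subfolders, and optionally everything beneath those subfolders); the function names and parameters are hypothetical and do not come from the program.

```python
import os


def _files_in(folder):
    """Yield the paths of files directly inside folder (no recursion)."""
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.path


def candidate_files(top, included_subfolders=(), include_nested=False):
    """Yield files in the top folder and in the marked subfolders.

    include_nested plays the role of the 'Include subfolders of
    subfolders' checkbox (an assumption about its behaviour).
    """
    yield from _files_in(top)
    for name in included_subfolders:
        folder = os.path.join(top, name)
        if include_nested:
            # Walk the marked subfolder and everything beneath it.
            for current, _dirs, files in os.walk(folder):
                for filename in files:
                    yield os.path.join(current, filename)
        else:
            yield from _files_in(folder)
```

For example, candidate_files(top, ["docs"], include_nested=True) would yield the files in the top folder, in docs, and in every folder beneath docs, but not the files in unmarked subfolders.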

For an example of how to index a set of files see Example of Use.


Use of the index and data files with the search module

Both the index file and the data file are used by the search module. However only the index file must be found by the search module, because if the data file is missing then the search module will recreate it. So you can either keep the data file or delete it, depending on whether or not you have a lot of unused disk space. Keeping it means that the search module does not have to recreate it, which saves a bit of time, but not a lot. If you keep the data file then it must be in the same folder as the index file (which is where the search module will look for it).


Displaying files to be indexed

After you have specified the top folder, etc., and are ready to create the index file, it is advisable to view the files to be indexed before doing so. Click on the 'List files' button to see which files will be indexed.

If you see any files that you don't want indexed, exclude them explicitly as described in the next section.

It is advisable to exclude files which are not necessary but which might contain many words which do not occur in other files, since the indexing time and the size of the index and data files increase with the number of words found.

In particular, a single file in a language different from the other files should be excluded (unless required) because this will greatly increase the number of words to be indexed.


Excluded files

All files in the top folder (unless those files are excluded) and in all included subfolders (and optionally in all subfolders of those subfolders) will be indexed except for the files which (a) are specifically excluded, (b) are non-text files, (c) are automatically excluded or (d) are smaller than a specified size.

As noted above, the ability to list the files which will be indexed is useful because it allows you to identify any files which you wish to exclude from indexing; you can then exclude those files explicitly.

To do this click on the 'Exclude files' button and select the file, or multiple files (as illustrated here). That file (or those files) will then be added to the textbox which lists the excluded files (unless the selected file is not in an included folder).

The 'Excluded files' textbox is editable. Thus if you wish to remove a file which you have selected for exclusion then you can delete it from the textbox; just be sure that the excluded files are one to a line (blank lines are ignored).

The index module scans only text files, that is, files which contain only bytes with values in the range 32 through 255 plus whitespace bytes (those with values 13, 10, 9, 12 and 8, which are the byte values for carriage return, linefeed, tab, formfeed and backspace respectively; the space character, value 32, already lies within the range). Thus any binary file, such as a graphics file, is automatically excluded from indexing, and so does not have to be excluded explicitly.
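
The byte rule above can be expressed as a short test. This is a sketch in Python of the rule as stated, not the program's actual code; the names WHITESPACE_BYTES and looks_like_text are hypothetical.

```python
# Carriage return, linefeed, tab, formfeed and backspace.
WHITESPACE_BYTES = {13, 10, 9, 12, 8}


def looks_like_text(data: bytes) -> bool:
    """Return True if every byte is 32..255 or one of the whitespace bytes."""
    # A bytes value never exceeds 255, so only the lower bound needs checking.
    return all(b >= 32 or b in WHITESPACE_BYTES for b in data)


print(looks_like_text(b"caf\xe9 au lait\r\n"))  # True: text plus CR and LF
print(looks_like_text(b"GIF89a\x00\x01"))       # False: contains bytes 0 and 1
```

A single byte outside the permitted values is enough to classify a file as binary under this rule.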

Files with the following file extensions are automatically excluded (even though some of them are text files); no file with any of these file extensions will be indexed:

bas bat bin bmp cgi com css dbf dll
doc drv exe fmt frm frx fx gdoc gif
ico idx ini jpe jpg js kbd mdb ocx
ovl pdf png prg rar rsc scr swp sys
tiff tlb usri usrs vbp vbs vbw vbx
xls zip htaccess
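
A check against this list might look like the following sketch in Python. The extensions are copied from the list above; treating the comparison as case-insensitive, and treating a file named .htaccess as having the extension htaccess, are assumptions, and the function name is hypothetical.

```python
AUTO_EXCLUDED = set("""
bas bat bin bmp cgi com css dbf dll
doc drv exe fmt frm frx fx gdoc gif
ico idx ini jpe jpg js kbd mdb ocx
ovl pdf png prg rar rsc scr swp sys
tiff tlb usri usrs vbp vbs vbw vbx
xls zip htaccess
""".split())


def is_auto_excluded(filename: str) -> bool:
    """Return True if the text after the last dot is an excluded extension."""
    _, dot, ext = filename.rpartition(".")
    # No dot at all means no extension, so the file is not auto-excluded.
    return bool(dot) and ext.lower() in AUTO_EXCLUDED


print(is_auto_excluded("style.CSS"))    # True (assuming case is ignored)
print(is_auto_excluded("report.docx"))  # False
print(is_auto_excluded(".htaccess"))    # True
```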

HTML files of less than 1000 bytes usually don't contain any text worth indexing, but rather are redirection files, files with a "This page has moved" notice, etc. Inclusion of such files, if there are many, would significantly increase indexing time, and would possibly lead to irrelevant results when the search module is used, so it is useful to be able to exclude files smaller than a certain minimum filesize. You can also exclude files larger than a certain maximum filesize, in case there are some excessively large files in the folder(s).
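
The size limits amount to a simple range test, sketched here in Python. The function and parameter names are hypothetical, and it is an assumption that a file exactly at a limit is kept.

```python
def within_size_limits(size, min_bytes=None, max_bytes=None):
    """Return True if size (in bytes) passes the optional limits."""
    if min_bytes is not None and size < min_bytes:
        return False  # smaller than the minimum filesize
    if max_bytes is not None and size > max_bytes:
        return False  # larger than the maximum filesize
    return True


print(within_size_limits(640, min_bytes=1000))   # False: likely a redirection page
print(within_size_limits(4200, min_bytes=1000))  # True
```

In practice the size would come from the file system, for example via os.path.getsize in Python.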

You can exclude files in the top folder while including files in subfolders by checking the 'Exclude files in the top folder' checkbox.


Excluded filepaths

A file is normally identified by a filepath, which is the name of the file preceded by the name of the folder containing it, preceded by the name of the folder containing that folder, and so on, up to the root folder. For example, if folder aaa is in the root folder of Drive C:, and contains a subfolder bbb which contains a subfolder ccc which contains a file file.txt then the filepath for that file is C:\aaa\bbb\ccc\file.txt.
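
The same filepath can be assembled from its parts, as in this small Python sketch. It uses the standard pathlib module purely for illustration and has no connection to the program itself.

```python
from pathlib import PureWindowsPath

# Build the path from the drive root down to the file.
path = PureWindowsPath("C:/", "aaa", "bbb", "ccc", "file.txt")

print(path)         # C:\aaa\bbb\ccc\file.txt
print(path.parent)  # C:\aaa\bbb\ccc
print(path.name)    # file.txt
```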

The index module allows exclusion of files whose filepaths contain a given character string, which may be a substring either of a folder name in the filepath or a substring of the file name. This allows you to exclude subfolders of subfolders of the top folder and to exclude files with a certain extension (if it is not already among the file extensions which are automatically excluded). After examining (in Windows Explorer) the hierarchy of local files for a website, or (in this software) the list of files to be indexed, you might, e.g., exclude filepaths as below:

In this case all files in all subfolders (regardless of where they occur) whose filepaths include bak, unused, \no_index or 1990s\ would be excluded from indexing, and no files whose filenames end in bak, xml, txt or _c.htm would be indexed. Character strings must be separated by commas, may contain spaces, and are not case-sensitive.
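
The matching rule (comma-separated strings, compared without regard to case, matched anywhere in the filepath) can be sketched in Python as follows. The function names are hypothetical, and trimming the spaces around each string is an assumption.

```python
def parse_exclusions(field: str) -> list[str]:
    """Split the comma-separated textbox contents into lowercase strings."""
    return [part.strip().lower() for part in field.split(",") if part.strip()]


def is_path_excluded(filepath: str, exclusions: list[str]) -> bool:
    """Return True if any exclusion string occurs anywhere in filepath."""
    lowered = filepath.lower()
    return any(fragment in lowered for fragment in exclusions)


# Backslashes are doubled because this is Python source code.
rules = parse_exclusions("bak, unused, \\no_index, 1990s\\")

print(is_path_excluded("C:\\site\\1990s\\page.htm", rules))   # True
print(is_path_excluded("C:\\site\\Unused\\page.htm", rules))  # True
print(is_path_excluded("C:\\site\\index.htm", rules))         # False
```

Including the backslash in a string such as 1990s\ limits the match to a folder of that name, so a file named 1990s.htm would not be excluded by it.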

If you have too many character strings to be visible all at once in the textbox then you can click on the 'Expand' button and a window will open displaying them one per line. You can edit this textbox by adding or deleting character strings, and then cancel or confirm. If you confirm then the strings will reappear in alphabetical order in the 'Exclude filepaths' textbox.

This facility may be used in conjunction with the file listing facility (see above). The file list may be inspected for files which you wish to exclude, then some identifying part of the filepaths (such as a subfolder name) may be added to the 'Exclude filepaths' textbox or to the 'Filepath exclusions' textbox.
