Index Files Search Words
The Index Module
 

The index module of Index Files Search Words is for indexing HTML or text files in a hierarchy of files on CD, hard disk, etc. (not on an online website), so as to provide the data needed by the search module.


Which files can be indexed (and searched)?

Indexable (and thus searchable) files are text and text-like files, which may be ordinary text files, HTML files, XML files or in general any non-binary file, that is, any file which consists only of text characters plus whitespace. This excludes files produced by MS-Word and Adobe Acrobat, but allows text which includes non-English letters such as ä, é and ñ. The software reads 16-bit Unicode text files as well as 8-bit ASCII text files. Some text files are automatically excluded if they have file extensions such as css and js (a complete list of excluded file extensions is given below). The software does not allow words to include apostrophes (as occur in French text), numerals or other non-text characters such as colons.
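The restriction on word characters can be pictured with a short Python sketch. This is not the software's own code: the function name split_words is hypothetical, and it assumes that every non-letter character (apostrophe, numeral, colon, etc.) simply ends the current word.

```python
def split_words(text: str) -> list[str]:
    """Split text into words consisting only of letters (hypothetical helper).

    Assumption: any character that is not a letter acts as a separator.
    """
    words = []
    current = []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            # A non-letter closes the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

Under this assumption the French phrase "l'été" yields the two words "l" and "été", and a numeral such as "2" yields no word at all.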


Inclusion of subfolders

On the first run the index module appears thus:

A top folder must be specified, which is the folder containing the files to be indexed. If you wish to include files in subfolders of the top folder in the indexing then click on the 'Subfolders' button and mark the subfolders you wish to include. If you wish to index files in all subfolders (and sub-subfolders, etc.) of the included folders then check the 'Include subfolders of subfolders' checkbox.
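The effect of these choices can be sketched in Python. This is only an illustration, not the program's actual logic: the function collect_files and its parameters are hypothetical, and it assumes that the 'Include subfolders of subfolders' option applies to the marked subfolders and not to the top folder itself.

```python
import os

def collect_files(top, subfolders=(), recurse=False):
    """List files in `top` and in the marked subfolders (hypothetical helper).

    subfolders: names of the marked subfolders of `top`.
    recurse:    assumed equivalent of 'Include subfolders of subfolders'.
    """
    found = []
    folders = [top] + [os.path.join(top, name) for name in subfolders]
    for folder in folders:
        if recurse and folder != top:
            # Descend into sub-subfolders of a marked subfolder.
            for dirpath, _dirnames, filenames in os.walk(folder):
                found.extend(os.path.join(dirpath, f) for f in filenames)
        else:
            # Only the files directly inside this folder.
            found.extend(e.path for e in os.scandir(folder) if e.is_file())
    return sorted(found)
```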


Example of use

An example will now be given to demonstrate a simple case: the indexing of a set of files in a folder with no subfolders.

This example uses the HTML files which make up the author's M.Phil. thesis on spin models. If you wish to follow this example in a hands-on manner first install the trial version of the software. Download the file spin_models.zip (370 KB) and unzip the files into some folder on your hard disk, say, spin_models.

Run the software and call up the index module. First specify a project name, say "spin_models". The project name is used in the names of the two files (the index file and the data file) which are generated, so certain characters are not permitted, including spaces (a typed space is automatically converted to an underscore). Project names cannot be longer than 17 characters.
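The two rules stated here (spaces become underscores, at most 17 characters) can be sketched as follows. The function names are hypothetical, the other forbidden characters are not listed in this page and so are not checked, and the assumption that each output filename is exactly the project name followed by _index.ifsw or _data.ifsw is a guess based on the file names described later in this example.

```python
MAX_PROJECT_NAME_LENGTH = 17  # limit stated in the text

def normalize_project_name(typed: str) -> str:
    """Apply the two documented rules to a typed project name."""
    name = typed.replace(" ", "_")  # a typed space becomes an underscore
    if len(name) > MAX_PROJECT_NAME_LENGTH:
        raise ValueError("project name is longer than 17 characters")
    return name

def output_file_names(project: str) -> tuple[str, str]:
    """Return (index file, data file) names; the naming scheme is assumed."""
    return project + "_index.ifsw", project + "_data.ifsw"
```

With the project name used in this example, the sketch would give spin_models_index.ifsw and spin_models_data.ifsw.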

Now specify a title for the set of files to be indexed, in this example, "Spin Models".

Now click on the 'Top folder' button to specify the top folder where the files to be indexed (in this case the files unzipped from spin_models.zip) may be found, say, spin_models.

At any time you can click on the 'Settings' button, with selection 'Save for next run', to save the state of the program so that it is loaded when the program is started again.

The folder spin_models has (or should have) no subfolders. All files in this folder will be indexed unless excluded. Exclusion of files is explained in detail below in the section Files to be excluded. In this example we will explicitly exclude three files, app3.htm, app4.htm and app5.htm, since they are of little interest. To do this, click on the 'Exclude files' button, and select app3.htm. Then hold down the ctrl key and select app4.htm and app5.htm. Then click on the 'Open' button. The result should be:

Finally the program needs to know where to write the output files. Click on the 'Output folder' button and specify some folder, say, ifsw_files. (If this folder does not already exist then you must create it using Windows Explorer.)

That's all the setup required in this example. As noted above, you can save the settings by clicking on the 'Settings' button and selecting 'Save for next run' from the drop-down menu.

Click on the 'List files' button to see the files which will be indexed.

Now click on the 'Create index' button to begin the extraction of the data. The program informs us that there are 19 files to be scanned, and after the operation is complete the screen should look like this:

Two files are produced (in this example, in the \ifsw_files folder), with names ending in _index.ifsw and _data.ifsw. In this example the index file is 30 KB in size and the data file is 43 KB. To use these files to search for words in the indexed files go to the continuation of this example in the page on the search module.

Here is the result of indexing a larger set of files (performed on a 64-bit PC running Windows 7):


Use of the index and data files with the search module

Both the index file and the data file are used by the search module. However only the index file must be found by the search module, because if the data file is missing then the search module will recreate it. So you can either keep the data file or delete it, depending on whether or not you have a lot of unused disk space. Keeping it means that the search module does not have to recreate it, which saves a bit of time, but not a lot. If you keep the data file then it must be in the same folder as the index file (which is where the search module will look for it).


Displaying files to be indexed

After you have specified the top folder, etc., and are ready to create the index file, it is advisable to view the files to be indexed before doing so. Click on the 'List files' button to see which files will be indexed.

If you see any files that you don't want indexed, exclude them explicitly as described in the next section.

It is advisable to exclude files which are not necessary but which might contain many words which do not occur in other files, since the indexing time and the sizes of the index and data files increase with the number of words found.


Excluded files

All files in the top folder (unless files in the top folder are excluded) and in all included subfolders (and optionally in all subfolders of those subfolders) will be indexed except for the files which (a) are specifically excluded, (b) are non-text files, (c) are automatically excluded, (d) are smaller than a specified size or (e) contain special letters (optionally; see Exclusion of files and words with special letters below).

As noted above, the ability to list the files which will be indexed is useful because it allows you to identify any files which you wish to exclude from indexing; you can then exclude those files explicitly.

To do this click on the 'Exclude files' button and select the file, or multiple files (as illustrated in the example above). That file (or those files) will then be added to the textbox which lists the excluded files (unless the selected file is not in an included folder).

The 'Excluded files' textbox is editable. Thus if you wish to remove a file which you have selected for exclusion then you can delete it from the textbox; just be sure that the excluded files are one to a line (blank lines are ignored).

The index module scans only text files, that is, files which contain only bytes with byte values in the range 32 through 255, plus white space (that is, bytes with values 32, 13, 10, 9, 12 and 8, which are the byte values for space, carriage return, linefeed, tab, formfeed and backspace respectively). Thus any binary file, such as a graphics file, is automatically excluded from indexing, and so does not have to be excluded explicitly.
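The byte test described above can be written as a minimal Python sketch. The function name is_text_data is hypothetical, and the sketch covers only 8-bit files: a 16-bit Unicode text file contains bytes below 32 (zero bytes, for instance), and how the software recognizes such files is not described here.

```python
# Byte values the text lists as white space:
# space, carriage return, linefeed, tab, formfeed, backspace.
WHITESPACE_BYTES = {32, 13, 10, 9, 12, 8}

def is_text_data(data: bytes) -> bool:
    """Return True if every byte is in 32..255 or is a listed whitespace byte."""
    return all(b >= 32 or b in WHITESPACE_BYTES for b in data)
```

For example, is_text_data(b"Hello,\r\n\tworld") is True, while data beginning b"GIF89a\x00\x01" fails the test because of the bytes 0 and 1.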

Files with the following file extensions are automatically excluded (even though some of them are text files); no file with any of these file extensions will be indexed:

bas  bat  bin  bmp  cgi  com  css  dbf     
dll  drv  exe  fmt  frm  frx  fx   gif     
ico  idx  ifsw ini  jpe  jpg  js   kbd     
mdb  ocx  ovl  pdf  png  prg  rar  rsc     
scr  swp  sys  tiff tlb  usri usrs vbp     
vbs  vbw  vbx  vxd  xls  zip  htaccess     

MS-Word .doc files will be excluded because they are not text files, but it is possible that .doc files produced by some other application might be text files. Thus files with extension .doc are not automatically excluded, nor are files with extension .php.
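A check against the list of automatically excluded extensions might look like the Python sketch below. The function name is_auto_excluded is hypothetical, and treating the comparison as case-insensitive is an assumption (reasonable on Windows, but not stated in this page).

```python
import os

# The automatically excluded extensions, copied from the list above.
AUTO_EXCLUDED = set("""
bas bat bin bmp cgi com css dbf
dll drv exe fmt frm frx fx gif
ico idx ifsw ini jpe jpg js kbd
mdb ocx ovl pdf png prg rar rsc
scr swp sys tiff tlb usri usrs vbp
vbs vbw vbx vxd xls zip htaccess
""".split())

def is_auto_excluded(filename: str) -> bool:
    """Return True if the text after the last dot is in the excluded list."""
    name = os.path.basename(filename)
    _, dot, ext = name.rpartition(".")
    # A name with no dot at all has no extension and is not excluded here.
    return bool(dot) and ext.lower() in AUTO_EXCLUDED
```

Under these assumptions style.css and .htaccess are excluded, while thesis.doc and index.php are not.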

HTML files of less than 1000 bytes usually don't contain any text worth indexing, but rather are redirection files, files with a "This page has moved" notice, etc. Inclusion of such files, if there are many, would significantly increase indexing time, and would possibly lead to irrelevant results when the search module is used, so it is useful to be able to exclude files smaller than a certain minimum filesize.

As noted above, you can exclude files in the top folder while including files in subfolders by checking the 'Exclude files in the top folder' checkbox.


Excluded filepaths

A file is normally identified by a filepath, which is the name of the file preceded by the name of the folder containing it, preceded by the name of the folder containing that folder, and so on, up to the root folder. For example, if folder aaa is in the root folder of Drive C:, and contains a subfolder bbb which contains a subfolder ccc which contains a file file.txt then the filepath for that file is C:\aaa\bbb\ccc\file.txt.

The index module allows exclusion of files whose filepaths contain a given character string, which may be a substring either of a folder name in the filepath or a substring of the file name. This allows you to exclude subfolders of subfolders of the top folder and to exclude files with a certain extension (if it is not already among the file extensions which are automatically excluded). After examining (in Windows Explorer) the hierarchy of local files for a website, or (in this software) the list of files to be indexed, you might, e.g., exclude filepaths as below:

In this case all files in all subfolders (regardless of where they occur) whose filepaths include bak, unused, \no_index or 1990s\ would be excluded from indexing, and no files whose filenames end in .bak, .xml, .txt or _c.htm would be indexed. Character strings must be separated by commas, may contain spaces, and are not case-sensitive.
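The matching rule (comma-separated strings, case-insensitive, matched anywhere in the filepath) can be sketched in Python. The function names are hypothetical, the exclusion strings in the usage note are illustrative only, and trimming of spaces around each comma-separated string is an assumption.

```python
def parse_exclusions(textbox: str) -> list[str]:
    """Split the comma-separated textbox content into lowercase strings.

    Assumption: spaces around each string are trimmed; spaces inside
    a string are kept.
    """
    parts = (part.strip().lower() for part in textbox.split(","))
    return [p for p in parts if p]

def is_path_excluded(filepath: str, exclusions: list[str]) -> bool:
    """Return True if any exclusion string occurs anywhere in the filepath."""
    lowered = filepath.lower()
    return any(s in lowered for s in exclusions)
```

With exclusions parsed from the string unused, \no_index, 1990s\, _c.htm the path C:\site\Unused\a.htm is excluded, whereas C:\site\1990s_review.htm is not, because 1990s is not followed there by a backslash. Note that a simple substring test like this one matches a string such as .txt anywhere in the filepath, not only at the end of the filename.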

If you have too many character strings to be visible all at once in the textbox then you can click on the 'Expand' button and a window will open displaying them one per line. You can edit this textbox by adding or deleting character strings, and then cancel or confirm. If you confirm then the strings will reappear in alphabetical order in the 'Exclude filepaths' textbox.

This facility may be used in conjunction with the file listing facility (see above). The file list may be inspected for files which you wish to exclude, then some identifying part of the filepaths (such as a subfolder name) may be added to the 'Exclude filepaths' textbox or to the 'Filepath exclusions' textbox.


Exclusion of files and words with special letters

A letter is special if it is not part of the alphabet used for English text. For example, the following letters are special:

ä ê ñ ø ß ù

All special letters have byte values of 192 or higher in the 8-bit extended-ASCII (ANSI/Latin-1) character set.

Approximately 1% of the letters in Spanish text are special, as are approximately 0.5% of the letters in German text.

If the hierarchy of files to be scanned includes files with non-English text then you may wish to exclude these if the index module is meant only for searching English text. This can be done by checking the 'Exclude files and words with special letters' checkbox.

When this is checked a file will be excluded if more than 0.1% of the letters are special letters and the file contains at least three different special letters. Thus a file will not be excluded if it just contains a few occurrences of special letters (such as sometimes occurs in English text, e.g., resumé and naïve, or when a foreign name is used, e.g., Bülow).
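The two-part rule can be expressed as a short Python sketch. The function name should_exclude is hypothetical; the sketch assumes the text has already been decoded into characters, counts a letter as special when its code is 192 or higher, and treats upper- and lower-case forms (such as Ä and ä) as different letters, which may not match the software's behaviour.

```python
def should_exclude(text: str) -> bool:
    """Apply the documented rule: more than 0.1% special letters
    AND at least three different special letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    specials = [ch for ch in letters if ord(ch) >= 192]
    ratio = len(specials) / len(letters)
    return ratio > 0.001 and len(set(specials)) >= 3
```

A long English text with a single occurrence of naïve fails both conditions and is kept, whereas a German phrase such as "Bülow größer Mädchen" satisfies both and would be excluded.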

This is useful because if files containing non-English text are included then many non-English words will be found. This can increase the number of words found by as much as 40% and result in a major increase in indexing time as well as in the sizes of the index and data files.

This software detects special letters not only in text files but also in HTML files, since it also looks for the HTML entities used to code special letters, e.g., &uuml; and &ntilde;.
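One way to find special letters that are coded as HTML entities is to decode the entities first and then inspect the resulting characters. The Python sketch below does this with the standard library function html.unescape; the function name special_letters_in_html is hypothetical, and the software's own method of recognizing entities is not described in this page.

```python
import html

def special_letters_in_html(markup: str) -> set[str]:
    """Return the distinct special letters found after decoding entities.

    Tags are not stripped; tag names consist of English letters,
    so they do not affect the result.
    """
    text = html.unescape(markup)  # e.g. "&uuml;" becomes "ü"
    return {ch for ch in text if ch.isalpha() and ord(ch) >= 192}
```

For the input "M&uuml;nchen, Espa&ntilde;a" the sketch reports the letters ü and ñ, just as it would for the same words written with the letters themselves.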

If this checkbox is checked then the indexing time will be greater than otherwise. If there are only a few files with non-English text then it is better to exclude them explicitly and leave this checkbox unchecked.
