The Indexing Service on Windows 2000 allows us to create a search engine for our site. Documentation on it, though, is scarce and scattered. There’s plenty out there on how to use it with IDQ/HTX templates, but as I see it, there are three fundamental problems with those:
- they are hard to use as they require you to learn a separate language syntax.
- they don’t allow you to use standard ASP code, as they pass through a special DLL filter and not the ASP.DLL. For example, you cannot use include files.
- they are old technology.
In this article, I will take you through what is necessary to get it working on your site. I will go through:
- installing the service,
- pointing it to your web site,
- tuning it for speed and efficiency,
- the ASP code which makes it work,
- and what to do if it fails.
Version 3 of the Indexing Service is the only one that will run on Windows 2000 at this time (April 19, 2002). It is installed by default. If you did not install the Indexing Service during Windows setup (or are not sure and want to check), here’s how to do it:
Start > Settings > Control Panel > Add/Remove Programs > Add/Remove Windows Components
If you see Indexing Service checked, then it’s already installed. Let’s assume it’s not installed yet. Check the Indexing Service and then press Next. Windows will install the necessary files and when it’s done, you need to click on Finish. You might be asked for the Windows 2000 Server installation CD to copy the necessary files.
The service should now be installed. To check on the service, go to:
Start > Programs > Administrative Tools > Computer Management
Then click on:
Services and Applications > Indexing Service
You should be able to expand the service and see the 2 Catalogs which were created by default: a System catalog and a Web catalog (this one only if it finds an instance of Internet Information Server (IIS) already running on the server).
Creating a new Catalog
A catalog is like a database where the service stores all the information after it is done indexing your files. My recommendation is to erase both catalogs which are created by default. The Web catalog points to C:\Inetpub (a potential security hole), and the System catalog is only useful if you are going to do local searches on the server. If you are using the service only through the internet, then it’s safe to erase both of them.
So, let’s say you want to use the service to search against your web site. First, you have to create a web site through the IIS console. I will assume that you know how to do that already and you have a web site up and running. Once you have that running, then you have to tell IIS to Index the web site. You do that through the Home Directory tab of the site Properties. Make sure Index this Resource is checked. If not, check it.
All subfolders of your web site will be indexed if you do this. If you want to exclude a certain folder from being indexed (for example, your images folder), navigate to that folder in the console, go to its properties and uncheck Index this resource. There is another way to remove folders from a catalog, which I will go through later, but this is the recommended way for websites. This is where a good design of your website is necessary. Put all your images into a folder called images, for example, and turn off indexing for it. Group the rest of your content the same way, into folders that should be indexed and folders that should not. That way, you can just check or uncheck Index this resource on a folder’s properties and it propagates to all the folders and files under it.
By default, the Indexing Service will index HTML files, text files, Office 95 and later files, internet mail and news, and any other document type for which a filter is provided. For example, Adobe makes its own IFilter which, once installed, helps the service index Acrobat (pdf) files.
The next step is to create a new catalog to house all the information. It’s probably a good idea to create a new folder to use exclusively for your catalog(s). Do not save your catalog under a folder that is being indexed by the same catalog or any other. Your English site could be under C:\catalogs\english, your French at C:\catalogs\french, etc. First create those folders. Then open your Computer Management Console, and right click on the Indexing Service, or click on Action on top, and go to New > Catalog.
Type a name for your catalog, and pick a location where you want to save the catalog (our English catalog would go under C:\catalogs\english).
After you create it, you need to specify what to include or not include in it so that the service will start indexing that content. Right click on the catalog you want to edit, click on Properties and move to the Tracking tab. In this case we want it to point to a web site, so you have to tell it what web server to associate it with. Pick one from the pull-down list.
Now when you start the service, it will start indexing your web site. Under the Generation tab, you can either inherit the parent attributes or uncheck that option and customize them. I chose only to Generate Abstracts and not to index files with unknown extensions. An abstract is the short summary stored for each document; for an HTML page it comes from the description meta tag, which goes in the HEAD of the document. When indexing an HTML page, the service looks for one in the HEAD. If there is none, it picks the first 320 characters from the body to create the abstract. The maximum length for this string is 500 characters. To define your own abstract in an HTML document, add a DESCRIPTION meta tag in the head of your file, for example <meta name="description" content="A short summary of this page">.
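The abstract rule described above (use the description meta tag if there is one, otherwise the first 320 characters of the body) can be illustrated with a short Python sketch. This is only a model of the behaviour as the text describes it, not the Indexing Service's actual implementation; the names make_abstract and ABSTRACT_LENGTH are made up for the example.

```python
from html.parser import HTMLParser

ABSTRACT_LENGTH = 320  # default length described in the text (assumed here)


class _AbstractParser(HTMLParser):
    """Collects the description meta tag and the visible body text."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.body_parts = []
        self._in_body = False
        self._skip_depth = 0  # greater than zero while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_body and not self._skip_depth:
            self.body_parts.append(data)


def make_abstract(html, length=ABSTRACT_LENGTH):
    """Return the meta description, or the first `length` body characters."""
    parser = _AbstractParser()
    parser.feed(html)
    parser.close()
    if parser.description:
        return parser.description.strip()
    # Collapse runs of whitespace before cutting the text to length.
    text = " ".join(" ".join(parser.body_parts).split())
    return text[:length]
```

Running your own pages through something like this is a quick way to spot the ones that have no description tag and would end up with an abstract made of navigation text.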
As promised earlier, here is another way to add or delete folders to be indexed. Go to the Indexing Service Console and right click on Directories and go to New > Directory.
Doing so opens the Add Directory dialog box.
Choose the path of the folder you want to add to your catalog, and choose from the radio button whether you want it included or excluded from the index. You can add folders on remote computers as long as they are correctly mapped in your system. The Alias (UNC) is not necessary.
The “noise” file
Inside your C:\WINNT\system32 folder you should find a file called noise.eng. Open it with a text editor like Notepad. You will see that its contents are single words or numbers, each on its own line. This is the word exception file: when the Indexing Service indexes a document, it leaves out the words listed there. These are common words, such as "and" and "or", or numbers. You can edit this file, adding or deleting your own words. If you edit it, you will need to empty the catalogs and restart the Indexing Service so that the updated exception list can take effect.
There is a different noise file for every language: noise.enu is specifically for U.S. English, as opposed to noise.eng, which is for U.K. English. The French file is called noise.fra, the German noise.deu, and so on. You can see a list of all your files in the registry. Run regedit from Start and navigate to: HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Control > ContentIndex > Language. You will see a listing of all the languages; the noise file for each one is listed under the name NoiseFile.
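To get a feel for what the exception list does to a search, here is a small Python sketch that applies a noise list to a query string. It assumes the one-entry-per-line layout described above; the sample entries and the function names are invented for the illustration, and the real service applies the list at indexing time rather than like this.

```python
def load_noise_words(text):
    """Parse noise-file contents: one word or number per line."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def strip_noise(query, noise_words):
    """Return the query terms that are not on the exception list."""
    return [term for term in query.split() if term.lower() not in noise_words]


# Hypothetical sample in the same layout as a noise file.
sample = "and\nor\nthe\n1\n2\n"
noise = load_noise_words(sample)
print(strip_noise("the toys and 2 factories", noise))  # ['toys', 'factories']
```

A query made up entirely of noise words ends up with no terms at all, which is worth handling in your results page.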
First stop the Indexing Service. Once you do that, you can tune the performance of the engine. Go to All Tasks > Tune Performance:
You will see the following menu:
You can choose Dedicated Server if you want to make this catalog and this service immediately responsive to changes on the file system. You can also select Customize and then click on the button, which will give you this dialog:
Move the Indexing slider to Lazy for less immediate indexing or to Instant for immediate indexing of new and changed documents. Lazy indexing uses fewer resources; Instant indexing uses as much of the computer’s resources as it can. Move the Querying slider to Low load if you expect to process only a few queries at a time or to High load if you expect to process many queries at a time. Low load uses fewer resources; High load uses more. You can increase or decrease these settings as you see fit. Keep in mind that raising them will cause your server to use more resources for this activity. I have used the above setting for large sites with thousands of documents with success.
One thing about the Indexing Service’s resources you should know about: It is very demanding on the OS when it is first started as it tries to index everything in the catalog. It moves through pretty quickly, indexing thousands of html documents in just a few minutes. But once it finishes it just sits there, not really using many resources. It responds to file changes through the OS, so it knows to index a file once it’s changed/created/deleted. This way, you can keep the service running on a small computer and you still get good performance out of it.
The search input form (index.html)
This file is the form that accepts your search arguments.
I limit the query string that a user can input to 100 characters. That should be enough for everybody and helps prevent hacking. You can change this if you like.
The Scope is the folder to search. It tells the Indexing Service whether to search everything (/) or just under a specific folder of the site (/products/). You can go as deep as you like and it will only search under that folder: for example, /products/bicycles/electric/kids/ would search for documents only under the kids folder.
Sometimes users want to set how many results they see on one page: fewer or maybe more. You can allow them to do this through a simple pulldown menu as shown above, or through a text box where they type the number themselves.
It’s common to list results in order of best match. However, it’s possible to sort the resultset by any of the properties in the catalog. You can therefore allow the user to choose the sort order. Above, I give them the option of ordering the results by Rank, Size or Date Last Updated.
And last, the submit button.
Finally, show me some code! (runsearch.asp)
The picture above shows what the returned results should look like. This file, runsearch.asp is responsible for issuing the search against the catalog and properly displaying the returned results. The code should work right out of the box for you, as long as you change a few variables.
It consists of:
|Global variables||Look for a section called EDIT THESE…END EDIT somewhere near the beginning of this file and change those parameters to fit your system. Those should be the only ones you need to change; the rest is up to you.|
|Sub RunSearch()||This is the main sub that gets called when the page loads and then calls everything else. It creates a connection to the Indexing Service, gets records through GetRows(), loops through, checks, validates and formats the output.|
|Function BuildQuery(strScope, strQuery)||This function returns a full SQL command to use against the Indexing Service with ADO. Out of the box it searches htm, html, asp, ppt, doc, xls, txt, and pdf files, and it does not support boolean searches.|
|Sub WriteNavigation(strNavigation, intTotalRecords, intTotalPages)||This sub creates the text for the top navigation links that you see in the picture, i.e. moving from page to page.|
|Function FileSize(intFileSize)||Formats the size output of a file to KB, MB, GB, etc.|
|Function myFixDate(datWrite)||Formats the date last modified output to an international date format.|
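The two formatting helpers in the table are simple enough to sketch. The Python below shows one way they might behave; it is not the article's ASP code. The one-decimal rounding and the choice of YYYY-MM-DD as the "international" date format are assumptions made for the example.

```python
from datetime import datetime

UNITS = ["bytes", "KB", "MB", "GB", "TB"]


def file_size(num_bytes):
    """Format a byte count as bytes, KB, MB, GB or TB."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that keeps the number under 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == UNITS[-1]:
            if unit == "bytes":
                return f"{int(size)} bytes"
            return f"{size:.1f} {unit}"
        size /= 1024


def fix_date(last_written):
    """Format a datetime as YYYY-MM-DD (assumed international format)."""
    return last_written.strftime("%Y-%m-%d")


print(file_size(1536))                  # 1.5 KB
print(fix_date(datetime(2002, 4, 19)))  # 2002-04-19
```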
Let’s go through the code here in detail to help you understand what’s going on. I have added some error catching, to account for mistakes as well as for malicious users trying to break your site.
Between EDIT THESE…END EDIT is what you have to change to make it work for your site. One of these variables is called strCustomTitle. This is a little trick that I use to increase the ratings of my site, and you can do it too. Here’s how it works: when one of the public search engines visits your site to index it, one of the most important factors in rating your site is the <title> tags in your pages. You can increase your ranking by including the name of your site in your titles. Let’s say your site’s name is XYZ. Your titles could then all start with “XYZ – ” and then continue with a more descriptive title of the page.
This accomplishes 2 things:
- It boosts your rankings when it comes to your name
- It improves the readability of someone’s bookmarks to your site.
However, when I display the results of the search I use the title of the page as a link to the actual page. At this point, we do not want to show all our titles beginning with the same thing, so we simply edit the title before displaying it. Edit that variable in the code if you are going to use this, and leave it blank (strCustomTitle = "") if you are not going to use it. If that string is not empty, the code checks for a matching string at the beginning of each title in the displayed results and removes it. If the string is empty, the whole title is displayed as is.
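The title trick boils down to a prefix check. Here is a minimal Python sketch of that logic, with a hypothetical display_title function standing in for what the ASP page does with strCustomTitle:

```python
def display_title(title, custom_prefix=""):
    """Drop the site-wide prefix from a title, if one is configured and present."""
    if custom_prefix and title.startswith(custom_prefix):
        return title[len(custom_prefix):]
    return title


print(display_title("XYZ - Electric bikes", "XYZ - "))  # Electric bikes
print(display_title("About us", "XYZ - "))              # About us
```

Only a match at the very beginning of the title is removed, so a page that mentions the site name elsewhere in its title is left alone.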
The code above collects the user’s inputs and tries to make sure that they fall within certain limits. This also helps prevent hacker attacks. If everything is ok, it calls the main sub RunSearch() which does all the work.
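As an illustration of this kind of input checking, the Python sketch below clamps each form field to a safe value before anything is sent to the catalog. It is not the article's ASP code: the allowed page sizes, the sort keys and the function name are assumptions, while the 100-character limit and the folder-style scope come from the form described earlier.

```python
MAX_QUERY_LENGTH = 100
ALLOWED_PAGE_SIZES = (10, 25, 50)  # hypothetical pulldown values
ALLOWED_SORTS = {                  # hypothetical form value -> ORDER BY clause
    "rank": "rank DESC",
    "size": "size DESC",
    "date": "write DESC",
}


def clean_inputs(query, scope="/", page_size="10", sort="rank", page="1"):
    """Validate raw form values and fall back to defaults when they look wrong."""
    query = (query or "").strip()[:MAX_QUERY_LENGTH]
    if not query:
        raise ValueError("empty query")

    # The scope must be a site-relative folder with no quotes or parent references.
    scope = (scope or "/").strip()
    if not scope.startswith("/") or ".." in scope or '"' in scope or "'" in scope:
        scope = "/"
    if not scope.endswith("/"):
        scope += "/"

    try:
        page_size = int(page_size)
    except (TypeError, ValueError):
        page_size = ALLOWED_PAGE_SIZES[0]
    if page_size not in ALLOWED_PAGE_SIZES:
        page_size = ALLOWED_PAGE_SIZES[0]

    try:
        page = max(1, int(page))
    except (TypeError, ValueError):
        page = 1

    if sort not in ALLOWED_SORTS:
        sort = "rank"

    return {
        "query": query,
        "scope": scope,
        "page_size": page_size,
        "order_by": ALLOWED_SORTS[sort],
        "page": page,
    }
```

Mapping the sort field through a fixed table, rather than pasting the user's value into the query, means a visitor can never inject their own ORDER BY text.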
Plainly, RunSearch() calls everything else. It connects to the catalog, issues the query, returns a recordset, and then formats it appropriately and writes it out. The rest of the functions are responsible for formatting the resultset.
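Most of the formatting work is paging arithmetic: how many pages there are, and which rows belong on the current one. A hedged Python sketch of that calculation follows; the function names and the bracketed style for the current page are made up for the example, and the real page writes HTML links instead.

```python
import math


def page_slice(total_records, page_size, page):
    """Return (total_pages, page, start, end) for the requested page.

    start and end are zero-based row indexes, with end exclusive.
    """
    total_pages = max(1, math.ceil(total_records / page_size))
    page = min(max(1, page), total_pages)  # keep the page inside the valid range
    start = (page - 1) * page_size
    end = min(start + page_size, total_records)
    return total_pages, page, start, end


def navigation(total_pages, current):
    """Build a plain-text page list with the current page in brackets."""
    return " ".join(
        f"[{n}]" if n == current else str(n) for n in range(1, total_pages + 1)
    )


print(page_slice(45, 10, 5))  # (5, 5, 40, 45)
print(navigation(3, 2))       # 1 [2] 3
```

Clamping the page number here means a hand-edited URL asking for page 99 simply shows the last page instead of an error.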
The WHERE clause
The query that you create against the Indexing Service can be as complex as you want it. Here are some more things you can do with it:
|CONTAINS||The following line matches documents that contain toys or factories: WHERE CONTAINS('"toys" OR "factories"'). Toys is within 50 words or less of factories: WHERE CONTAINS('"toys" NEAR "factories"'). Note: this feature was cut from the RTM version of IS 3 at the last minute, and the documentation has not been updated to reflect the cut. The NEAR syntax is ignored, but there is no error message. The 50-word window is built into the FreeText ranking algorithm. In .NET the proximity operator works, but you still can’t specify the distance. To match toys, toy, toyed, etc.: WHERE CONTAINS('FORMSOF(INFLECTIONAL, "toy")')|
|FREETEXT||When you want to search for the best match for a word or a phrase: WHERE FREETEXT('toys for kids')|
|LIKE||Wildcards to perform matches: WHERE DocTitle LIKE '%toy%'|
|MATCHES||Uses regular expressions to perform matches. For example, all entries where DocAuthor starts with any character between a and e: WHERE MATCHES (DocAuthor, "[a-e]*")|
|NULL||Matching of null values: WHERE DocTitle IS NULL, or WHERE DocTitle IS NOT NULL|
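To show how a clause like these fits into a complete command, here is a Python sketch that assembles a FREETEXT query string for a given scope. It only builds the text; it is not the article's BuildQuery function, the column list is an assumption, and it leaves out the file-type filtering that the original performs. The scope is assumed to have been validated already so that it contains no quote characters.

```python
def sql_literal(text):
    """Double any single quotes so the text is safe inside a quoted literal."""
    return text.replace("'", "''")


def build_query(scope, query, order_by="rank DESC"):
    """Assemble an Indexing Service SQL command as a string (illustrative only)."""
    return (
        "SELECT DocTitle, vpath, filename, size, write, characterization, rank "
        f"FROM SCOPE('DEEP TRAVERSAL OF \"{scope}\"') "
        f"WHERE FREETEXT('{sql_literal(query)}') "
        f"ORDER BY {order_by}"
    )


print(build_query("/products/", "toys for kids"))
```

Swapping the FREETEXT clause for one of the others in the table is a matter of changing one line, which is why it helps to keep the query building in a function of its own.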
Well, you got the search engine working and everybody is happy. You are receiving kudos from everyone around. But if the search functionality on your site is vital, how can you ever know if something is wrong? Let’s talk about how we can take precautionary action to attempt to fix the service automatically, and be warned if something is wrong. Then you can really sit back and enjoy.
Go to Start > Programs > Administrative Tools > Services:
Double click, or right click and go to Properties, on the Indexing Service to open the Indexing Service Properties dialog. Click on the Recovery tab.
Here, you can define the actions to take once your service fails. You can try different scenarios that best fit your needs. I decided to try a restart on the First failure and then to Run a File on the Second failure. I created a folder called ServerScripts and placed my custom script files in there. The SendEmailOnServiceFail.vbs file first makes sure the service is down by attempting to shut it down again, and then tries to bring it back up. It then sends an email to notify someone that the service had to be restarted and may still not be working properly. This file uses WSH and CDONTS to send the email, and it needs to run with sufficient rights on the Windows system (for example, Administrative).
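The order of operations in that script (stop, start, then notify) can be sketched in a few lines. The Python below is an illustration of the flow only, not the VBScript file itself. It assumes cisvc is the short name of the Indexing Service, and it takes the command runner and the mail sender as parameters so that the logic can be tried without touching a real server.

```python
def recover_service(run, notify, service="cisvc"):
    """Stop the service, try to start it again, and report what happened.

    run(command_list) must return the command's exit code;
    notify(text) must deliver the message, for example by email.
    """
    # Make sure the service is really down; a failure here is not a problem.
    run(["net", "stop", service])
    started = run(["net", "start", service]) == 0
    status = "was restarted" if started else "could NOT be restarted"
    notify(f"The Indexing Service ({service}) failed and {status}. "
           "Please check the server.")
    return started


# Dry run with stand-ins for the real command runner and mailer.
log = []
recover_service(lambda cmd: log.append(cmd) or 0, print)
```

On a real server, run could wrap subprocess.call and notify could wrap whatever mail mechanism you already use.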
The default timeout for vbs files like these is 10 seconds. If you want to change that, right click on the vbs file, and click on Properties. Go to the Script tab, and then click on Stop script after specified number of seconds. When you do that, it will allow you to change the 10 seconds default value to whatever you want. When you click OK, the dialog will create a file in the same folder as your vbs file and give it the same name with the extension ".wsh". This is what a sample file would look like:
You can test it by double clicking on the vbs file to run it. So, in the future, if the service fails, you will receive an email alerting you of the fact. The email should look something like this:
Putting it all together
To summarize, here are the steps you need to take to make this work for you:
- create your website
- turn indexing on for the site
- install the Indexing Service
- create a catalog
- associate the new catalog with your site
- add/remove folders from catalog
- tune performance
- edit noise files
- copy the search input file (index.html) to your site
- copy the images for ranking to your site
- copy runsearch.asp to your site
- make sure the index.html file is posting to runsearch.asp
- edit the 3 variables in runsearch.asp file
- create recovery procedures
Running an ADO query may not be the most flexible way to work with the Indexing Service, but it is the simplest by far. For most cases this is good enough. You can do a lot more with this service. For example, you can have it index custom meta tags and then add those to your queries/results. Or you can export the catalog into a relational database like SQL Server, and then combine it with a content management system for a more advanced search. This article’s intention was to give you a quick way to get it up and running. Feel free to alter this as much as you like, and give me feedback. Maybe you can make my code faster or more reliable, or simply expand on it.