Looking For Answers
Today, when you think about looking for something on the Web, there's only one search engine you turn to - Google. With its speed, unique indexing technology and huge database of Web pages, Google has rapidly become the best search engine on the Web, with results that are frighteningly accurate and search algorithms that are optimized for the hyperlinked, diversified information structure of the Web.
However, this isn't an article about Google - more than enough has already been written about it, by people far more experienced and knowledgeable than yours truly. Rather, this is about something related - setting up a search engine for your own Web site, so that users can locate what they're looking for quickly and efficiently. If you've ever attempted this exercise, you know that it can take anywhere from a few hours to a couple of weeks, depending on the requirements and the amount of precision needed.
There are many ways to index the content of your site. You could store the content in a database, index it and use SQL queries to look for records matching the search string. You could scan the site content to build word frequency tables, and use those tables to locate matching pages. You could use a natural-language or fuzzy search engine to create an index for your site and return results scored by relevance. Or you could save yourself a lot of development time and effort, and just install ht://Dig.
What is ht://Dig? Good question. Come on in and find out.
Digging Deep
In the words of its official Web site at http://www.htdig.org/, ht://Dig is "a complete world wide web indexing and searching system for a domain or intranet...meant to cover the search needs for a single company, campus, or even a particular sub section of a web site." ht://Dig was originally developed at San Diego State University, and is today very popular amongst developers looking to quickly add search engine capabilities to a Web site.
ht://Dig works by traversing a Web site and creating a database of all the unique words it finds as it follows hyperlinks from one page to another. This database, together with information on the URL associated with each document, is created every time you request a re-indexing of the site, and is merged with the results of previous index runs to create the foundation for the search engine.
Every time a search is executed, this database is scanned for matches to the search string and a list of results retrieved. The matches are further ranked according to an internal scoring system to filter down to the most relevant, and the results returned to the user, together with links to the pages on which the matches occurred. The process, though somewhat complicated, is nonetheless extremely fast and - thanks to intelligent search algorithms and scoring systems - also very accurate.
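To make the idea of a word database a little more concrete, here's a toy sketch built from standard shell tools. It is emphatically not how ht://Dig stores or scores its data (ht://Dig uses its own database files and ranking rules); the function names "build_index" and "lookup", the tab-separated layout and the "rank by raw word count" rule are all my own assumptions, chosen purely for illustration.

```shell
#!/bin/sh
# Toy word index: NOT ht://Dig's real format, just the general idea.

# build_index DIR: print "word<TAB>file<TAB>count" for every regular file
# directly under DIR (hypothetical helper, plain-text files assumed).
build_index() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # Lowercase the text, put one word on each line, count duplicates.
        tr '[:upper:]' '[:lower:]' < "$f" | tr -cs '[:alnum:]' '\n' |
            grep -v '^$' | sort | uniq -c |
            awk -v file="$f" '{ printf "%s\t%s\t%d\n", $2, file, $1 }'
    done
}

# lookup INDEX_FILE WORD: print the files containing WORD, with the file
# that uses it most often listed first (a crude stand-in for scoring).
lookup() {
    word=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    awk -F '\t' -v w="$word" '$1 == w { print $3 "\t" $2 }' "$1" |
        sort -rn | cut -f2
}
```

You'd use it along the lines of "build_index /some/docs > /tmp/site.idx", followed by "lookup /tmp/site.idx linux". A real engine adds link traversal, stop words, fuzzy matching and a proper relevance score on top of this.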
ht://Dig also supports Boolean searches, which make it possible to selectively widen or narrow a search; fuzzy searching, in which the search is automatically expanded to include similar-sounding words, synonyms and plurals; depth-limited searching, in which only documents which are at a particular depth from the tree root are searched; and META-tag indexing for more accurate search results. Both search and result pages can be extensively customized in the ht://Dig system, and - since the source code is freely available under the GPL - developers can even modify and enhance the application to their own specific needs.
Now that you have the background - let's get to work, by installing and configuring ht://Dig.
Source Control
The first order of business is to install ht://Dig on the Linux box you plan to use as a Web server. Drop by the official ht://Dig Web site at http://www.htdig.org/ and get yourself the latest stable release of the software (this tutorial uses ht://Dig 3.1.6). Note that you will need a C compiler and a running Web server in order to use the software (this tutorial uses GCC 3.2 and Apache 1.3.26).
Once you've downloaded the source code archive to your Linux server, log in as "root"
$ su -
Password: ****
and extract the source to a temporary directory.
$ cd /tmp
$ tar -xzvf /home/me/htdig-3.1.6.tar.gz
The next step is to configure the package using the provided "configure" script. Before doing this, though, there are a couple of decisions you need to make.
There are two primary components to ht://Dig: the binaries used to index the site and create the database of search words, and the program used to perform a search on this database and return a result set. The indexing tools, and the database that results from their use, can be placed anywhere in the filesystem, but the search binary must be located in the Web server's CGI directory. Additionally, the images used in the result page created after an ht://Dig search must also be located under the Web server root, so that they appear correctly when the page is viewed through a Web browser (assuming, of course, that you're using the default result page templates).
Given this information, and assuming the Web server is located in "/usr/local/apache/", the server's CGI area is "/usr/local/apache/cgi-bin/" and the server's document root is "/usr/local/apache/htdocs/", you will need to give the "configure" script the following arguments:
$ cd /tmp/htdig-3.1.6
$ ./configure --prefix=/usr/local/htdig --with-cgi-bin-dir=/usr/local/apache/cgi-bin/ --with-image-dir=/usr/local/apache/htdocs/htdig/images --with-image-url-prefix=/htdig/images --with-search-dir=/usr/local/apache/htdocs/htdig/sample
This tells the system to install the indexing tools to "/usr/local/htdig/", the CGI search binary to "/usr/local/apache/cgi-bin/", and the result page images and a sample search form to directories under "/usr/local/apache/htdocs/htdig/".
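If your Web server lives somewhere else, every one of those paths changes in step. The small helper below is only a sketch (the function name "htdig_configure_cmd" is made up, and it assumes the same "cgi-bin" and "htdocs" layout under the server root that this tutorial uses); it prints the "configure" command for a given server root and install prefix so that you can eyeball it before running it.

```shell
#!/bin/sh
# htdig_configure_cmd [APACHE_ROOT] [PREFIX]: print the ./configure command.
# Hypothetical helper; defaults mirror the paths used in this tutorial.
htdig_configure_cmd() {
    apache_root=${1:-/usr/local/apache}
    prefix=${2:-/usr/local/htdig}
    printf './configure --prefix=%s' "$prefix"
    printf ' --with-cgi-bin-dir=%s/cgi-bin/' "$apache_root"
    printf ' --with-image-dir=%s/htdocs/htdig/images' "$apache_root"
    printf ' --with-image-url-prefix=/htdig/images'
    printf ' --with-search-dir=%s/htdocs/htdig/sample\n' "$apache_root"
}
```

Called with no arguments, it prints the command shown above; called as "htdig_configure_cmd /opt/apache", it rewrites the three server-related paths and leaves the rest alone.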
In case the "configure" script barfs and spits messages at you about "installing the libstdc++ library", and if you're sure the library is already installed (the default situation if you're using GCC 3.x), you can try modifying the command above to include some additional variables:
$ cd /tmp/htdig-3.1.6
$ CXXFLAGS=-Wno-deprecated CPPFLAGS=-Wno-deprecated ./configure --prefix=/usr/local/htdig --with-cgi-bin-dir=/usr/local/apache/cgi-bin/ --with-image-dir=/usr/local/apache/htdocs/htdig/images --with-image-url-prefix=/htdig/images --with-search-dir=/usr/local/apache/htdocs/htdig/sample
Next, compile and install it.
$ make
$ make install
ht://Dig should now have been installed to the directory "/usr/local/htdig".
You can verify this with a quick recursive listing of that directory - here's what you should see.
$ ls -lR /usr/local/htdig/
total 16
drwxr-xr-x 2 root root 4096 Oct 15 18:32 bin/
drwxr-xr-x 2 root root 4096 Oct 15 18:39 common/
drwxr-xr-x 2 root root 4096 Oct 15 18:32 conf/
drwxr-xr-x 2 root root 4096 Oct 15 18:44 db/
/usr/local/htdig/bin:
total 2860
-rwxr-xr-x 1 root root 580424 Oct 15 18:32 htdig*
-rwxr-xr-x 1 root root 580424 Oct 15 18:32 htdump*
-rwxr-xr-x 1 root root 390930 Oct 15 18:32 htfuzzy*
-rwxr-xr-x 1 root root 580424 Oct 15 18:32 htload*
-rwxr-xr-x 1 root root 381489 Oct 15 18:32 htmerge*
-rwxr-xr-x 1 root root 376361 Oct 15 18:32 htnotify*
-rwxr-xr-x 1 root root 2158 Oct 15 18:32 rundig*
/usr/local/htdig/common:
total 6248
-rw-r--r-- 1 root root 84 Oct 15 18:32 bad_words
-rw-r--r-- 1 root root 923308 Oct 15 18:32 english.0
-rw-r--r-- 1 root root 5756 Oct 15 18:32 english.aff
-rw-r--r-- 1 root root 197 Oct 15 18:32 footer.html
-rw-r--r-- 1 root root 891 Oct 15 18:32 header.html
-rw-r--r-- 1 root root 194 Oct 15 18:32 long.html
-rw-r--r-- 1 root root 1404 Oct 15 18:32 nomatch.html
-rw-r--r-- 1 root root 2285568 Oct 15 18:39 root2word.db
-rw-r--r-- 1 root root 67 Oct 15 18:32 short.html
-rw-r--r-- 1 root root 14481 Oct 15 18:32 synonyms
-rw-r--r-- 1 root root 90112 Oct 15 18:39 synonyms.db
-rw-r--r-- 1 root root 1275 Oct 15 18:32 syntax.html
-rw-r--r-- 1 root root 3022848 Oct 15 18:39 word2root.db
-rw-r--r-- 1 root root 1108 Oct 15 18:32 wrapper.html
/usr/local/htdig/conf:
total 12
-rw-r--r-- 1 root root 8580 Oct 15 18:42 htdig.conf
/usr/local/htdig/db:
total 236
-rw-r--r-- 1 root root 63488 Oct 15 18:44 db.docdb
-rw-r--r-- 1 root root 11991 Oct 15 18:42 db.docs
-rw-r--r-- 1 root root 5120 Oct 15 18:44 db.docs.index
-rw-r--r-- 1 root root 54004 Oct 15 18:44 db.wordlist
-rw-r--r-- 1 root root 82944 Oct 15 18:44 db.words.db
The search binary should have been installed to "/usr/local/apache/cgi-bin/htsearch",
$ ls -l /usr/local/apache/cgi-bin
total 560
-rwxr-xr-x 1 root root 558796 Oct 15 18:32 htsearch*
-rw-r--r-- 1 root root 268 Aug 18 16:37 printenv
-rw-r--r-- 1 root root 757 Aug 18 16:37 test-cgi
with a sample search form and images to "/usr/local/apache/htdocs/htdig/".
For an explanation of what each binary does, visit the ht://Dig documentation at http://www.htdig.org/.
Once you've got ht://Dig installed, the next step is to configure it and start indexing your site. Let's look at that next.
Building An Index
ht://Dig is configured via a single configuration file, named "htdig.conf" and located in the installation's "conf" directory. Most of the time, this configuration file is set up automatically based on the arguments you passed to the "configure" script, and only needs to be altered to reflect the URL at which indexing should begin.
Pop open this file in your favourite text editor, and look for the "start_url" variable:
#
# This specifies the URL where the robot (htdig) will start. You can specify
# multiple URLs here. Just separate them by some whitespace.
# The example here will cause the ht://Dig homepage and related pages to be
# indexed.
# You could also index all the URLs in a file like so:
# start_url: `${common_dir}/start.url`
#
start_url: http://localhost/
Alter this variable to reflect the URL at which indexing should begin, and save the changes back to the file.
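If you'd rather script the change than make it by hand, something like the following would do. Treat it as a sketch under a few assumptions: the function name "set_start_url" is my own, the setting is assumed to sit on a single line, and the URL must not contain "|" or "&" (both are special to the "sed" expression used here).

```shell
#!/bin/sh
# set_start_url CONF URL: rewrite the single-line "start_url:" setting.
# Hypothetical helper; commented-out examples ("# start_url: ...") are left
# untouched because only lines that BEGIN with "start_url:" are matched.
set_start_url() {
    conf=$1
    url=$2
    cp "$conf" "$conf.bak" || return 1     # keep a backup of the original
    sed "s|^start_url:.*|start_url: $url|" "$conf.bak" > "$conf"
}
```

For this tutorial's layout, the call would be "set_start_url /usr/local/htdig/conf/htdig.conf http://www.example.com/" (with your own URL in place of the example one).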
You can also alter a number of other variables that control ht://Dig behaviour through the configuration file. Amongst other things, you can modify the location for the search database, specify a list of URLs and extensions to be bypassed while indexing, enable or disable the fuzzy logic algorithms, limit the amount of content stored in the search database and control the maximum amount of data read over an HTTP connection. For more information on these variables, examine the notes in the configuration file, and also take a look at the ht://Dig documentation, at http://www.htdig.org/.
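Before changing any of these, it helps to see what is currently set. The sketch below simply greps the configuration file for the attributes you name. The helper "show_settings" is hypothetical, and the attribute names in the usage note ("database_dir", "exclude_urls", "bad_extensions", "search_algorithm", "max_head_length", "max_doc_size") are the ones I believe correspond to the options just described; do check them against the comments in your own "htdig.conf".

```shell
#!/bin/sh
# show_settings CONF NAME...: print the active (uncommented) line for each
# named attribute in CONF. Hypothetical helper; values continued onto
# following lines with a trailing "\" are not followed.
show_settings() {
    conf=$1
    shift
    for name in "$@"; do
        grep "^$name:" "$conf" ||
            printf '%s: (not set in %s)\n' "$name" "$conf"
    done
}
```

A typical call might be "show_settings /usr/local/htdig/conf/htdig.conf database_dir exclude_urls bad_extensions search_algorithm max_head_length max_doc_size".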
The next step is to actually build the search database. As noted previously, when indexing a Web site, ht://Dig recursively spiders the site(s) and builds an index of all the unique words it finds. This process is activated via the "rundig" script, found in the installation's "bin" directory:
$ /usr/local/htdig/bin/rundig
New server: localhost, 80
0:0:0:http://localhost/: +* size = 487
1:1:1:http://localhost/company/: -+++* size = 2867
2:2:2:http://localhost/services/: -***+++++- size = 5219
...
htmerge: Sorting...
htmerge: Merging...
htmerge: 100:creative
htmerge: 200:good
htmerge: 300:online
htmerge: 400:specifically
...
htfuzzy/endings: words: 13200
htfuzzy/endings
htfuzzy/synonyms: 1519 worshipping
htfuzzy/synonyms: Done.
htfuzzy: Done.
The "rundig" script looks up the configuration file to figure out which URL to use as the root for indexing, and begins traversing and scanning the pages under that URL.
Once it's done, the search database will have been created (in the installation's "db" directory) and is ready for use. The next step is to integrate the ht://Dig search form and form processor into the Web site.
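Since the database only reflects the site as it was at the last run, you'll probably want to re-run "rundig" on a schedule. Here's a hedged sketch of a wrapper that refuses to start a second run while one is still in progress. The function name "reindex", the lock directory location and the "RUNDIG" and "LOCKDIR" overrides are all my own inventions; only the "rundig" path comes from this tutorial's install prefix.

```shell
#!/bin/sh
# reindex: run rundig unless another run is still in progress.
# RUNDIG and LOCKDIR may be overridden; the defaults are assumptions.
reindex() {
    rundig=${RUNDIG:-/usr/local/htdig/bin/rundig}
    lockdir=${LOCKDIR:-/tmp/htdig-reindex.lock}
    # mkdir is atomic: it fails if the lock directory already exists.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "reindex: another run holds $lockdir, skipping" >&2
        return 1
    fi
    "$rundig"
    status=$?
    rmdir "$lockdir"        # release the lock whether or not rundig succeeded
    return "$status"
}
```

Put the function in a script that ends by calling "reindex", and have cron (or whatever scheduler you prefer) run that script as often as your content changes.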
A Well-Formed Plan
When ht://Dig is first installed, a sample search form is automatically installed into the directory specified via the "--with-search-dir" configuration parameter. In this particular example, I had specified that the form be installed to "/usr/local/apache/htdocs/htdig/sample" - so trot on over there and take a look inside:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>ht://Dig WWW Search</title>
</head>
<body bgcolor="#eef7ff">
<h1>
<a href="http://www.htdig.org"><IMG SRC="/htdig/images/htdig.gif" align="bottom" alt="ht://Dig" border="0"></a>
WWW Site Search</h1>
<hr noshade size="4">
This search will allow you to search the contents of
all the publicly available WWW documents at this site.
<br>
<p>
<form method="post" action="/cgi-bin/htsearch">
<font size="-1">
Match: <select name="method">
<option value="and">All
<option value="or">Any
<option value="boolean">Boolean
</select>
Format: <select name="format">
<option value="builtin-long">Long
<option value="builtin-short">Short
</select>
Sort by: <select name="sort">
<option value="score">Score
<option value="time">Time
<option value="title">Title
<option value="revscore">Reverse Score
<option value="revtime">Reverse Time
<option value="revtitle">Reverse Title
</select>
</font>
<input type="hidden" name="config" value="htdig">
<input type="hidden" name="restrict" value="">
<input type="hidden" name="exclude" value="">
<br>
Search:
<input type="text" size="30" name="words" value="">
<input type="submit" value="Search">
</form>
<hr noshade size="4">
</body>
</html>
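The field names in that form ("words", "method", "format", "config" and so on) are what the search program actually reads. As a rough sketch, and assuming that "htsearch" accepts the same fields in a GET query string as it does from this POST form, you could assemble a search URL by hand for testing. The helper name "htsearch_url" and the "HTSEARCH_BASE" override are hypothetical.

```shell
#!/bin/sh
# htsearch_url WORDS [METHOD] [FORMAT]: print a GET URL for the search CGI.
# Hypothetical helper. Only spaces are encoded (as "+"); other special
# characters are passed through, so stick to simple alphanumeric queries.
htsearch_url() {
    base=${HTSEARCH_BASE:-http://localhost/cgi-bin/htsearch}
    words=$(printf '%s' "$1" | tr ' ' '+')
    printf '%s?config=htdig&method=%s&format=%s&words=%s\n' \
        "$base" "${2:-and}" "${3:-builtin-long}" "$words"
}
```

Paste the output of, say, "htsearch_url 'web design' or builtin-short" into a browser (or hand it to a command-line HTTP client) to see the raw result page without going through the form.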
When you view this form through your Web browser, you should see something like this:
Enter a search string into the form field, and ht://Dig should go to work processing your search request. Here's what the result looks like:
Needless to say, you can customize this output, and even the manner in which the search is carried out. If, for example, you tell ht://Dig to display the results in "short" rather than "long" format, you'll see something like this:
You can also perform a Boolean search, simply by selecting "Boolean" from the drop-down list:
Custom Job
ht://Dig allows you to customize both the search form, and the result page generated from a query. In order to demonstrate, I'll create a plain-vanilla search form, called "search.html", which looks like this:
<form method="post" action="/cgi-bin/htsearch">
<input type="text" name="words" size="15">
<input type="submit" value="Begin Search">
</form>
There are a couple of important things to note here. The first is the ACTION attribute of the