
Lucene: A Powerful & Free Search Engine by Apache

Evaluating search engines

Other widely used open source search engines include Swish-E, Glimpse, libibex, freeWAIS, and iSearch. Like any software package, each is optimized for use in particular situations; it is often difficult to deploy these tools outside of their intended domains. Consider the following features when evaluating a search engine:

  • Incremental versus batch indexing: Some search engines support only batch indexing; once they create an index for a set of documents, adding new documents becomes difficult without reindexing all the documents. Incremental indexing makes it easy to add documents to an existing index. For some applications, like those that handle live data feeds, incremental indexing is critical. Lucene supports both types of indexing.

  • Data sources: Many search engines can index only files or Web pages. This handicaps applications where indexed data comes from a database, or where multiple virtual documents exist in a single file, such as a ZIP archive. Lucene allows developers to deliver the document to the indexer through a String or an InputStream, so the indexer does not need to know where the data came from. However, with this approach, the developer must supply the appropriate readers for the data.

  • Indexing control: Some search engines can automatically crawl through a directory tree or a Website to find documents to index. While this is convenient if your data is already stored in this manner, crawler-based indexers often provide limited flexibility for applications that require fine-grained control over the indexed documents. Since Lucene operates primarily in incremental mode, it leaves finding and retrieving documents to the application, which then decides exactly what gets indexed.

  • File formats: Some search engines can only index text or HTML documents; others support a filter mechanism, which offers a simple alternative to indexing word processing documents, SGML documents, and other file formats. Lucene supports such a mechanism.

  • Content tagging: Some search engines treat a document as a single stream of tokens; others allow the specification of multiple data fields within a document, such as "subject," "abstract," "author," and "body." This permits semantically richer queries like "author contains Hamilton AND body contains Constitution." Lucene supports content tagging by treating documents as collections of fields, and supports queries that specify which field(s) to search.

  • Stop-word processing: Common words, such as "a," "and," and "the," add little value to a search index. But since these words are so common, cataloging them will contribute considerably to the indexing time and index size. Most search engines will not index certain words, called stop words. Some use a list of stop words, while others select stop words statistically. Lucene handles stop words with the more general Analyzer mechanism, to be described later, and provides the StopAnalyzer class, which eliminates stop words from the input stream.

  • Stemming: Often, a user desires a query for one word to match other similar words. For example, a query for "jump" should probably also match the words "jumped," "jumper," or "jumps." Reducing a word to its root form is called stemming. Lucene does not yet implement stemming, but you could easily add a stemmer through a more sophisticated Analyzer class.

  • Query features: Search engines support a variety of query features. Some support full Boolean queries; others support only AND queries. Some return a "relevance" score with each hit. Some can handle adjacency or proximity queries -- "search followed by engine" or "Knicks near Celtics" -- while others can search only on single keywords. Some can search multiple indexes at once and merge the results to give a meaningful relevance score. Lucene supports a wide range of query features, including all of those listed above. However, Lucene does not support the valuable Soundex, or "sounds like," query.

  • Concurrency: Can multiple users search an index at the same time? Can a user search an index while another updates it? Lucene allows users to search an index transactionally, even if another user is simultaneously updating the index.

  • Non-English support: Many search engines implicitly assume that English is the target language; this is evident in areas such as stop-word lists, stemming algorithms, and the use of proximity to match phrase queries. As Lucene preprocesses the input stream through the Analyzer class provided by the developer, it is possible to perform language-specific filtering.
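
To make several of the features above concrete (incremental indexing, content tagging, stop-word processing, stemming, and a Boolean AND query), here is a minimal sketch of a toy in-memory index in plain Java. It is not Lucene code and does not use Lucene's API; the TinyIndex class, its methods, the stop-word list, and the crude suffix-stripping rule are all assumptions made up for illustration. A real stemmer and a real analyzer would be far more careful.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory index. Hypothetical names throughout; this is not Lucene. */
class TinyIndex {
    // A small hand-picked stop-word list (an assumption for this sketch).
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

    // Postings: "field:term" -> sorted ids of the documents containing that term.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Lowercase, split on non-letters, drop stop words, then stem each token. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            terms.add(stem(token));
        }
        return terms;
    }

    /** Crude suffix stripping; a stand-in for a real stemming algorithm. */
    static String stem(String word) {
        for (String suffix : new String[] {"ed", "er", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Incremental indexing: add one fielded document without rebuilding the index. */
    int addDocument(Map<String, String> fields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String term : analyze(field.getValue())) {
                postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                        .add(docId);
            }
        }
        return docId;
    }

    /** Ids of documents whose given field contains the (analyzed) word. */
    Set<Integer> search(String field, String word) {
        List<String> terms = analyze(word);
        if (terms.isEmpty()) return new TreeSet<>(); // the word was a stop word
        return new TreeSet<>(postings.getOrDefault(field + ":" + terms.get(0), new TreeSet<>()));
    }

    /** Boolean AND of two field queries, e.g. author contains X AND body contains Y. */
    Set<Integer> searchAnd(String field1, String word1, String field2, String word2) {
        Set<Integer> hits = search(field1, word1);
        hits.retainAll(search(field2, word2));
        return hits;
    }
}
```

With this sketch, a document added with an "author" field of "Hamilton" and a "body" field mentioning the Constitution is returned by searchAnd("author", "Hamilton", "body", "Constitution"), and a search for "jump" in a field also matches documents containing "jumped," "jumper," or "jumps," because queries and documents pass through the same analyze step.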