
Indexing Speed Factors

If you are using Lucene in a non-trivial application, you will want to ensure optimal indexing performance. The bottleneck of a typical text-indexing application is the process of writing index files onto a disk. Therefore, we need to instruct Lucene to be smart about adding and merging segments while indexing documents.

When new documents are added to a Lucene index, they are initially stored in memory instead of being immediately written to the disk. This is done for performance reasons. The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor instance variable. This value tells Lucene how many documents to store in memory before writing them to the disk, as well as how often to merge multiple segments together. With the default value of 10, Lucene will store 10 documents in memory before writing them to a single segment on the disk. The mergeFactor value of 10 also means that once 10 segments of the same size have accumulated on the disk, Lucene will merge them into a single, larger segment. (There is a small exception to this rule, which I shall explain shortly.)

For instance, if we set mergeFactor to 10, a new segment will be created on the disk for every 10 documents added to the index. When the 10th segment of size 10 is added, all 10 will be merged into a single segment of size 100. When 10 such segments of size 100 have been added, they will be merged into a single segment containing 1000 documents, and so on. Therefore, at any time, the index will contain no more than 9 segments of each size (10, 100, 1000, and so on).
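The arithmetic in this example can be sketched in a few lines of plain Java. This is only a simplified model of the behavior described above, not Lucene code: the class and method names (SegmentLayout, segmentsBySize, bufferedDocs) are hypothetical, and the sketch assumes no other setting limits merging. Under that model, the number of segments of each size is simply a digit of the flush count written in base mergeFactor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the merge rule described in the text; not part of Lucene.
class SegmentLayout {

    // Maps segment size -> number of segments of that size on disk
    // after docCount documents have been added (mergeFactor must be >= 2).
    static Map<Long, Integer> segmentsBySize(long docCount, int mergeFactor) {
        Map<Long, Integer> layout = new LinkedHashMap<>();
        long flushed = docCount / mergeFactor; // flushes of mergeFactor documents so far
        long size = mergeFactor;               // size of the smallest on-disk segment
        while (flushed > 0) {
            int count = (int) (flushed % mergeFactor); // always between 0 and mergeFactor - 1
            if (count > 0) {
                layout.put(size, count);
            }
            flushed /= mergeFactor; // every mergeFactor segments became one larger segment
            size *= mergeFactor;
        }
        return layout;
    }

    // Documents still held in memory, waiting for the next flush.
    static long bufferedDocs(long docCount, int mergeFactor) {
        return docCount % mergeFactor;
    }

    public static void main(String[] args) {
        // prints {10=4, 100=3, 1000=2}, buffered=5
        System.out.println(segmentsBySize(2345, 10) + ", buffered=" + bufferedDocs(2345, 10));
    }
}
```

With 2345 documents and a mergeFactor of 10, the model gives 2 segments of size 1000, 3 of size 100, and 4 of size 10, with 5 documents still in memory. No count can exceed 9, because the 10th segment of any size triggers a merge.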

The exception noted earlier has to do with another IndexWriter instance variable: maxMergeDocs. While merging segments, Lucene will ensure that no segment with more than maxMergeDocs documents is created. For instance, if we set maxMergeDocs to 1000, when we add the 10,000th document, instead of merging multiple segments into a single segment of size 10,000, Lucene will create a 10th segment of size 1000, and keep adding segments of size 1000 for every 1000 documents added.
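To make the effect of maxMergeDocs concrete, here is a small simulation in plain Java. It is a sketch under stated assumptions rather than Lucene's actual merge code: the MergeSimulator class and its methods are hypothetical, and it models only the rules described in this article (flush every mergeFactor documents, merge mergeFactor equal-sized segments, never exceed maxMergeDocs).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of mergeFactor and maxMergeDocs; not part of Lucene.
class MergeSimulator {
    private final int mergeFactor;
    private final int maxMergeDocs;
    private final List<Integer> segments = new ArrayList<>(); // on-disk segment sizes, oldest first
    private int buffered = 0;                                 // documents still in memory

    MergeSimulator(int mergeFactor, int maxMergeDocs) {
        this.mergeFactor = mergeFactor;
        this.maxMergeDocs = maxMergeDocs;
    }

    void addDocument() {
        buffered++;
        if (buffered == mergeFactor) {
            segments.add(buffered); // flush the in-memory documents as a new segment
            buffered = 0;
            mergeIfNeeded();
        }
    }

    // Merge the newest mergeFactor segments while they are all the same size
    // and the merged segment would not exceed maxMergeDocs.
    private void mergeIfNeeded() {
        while (segments.size() >= mergeFactor) {
            int from = segments.size() - mergeFactor;
            int size = segments.get(from);
            long merged = (long) size * mergeFactor;
            if (merged > maxMergeDocs) {
                return; // the cap forbids creating a segment this large
            }
            for (int i = from + 1; i < segments.size(); i++) {
                if (segments.get(i) != size) {
                    return; // not enough equal-sized segments yet
                }
            }
            segments.subList(from, segments.size()).clear();
            segments.add((int) merged);
        }
    }

    List<Integer> segments() {
        return new ArrayList<>(segments);
    }

    public static void main(String[] args) {
        MergeSimulator capped = new MergeSimulator(10, 1000);
        MergeSimulator uncapped = new MergeSimulator(10, Integer.MAX_VALUE);
        for (int i = 0; i < 10000; i++) {
            capped.addDocument();
            uncapped.addDocument();
        }
        System.out.println("maxMergeDocs=1000:      " + capped.segments());
        System.out.println("maxMergeDocs=MAX_VALUE: " + uncapped.segments());
    }
}
```

After 10,000 documents, the capped simulator holds ten segments of 1000 documents each, while the uncapped one holds a single segment of 10,000 documents.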

The default value of maxMergeDocs is Integer.MAX_VALUE. In my experience, one rarely needs to change this value.

Now that I have explained how mergeFactor and maxMergeDocs work, you can see that using a higher value for mergeFactor will cause Lucene to use more RAM, but will let Lucene write data to disk less frequently, which will speed up the indexing process. A smaller mergeFactor will use less memory and will cause the index to be updated more frequently, which will make it more up-to-date, but will also slow down the indexing process. Similarly, a larger maxMergeDocs is better suited for batch indexing, and a smaller maxMergeDocs is better for more interactive indexing.
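The trade-off can be illustrated by counting disk operations under the same simplified model. The sketch below is plain Java with hypothetical names (MergeFactorTradeoff, flushes, merges). It assumes the default maxMergeDocs, and it counts operations only, so it says nothing about actual indexing times or memory use in bytes.

```java
// Counts flushes and merges under the simplified model in the text; not part of Lucene.
class MergeFactorTradeoff {

    // One flush to disk happens for every mergeFactor documents added.
    static long flushes(long docCount, int mergeFactor) {
        return docCount / mergeFactor;
    }

    // One merge happens each time a segment of size mergeFactor^k (k >= 2) is created.
    static long merges(long docCount, int mergeFactor) {
        long total = 0;
        long segmentSize = (long) mergeFactor * mergeFactor; // smallest merged segment
        while (segmentSize <= docCount) {
            total += docCount / segmentSize;
            segmentSize *= mergeFactor;
        }
        return total;
    }

    public static void main(String[] args) {
        long docCount = 10000;
        for (int mergeFactor : new int[] {10, 100, 1000}) {
            System.out.println("mergeFactor=" + mergeFactor
                    + " docsHeldInMemory=" + mergeFactor
                    + " flushes=" + flushes(docCount, mergeFactor)
                    + " merges=" + merges(docCount, mergeFactor));
        }
    }
}
```

For 10,000 documents, this model gives 1000 flushes and 111 merges with a mergeFactor of 10, but only 100 flushes and 1 merge with a mergeFactor of 100, at the cost of holding 100 documents in memory instead of 10.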