Musicbrainz have been looking at improving the performance of their search server. When search queries are sent to Musicbrainz they do not access the database, instead they access a Search Index that is built from the database and contains the information to be searched and allows full test search and other features not usually available from a database.
Currently Lucene is used, but is accessed using Python with as pylucene. In an attempt to boost performance a simplified version of the search code was developed using pure Java available from here, but this still didn't seem to be giving the required performance enhancements.
Ive tried some tuning mechanisms and code changes to see if I can make some improvements or at least get some benchmark figures.
The tests were performed on a MacBook Pro 2.66Ghz Core 2 Duo with 4GB of 1067 DDR3 RAM using OSX 10.5, So it is a good lab top but doesn't compare so well with a desktop of server. (I used a Macbook because it was 64bit so I could use 64 bit Java to address more than 2GB of memory whereas my Window PC is not, and I have no native linux desktop).
Summary of results:
Track OR query:Single Threaded test with insufficient memory, index on hard drive :1.91 query /sec
Track OR query:Single Threaded test with enough memory, index on hard drive :24.66 track query /sec
Track OR query:Best Multi Threaded test :43.96 queries /sec
Track query:Best Multi Threaded test :59.47 queries /sec
Release query:Threaded test :252.75 queries /sec
I created a test set of 10,011 track titles and 10,011 releases from the database, and I rebuilt the indexes with StopWords Filter removed to solve the stop words bug. I then created a test program to fire requests to the jetty (servlet searcher) from multiple threads, but I was getting inconsistent results and with Jetty not being the thing that was being tested I removed this from the equation.
I restarted by creating a test program that loaded pairs of track/release records into a queue, and then creating a number of threads to read the next pair of the queue and then send these directly to a Search class, this would perform the search and find the best hits and then return, the test results are for this setup. I found myself running a few tests, then making some more code changes to see if it made any difference, this is what I found:
Index Directory Location:
The macbook has a SSD drive, and external hard drive, and of course the index could also be loaded in memory. I expected loading the index into memory would give vastly superior performance but when I tried loading the Track Index (about 2.5 GB) the performance was terrible, I concluded that it must be swapping memory because of the memory required by the OS itself. I did some with the smaller Release Index as well, here the RAM Directory performed as well as the others but not any better. The SSD performed much better than the hard drive when I ran tests with insufficient memory, but when enough memory was allocated there was little difference between the two.
Memory allocated to JVM: With the default (of 64MB) it performed very badly on the hard drive, but when I increased it to 2GB there was a big improvement, but further adjustments (up and down) didn't make much difference. So it seems you need a reasonable but not ridiculous memory for decent results.
Code Improvements: Posted a few questions on the lucene mailing list and was given some optimizations for how the index is opened
new IndexSearcher(IndexReader.open(new NIODSDirectory(new File indexDir + "track_Index"),null),true)));
and on iterating through the results.
Opening using the index using this new method doubled the number of queries that could be processed by reducing contention on the index searcher, I haven't yet tried the iterating query improvements.
No of Threads:I ran tests using just a single thread at first then increased the number of threads to find the optimum throughput. With the code improvement in place I found you needed at least 30 threads for the best results , but additional threads didn't give further improvements. If the tests were performed on a Quad CPU or better I expect more threads would give more gains.
Query type:I tried an OR and a simple query against the track index
type=track&query=track:"trackname" OR release:"releasename"
Of course the OR query was slower , but not exponentially so - it was about 30% slower.
I then tried querying the release index, because it is much smaller the results were much better, with a 400% improvement in speed.
The full results are available here
and the amended zip of the code can be found here.