Official: Strigi fastest and smallest

2007-01-17

Today two Sun employees, Michal Pryc and Steven Xusheng Hou, published a comparison of four desktop indexers: Beagle, JIndex, Tracker and, of course, Strigi. The work is quite extensive and is intended both for internal use at Sun and as feedback to the developers of the software.

The document is good news for Strigi. The study shows that it uses the smallest amount of RAM (Tracker uses just as little if you consider the error margin; the other two used at least 15 times as much) and that it is far faster than the rest. See Table 5 in the document.

Here I reproduce its contents:

                               Beagle            JIndex            Tracker           Strigi
Number/size of TXT files       10 000 / 168 MB   10 000 / 168 MB   10 000 / 168 MB   10 000 / 168 MB
Size of the index database     62 MB             93 MB             140 MB            119 MB
Time of indexing [hr:min:sec]  02:18:05          03:02:55          03:03:14          00:04:26
CPU time [hr:min:sec]          00:12:05          00:09:15          02:22:40          00:03:44
Average CPU usage              8.79%             5%                77.73%            82.75%

Why is Strigi so fast? Two reasons. First, it does not artificially slow itself down: it runs in the background and lets the Linux kernel decide when it may run. Because Strigi's indexer has the lowest possible CPU priority, the user does not notice it working. This is why it is 30x as fast as Beagle and 40x as fast as Tracker. It also uses 2.5 times less total CPU time than the number two, JIndex.
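For illustration, this is roughly how a process drops itself to the lowest scheduling priority on Linux. It is a minimal sketch of the general mechanism, not necessarily the exact code Strigi uses:

    #include <sys/resource.h>
    #include <cstdio>

    int main() {
        // Ask the kernel for the lowest CPU priority (nice value 19).
        // The scheduler then runs this process only when no higher-priority
        // work is waiting, so a busy indexer stays invisible to the user.
        if (setpriority(PRIO_PROCESS, 0, 19) != 0) {
            perror("setpriority");
            return 1;
        }
        // ... do the indexing work here ...
        return 0;
    }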

Second, the way Strigi extracts data is simply very efficient. And the good news is that the code that does this is available as a library under the LGPL. So the other search engines have no excuse for being so much slower. They too can be lightning fast, and I encourage them to apply the grease called libstreamindexer.
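To give an idea of what such stream-based extraction looks like, here is an illustrative sketch. The names are mine, not the actual libstreamindexer API, and the real interface differs in its details:

    #include <cstdint>

    // Hypothetical pull-based stream interface in the spirit of Strigi's
    // design; not the real libstreamindexer classes.
    class InputStream {
    public:
        virtual ~InputStream() {}
        // Expose up to 'max' bytes of the stream without copying them into a
        // caller-owned buffer. Returns the number of bytes made available,
        // 0 at end of stream, or a negative value on error.
        virtual int32_t read(const char*& start, int32_t max) = 0;
    };

    // The analyzer pulls the data through exactly once, so a file (or a file
    // inside a zip inside a tar) is processed in a single pass, without
    // temporary files or needless copies.
    void analyze(InputStream& in) {
        const char* data;
        int32_t n;
        while ((n = in.read(data, 4096)) > 0) {
            // ... feed 'data' (n bytes) to the text extractor and the index ...
        }
    }

The idea is that single-pass, zero-copy reading keeps extraction cheap: data is read once, and embedded documents are handled as substreams of the parent stream.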

This awesome speed bodes very well for KFileMetaInfo, the KDE class that provides metadata about files, since I'm currently working on letting it use Strigi as the source for that metadata.

I want to thank Michal and Steven for this great comparison!

Comments

I'm not sure how useful it

I'm not sure how useful it is to cite Table 5, particularly as, from memory, Beagle deliberately indexes over a long period of time so as to minimise the impact of the indexing.

Also, while the report lists a number of good points for Strigi, it also lists a number of issues that need to be addressed!

indeed

For sure! The report is very useful in showing problems too. The main point that must be addressed is the low result count. This is something I was totally unaware of. The reason for it is that I've not gotten around to writing unit tests to check search reliability. This is the first thing to pay attention to after the KDE metainfo work is finished.

Nevertheless, the overall impression is very good. Most negative points are rather vague and revolve around smaller issues. Please forgive me for being overjoyed at the huge speed differences.

Excellent

Strigi should get much better once those unit tests are done! How about also creating some way for users to help test it and add to the unit tests? Perhaps you could provide a utility program, strigidebug, that works like this:

$ strigidebug URI query

When run, strigidebug launches a daemon, creates a separate index for URI and passes 'query' to the daemon. It then prints out all matches for 'query' in URI. (And finally kills the daemon and deletes the temporary index.)

So if I, as a user, discover that strigi fails to report the file "foo.txt" as a match for the query "bar", then I can run strigidebug to verify that there is indeed a problem (as opposed to e.g. foo.txt not yet being in the index) and submit a clear bug report, like this:

"Running 'strigidebug foo.txt bar' prints out '[No match]'. Expected result: 'foo.txt'. Attached file: foo.txt"

You could then add this bug report as a unit test once the bug (if it really is a bug) is fixed. False positives could, of course, be treated in the same way.

Feedback

One thing that people complained about was that it may take 2 hours before the document I saved 10 minutes ago shows up in the results. That means I can't find a document even though it's there, which degrades the user experience, and people stop depending on search.

In other words: having a really fast turnaround between storing the file on disk and it being findable in the index is very important.

So, I'd say that the research is very relevant.

dataset/index size relation

Unless it's specific to this dataset and the algorithms the different search engines use, the comparison also shows that there's a need to slim down the index database. I know that such an index takes a lot of space, but 119 MB of index for 168 MB of data is quite a lot, especially if the index grows linearly or faster with larger datasets.

The complaints about CMake, non-ANSI/POSIX code and the GUI are also interesting.

re: dataset/index size relation

It's certainly possible to make the index smaller; there are two ways to do so. The main reason the index is relatively large is that the full text extracted from a file is stored in the index. This is not necessary for files that are on the local disk and could be avoided. Additionally, it is possible to gzip or bzip2 the full text in the cases where it does need to be stored.
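As a rough sketch of that second option, compressing the extracted text with zlib before storing it could look like this (an illustration of the idea, not Strigi's actual code):

    #include <zlib.h>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Compress extracted full text before it is written to the index.
    // Plain English text typically shrinks to a third of its size or less.
    std::vector<unsigned char> compressText(const std::string& text) {
        uLongf destLen = compressBound(text.size());
        std::vector<unsigned char> out(destLen);
        int rc = compress(&out[0], &destLen,
                          reinterpret_cast<const Bytef*>(text.data()),
                          text.size());
        if (rc != Z_OK) throw std::runtime_error("compress failed");
        out.resize(destLen);
        return out;
    }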

CMake is not available everywhere yet; for now I'm staying with version 2.4.3 as the minimum required version. There are binary packages for Strigi, and CMake has no dependencies of its own, so I do not agree that this is a problem.

The complaint about non-ANSI/POSIX code is not explained further, so I cannot act on it. To the best of my knowledge, Strigi compiles well on different compilers. One of the reporters did send me some patches for the Sun Forte compiler, which I have applied.

The GUI is not perfect, that is true. A good GUI is a matter of time. Strigi is mainly about the search daemon; the GUI that comes with it is mainly a showcase of features.

compress is speed

I recall that when reading from relatively slow media, fast compression can shift the actual time spent from disk latency to CPU time, which in many cases means that compressing can give faster search times.
So if you aim to decompress inside the load method, that would be worth some research, I'd say :)
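A minimal sketch of that trade-off, as a counterpart to the compression example above (again just an assumption about the storage layout, not actual Strigi code):

    #include <zlib.h>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Decompress on load: fewer bytes cross the slow disk, and the fast CPU
    // expands them in memory. 'originalSize' would be stored alongside the
    // compressed blob in the index.
    std::string loadText(const std::vector<unsigned char>& blob,
                         size_t originalSize) {
        std::string text(originalSize, '\0');
        uLongf destLen = originalSize;
        int rc = uncompress(reinterpret_cast<Bytef*>(&text[0]), &destLen,
                            &blob[0], blob.size());
        if (rc != Z_OK) throw std::runtime_error("uncompress failed");
        text.resize(destLen);
        return text;
    }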

Strigi -- best KDE app ever

Just wanted to let you know that I started using Strigi about six months ago (with the little strigi search box in the lower right of my desktop panel) and I was totally blown away. Since then I can't recall using grep or locate even once! Thanks for such a great application. I haven't been this excited over an app since the KIO slaves.