Re: Truncated text during Xapian indexing

Robert Stepanek <rsto@xxxxxxxxxxxxxxxx> · Thu, 15 Feb 2018 16:12:23 +0100

On Thu, Feb 15, 2018, at 13:08, Sebastian Hagedorn wrote:
> Is the setting "search_skipdiacrit" in imapd.conf honored during the 
> indexing or is that only relevant while searching? Given your comment 
> regarding search normalization above I take it Umlaut characters are not 
> considered diacriticals? It's not a huge issue, but as a German university 
> it would be nice for our users if a search could distinguish between 
> "hatte" and "hätte", as an example.

Cyrus does consider umlauts to be diacritics (I was just hand-waving that away in my previous comment because of the default settings). The search_skipdiacrit setting applies to both indexing and search.

As an example, let's append two emails to a mailbox. The body of message 1 contains the German verb "gären". Message 2 contains the verb "garen" (for the non-German speakers: these verbs mean two different things).

With search_skipdiacrit set to true (the default), this is what lands in the Xapian database:

   [...] Zgaren garen

and searches for "garen" and "gären" will both match both messages.

With search_skipdiacrit set to false, however, we get

  [...] Zgaren Zgären garen gären

and searches for "garen" and "gären" will each match only their respective message.
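
To make the difference concrete, here is a minimal Python sketch of the normalization step. It is not the Cyrus code: the function names are made up for illustration, and I am assuming that "skipping diacritics" can be approximated by Unicode canonical decomposition followed by dropping the combining marks.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks after canonical decomposition (e.g. ä -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_terms(words, skip_diacrit=True):
    """Hypothetical helper: lowercase each word and optionally strip diacritics."""
    terms = set()
    for word in words:
        term = word.lower()
        if skip_diacrit:
            term = strip_diacritics(term)
        terms.add(term)
    return terms


words = ["gären", "garen"]
# With skipping enabled both words collapse into a single term.
print(sorted(normalize_terms(words, skip_diacrit=True)))
# With skipping disabled the two words stay distinct terms.
print(sorted(normalize_terms(words, skip_diacrit=False)))
```

The same normalization has to run on the query string, which is why the setting matters for both indexing and search.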

I uploaded a new test to Cassandane that demonstrates this [1] (the subject_isutf8 test case might also be of interest). I'd just deactivate search_skipdiacrit if you are sure that your users will benefit from it. If in doubt, I would rather err on the safe side and return false positives by skipping diacritics (the default).

There's more to say about the Z prefixes, which is how Xapian marks stemmed terms: Cyrus currently uses the English stemmer for all text, so for non-English text the stemmed terms typically equal the unstemmed input. While this might seem odd, it's the best we can do without proper language detection for both indexing and search. I implemented multi-language stemming support in an experimental feature branch, but haven't yet resolved the issues around fingerprinting search queries. There's an open issue to track this [2].

[1] https://github.com/cyrusimap/cassandane/blob/master/Cassandane/Cyrus/SearchFuzzy.pm#L403
[2] https://github.com/cyrusimap/cyrus-imapd/issues/72

> Just out of curiosity, how is the mapping between a Xapian docid and a 
> message file on disk achieved? I played around with xapian-delve and the 
> Perl example simplesearch.pl. When I search a term, I get a list of 
> docid's, but how do I know which message that is?

In 3.x, Cyrus search stores an internal unique message id, called guid, as docid in Xapian. The guid currently is a SHA-1 hash of the raw message, which allows deduplication and avoids re-indexing messages that have already been seen. The conversations.db of a user maps this guid to a list of mailbox:UID pairs.
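
As a rough illustration of that scheme, the following Python sketch hashes raw messages and keeps a guid to mailbox:UID mapping in memory. The GuidIndex class and its methods are hypothetical and only model the idea; the real mapping lives in conversations.db and its on-disk format is not shown here.

```python
import hashlib
from collections import defaultdict


def message_guid(raw_message: bytes) -> str:
    """SHA-1 of the raw message bytes, as a hex string."""
    return hashlib.sha1(raw_message).hexdigest()


class GuidIndex:
    """Hypothetical in-memory stand-in for the guid -> mailbox:UID mapping."""

    def __init__(self):
        self.locations = defaultdict(list)  # guid -> [(mailbox, uid), ...]
        self.indexed = set()                # guids already sent to the indexer

    def add(self, mailbox, uid, raw_message):
        """Record a copy of a message; report whether it still needs indexing."""
        guid = message_guid(raw_message)
        self.locations[guid].append((mailbox, uid))
        if guid in self.indexed:
            return guid, False  # same content seen before, skip re-indexing
        self.indexed.add(guid)
        return guid, True


index = GuidIndex()
raw = b"Subject: test\r\n\r\ngaren\r\n"
guid, needs_indexing = index.add("INBOX", 1, raw)
_, needs_indexing_again = index.add("Archive", 7, raw)
print(guid, needs_indexing, needs_indexing_again, index.locations[guid])
```

A search hit on a guid then resolves to every mailbox:UID pair that holds an identical copy of the message.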

Off the top of my head, there currently isn't an "official" way in Cyrus to retrieve the mailbox:UID list for a given guid from outside the Cyrus process. Depending on your use case, you could:

1.) build a custom mapper on top of imap/conversations.h,
2.) use cvt_cyrusdb to dump the contents of a conversations.db into plain text, or
3.) use the JMAP layer to fetch either the JMAP-formatted message or the raw message blob by id.

For JMAP email, prefix the guid with 'M' and use it as the id in an Email/get method call. For blobs, use 'G' as the prefix. Both are "unofficial": we might change the JMAP id scheme in future releases. But I guess this isn't going to happen any time soon, if ever.
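
For the JMAP route, a request body could be put together roughly like this in Python. Treat it as a sketch under assumptions: the account id is a placeholder, the function names are invented, and the capability URNs in "using" are the ones from the published JMAP specifications, which your server version may or may not expect. The code only builds the JSON; it does not talk to a server.

```python
import json


def email_id(guid):
    """JMAP email id as described above: 'M' followed by the guid."""
    return "M" + guid


def blob_id(guid):
    """JMAP blob id as described above: 'G' followed by the guid."""
    return "G" + guid


def email_get_request(account_id, guid):
    """Build an Email/get request body for a single guid (hypothetical helper)."""
    return {
        # Assumption: capability URNs as in the published JMAP specifications.
        "using": ["urn:ietf:params:jmap:core", "urn:ietf:params:jmap:mail"],
        "methodCalls": [
            ["Email/get", {"accountId": account_id, "ids": [email_id(guid)]}, "c0"],
        ],
    }


# Placeholder values for illustration only.
guid = "a9993e364706816aba3e25717850c26c9cd0d89d"
print(json.dumps(email_get_request("someuser", guid), indent=2))
print(blob_id(guid))
```

Since the id scheme is unofficial, it is probably wise to keep the prefixing in one small helper like this so it can be changed in one place.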

Hope it helps,
Robert