On Thu, Feb 15, 2018, at 13:08, Sebastian Hagedorn wrote:
> Is the setting "search_skipdiacrit" in imapd.conf honored during the
> indexing or is that only relevant while searching? Given your comment
> regarding search normalization above I take it Umlaut characters are not
> considered diacriticals? It's not a huge issue, but as a German university
> it would be nice for our users if a search could distinguish between
> "hatte" and "hätte", as an example.

Cyrus does treat umlauts as diacritics (I was just handwaving that away
in my previous comment because of the default settings). The
search_skipdiacrit setting applies to both indexing and search.

As an example, let's append two emails to a mailbox. The body of
message 1 contains the German verb "gären". Message 2 contains the verb
"garen" (for the non-German speakers: these verbs mean two different
things).

With search_skipdiacrit set to true (the default), this is what lands in
the Xapian database:

    [...] Zgaren garen

and searches for "garen" and "gären" will both match both messages.

With search_skipdiacrit set to false, however, we get

    [...] Zgaren Zgären garen gären

and searches for "garen" and "gären" will only match the respective
messages.

I uploaded a new test to Cassandane that demonstrates this [1] (the
subject_isutf8 test case might also be of interest).

I'd simply deactivate search_skipdiacrit if you are sure that your users
will benefit from it. If in doubt, I would rather err on the safe side
and return false positives by skipping diacritics (the default).

There's more to say about the Z prefixes: Cyrus currently uses the
English stemmer for all text, so for non-English text the stemmed terms
are typically identical to their unstemmed input. While this might seem
odd, it's the best we can do without proper language detection for both
indexing and search. I implemented multi-language stem support in an
experimental feature branch, but haven't yet resolved the issues around
fingerprinting search queries.
There's an open issue to track this [2].

[1] https://github.com/cyrusimap/cassandane/blob/master/Cassandane/Cyrus/SearchFuzzy.pm#L403
[2] https://github.com/cyrusimap/cyrus-imapd/issues/72

> Just out of curiosity, how is the mapping between a Xapian docid and a
> message file on disk achieved? I played around with xapian-delve and the
> Perl example simplesearch.pl. When I search a term, I get a list of
> docid's, but how do I know which message that is?

In 3.x, Cyrus search stores an internal unique message id, called the
guid, as the docid in Xapian. The guid currently is a SHA-1 hash of the
raw message, which allows for deduplication and avoids re-indexing
messages that have already been seen. The conversations.db of a user
maps this guid to a list of mailbox:UID pairs.

Off the top of my head, there currently isn't an "official" way in Cyrus
to retrieve the mailbox:UID list for a given guid from outside the Cyrus
process. Depending on your use case, you could either:

1.) build your custom mapper on imap/conversations.h,
2.) use cvt_cyrusdb to dump the contents of a conversations.db into
    plain text, or
3.) use the JMAP layer to fetch the JMAP-formatted message or the raw
    message blob by id. For JMAP email, use the guid and prefix it with
    'M' in an Email/get method. For blobs, use 'G' as prefix.

Both prefixes are "unofficial": we might change the JMAP id scheme in
future releases. But I guess this isn't going to happen any time soon,
if ever.

Hope it helps,
Robert
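
To make the search_skipdiacrit example above concrete, here is a minimal
Python sketch of the idea. It is not Cyrus or Xapian code: the function
names are hypothetical, the whitespace tokenizer is only a stand-in for
real indexing, and stripping combining marks after Unicode NFD
decomposition is an assumed way to "skip diacritics", not necessarily
what Cyrus does internally. It reproduces the two outcomes described for
the "gären" / "garen" messages.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD and drop combining marks, so 'ä' becomes 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text, skip_diacrit):
    """Lowercase and NFC-normalize; optionally fold diacritics away."""
    text = unicodedata.normalize("NFC", text.lower())
    return strip_diacritics(text) if skip_diacrit else text


def build_index(messages, skip_diacrit):
    """Hypothetical indexer: maps message id -> set of normalized terms."""
    return {
        msg_id: {normalize(word, skip_diacrit) for word in body.split()}
        for msg_id, body in messages.items()
    }


def search(index, query, skip_diacrit):
    """The query goes through the same normalization as the indexed text."""
    term = normalize(query, skip_diacrit)
    return sorted(msg_id for msg_id, terms in index.items() if term in terms)


# Message 1 contains "gären", message 2 contains "garen".
messages = {1: "gären", 2: "garen"}

for skip in (True, False):
    index = build_index(messages, skip)
    for query in ("garen", "gären"):
        print(f"skip_diacrit={skip} query={query!r} -> {search(index, query, skip)}")
```

With skipping enabled both queries return messages 1 and 2; with it
disabled, "garen" returns only message 2 and "gären" only message 1. The
point of the sketch is that the same normalization has to be applied at
index time and at query time, which is why the setting affects both.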