On 12/8/05, Teodor Sigaev <teodor@xxxxxxxxx> wrote:
> > (a + foo1 + bar) | (a + foo2 + bar)
>
> That's a simple case, but what about languages such as Norwegian or German?
> They have compound words, and an ispell dictionary can split them into
> lexemes. But usually there is more than one variant of separation:
>
> forbruksvaremerkelov
>     forbruk vare merke lov
>     forbruk vare merkelov
>     forbruk varemerke lov
>     forbruk varemerkelov
>     forbruksvare merke lov
>     forbruksvare merkelov
> (Note: I don't know the translation, it's just an example. When we were
> working on compound word support we found a word which has 24 variants of
> separation!!)
>
> So the query 'a + forbruksvaremerkelov' will be awful:
>
> a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )
>
> Of course, that is an example just off the top of my head, but a solution
> for phrase search should work reasonably with such corner cases.
>

WARNING: What follows is wild, hand-waving speculation, as I don't fully
understand the implications of compound words! ;-)

My naive impression is that it would be both possible and a good idea to
stem any compound word to the version containing the most individual
lexemes. As an analogy, this would be similar to transforming composed
(Normalization Form C) UTF-8 characters into their decomposed
(Normalization Form D) versions.
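
To make the speculation above a little more concrete, here is a minimal Python sketch. It rests on several assumptions. The split variants are hard-coded from the quoted example rather than produced by a real ispell dictionary. The function names (expand_all, most_decomposed, finest_query) are hypothetical. Ties between equally fine splits are not handled. It only contrasts the "OR of every variant" expansion with the "keep the most decomposed variant" idea, and shows the NFC to NFD analogy using the standard unicodedata module.

```python
import unicodedata

# Split variants copied from the quoted example. In practice these would
# come from the dictionary; hard-coding them here is an assumption.
VARIANTS = [
    ["forbruk", "vare", "merke", "lov"],
    ["forbruk", "vare", "merkelov"],
    ["forbruk", "varemerke", "lov"],
    ["forbruk", "varemerkelov"],
    ["forbruksvare", "merke", "lov"],
    ["forbruksvare", "merkelov"],
]


def expand_all(variants):
    """Build the OR-of-ANDs expansion that grows with every extra variant."""
    return " | ".join("(" + " & ".join(v) + ")" for v in variants)


def most_decomposed(variants):
    """Pick the variant with the most lexemes (first one wins on a tie)."""
    return max(variants, key=len)


def finest_query(variants):
    """Build a query from the most decomposed variant only."""
    return " & ".join(most_decomposed(variants))


if __name__ == "__main__":
    # Six AND-groups joined by OR: this is the "awful" query.
    print(expand_all(VARIANTS))

    # A single AND-group using the finest split.
    print(finest_query(VARIANTS))  # forbruk & vare & merke & lov

    # The analogy: NFD breaks a composed character into base + combining mark.
    composed = "\u00e5"  # LATIN SMALL LETTER A WITH RING ABOVE
    decomposed = unicodedata.normalize("NFD", composed)
    print(len(composed), len(decomposed))  # 1 2
```

For this to work, documents would have to be indexed with the same finest split, so that both sides of the match agree on the lexemes. Whether that is acceptable for ranking and for words whose finest split changes the meaning is exactly the part I am unsure about.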