Re: fts, compound words?

Mike Rylander <mrylander@xxxxxxxxx> · Thu, 8 Dec 2005 16:09:51 +0000

On 12/8/05, Teodor Sigaev <teodor@xxxxxxxxx> wrote:
> > (a + foo1 + bar) | (a + foo2 + bar)
>
> That's a simple case, but what about languages such as Norwegian or German? They
> have compound words, and an ispell dictionary can split them into lexemes. But
> usually there is more than one way to split a word:
>
> forbruksvaremerkelov
>         forbruk vare merke lov
>         forbruk vare merkelov
>         forbruk varemerke lov
>         forbruk varemerkelov
>         forbruksvare merke lov
>         forbruksvare merkelov
> (Note: I don't know the translation, it's just an example. When we were working on
> compound word support we found a word which has 24 variants of separation!!)
>
> So the query 'a + forbruksvaremerkelov' will be awful:
>
> a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )
>
> Of course, that is just an example off the top of my head, but a solution for
> phrase search should work reasonably with such corner cases.
>

WARNING: What follows is wild, hand-waving speculation, as I don't
fully understand the implications of compound words! ;-)

My naive impression is that it would be both possible and a good idea
to reduce any compound word to the split containing the most
individual lexemes.  As an analogy, this would be similar to
transforming composed (Normalization Form C) Unicode characters into
their decomposed (Normalization Form D) versions.
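To make the hand waving a bit more concrete, here is a minimal Python sketch of the idea, not anything tsearch2 actually does. The list of splits is copied from the forbruksvaremerkelov example quoted above; the function names (finest_split, expanded_query, canonical_query) are made up for illustration, and the '&' / '|' strings only mimic the notation used in this thread. It assumes the finest split is unique; ties are not handled.

```python
import unicodedata

# Candidate splits copied from the example quoted above. In practice an
# ispell-style dictionary would generate these; here they are hard-coded.
VARIANTS = [
    ["forbruk", "vare", "merke", "lov"],
    ["forbruk", "vare", "merkelov"],
    ["forbruk", "varemerke", "lov"],
    ["forbruk", "varemerkelov"],
    ["forbruksvare", "merke", "lov"],
    ["forbruksvare", "merkelov"],
]


def finest_split(variants):
    """Return the variant containing the most individual lexemes.

    Assumption: the maximum is unique. On a tie, max() simply returns
    the first candidate, which may not be what you want.
    """
    return max(variants, key=len)


def expanded_query(variants):
    """OR together every variant, as in the 'awful' query quoted above."""
    return " | ".join("(" + " & ".join(v) + ")" for v in variants)


def canonical_query(variants):
    """AND together only the lexemes of the finest split."""
    return " & ".join(finest_split(variants))


# The Unicode analogy: NFD breaks a composed character into its parts,
# here U+00E9 into 'e' followed by U+0301 (combining acute accent).
composed = "\u00e9"
decomposed = unicodedata.normalize("NFD", composed)
```

With the six variants above, canonical_query() yields a single AND branch of four lexemes, while expanded_query() yields six OR-ed branches. The sketch only helps if documents and queries are both reduced the same way, which is exactly the part I have not thought through.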