On 12/12/05, Marcus Engene <mengpg@xxxxxxxxx> wrote:
> > That's a simple case; what about languages such as Norwegian or German?
> > They have compound words, and the ispell dictionary can split them into
> > lexemes. But usually there is more than one variant of separation:
> >
> > forbruksvaremerkelov
> > forbruk vare merke lov
> > forbruk vare merkelov
> > forbruk varemerke lov
> > forbruk varemerkelov
> > forbruksvare merke lov
> > forbruksvare merkelov
> > (Notice: I don't know the translation, it's just an example. When we
> > were working on compound word support we found a word with 24 variants
> > of separation!)
> >
> > So the query 'a + forbruksvaremerkelov' would be awful:
> >
> > a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )
> >
> > Of course, that example is just off the top of my head, but a solution
> > for phrase search should handle such corner cases reasonably.
>
> (Sorry for replying in the wrong place in the thread, I was away on a
> trip and unsubscribed in the meantime.)
>
> I'm a Swede, and Swedish is similar to Norwegian and German. Take this
> example:
>
> lång hårig kvinna
> långhårig kvinna
>
> Words are put together to make a new word with a different meaning. The
> first example means "tall hairy woman" and the second "woman with long
> hair". If I were on, for example, a dating site, I'd want the
> distinction. ;-) If not, I should enter both strings:
>
> ("lång hårig" | långhårig) & kvinna
>
> ...which is perfectly acceptable.

Well, that certainly kills my initial naive implementation plan! :-)
Thank you for the explanation.

[thinking]

Well, if compound words should always be treated as the user has entered
them, then it seems that the current implementation may be doing the wrong
thing with regard to stemming compound words. If compound words are
decomposed into their constituent stems, then you'd get semantically, or at
least contextually, incorrect results, right? (Again, not an expert
here. :-) )

[thinking more...]

So, assuming that compound words should not be fully stemmed, because of
the way they are used to create new words with different meanings, if
step (4) were removed from my earlier plan then everything would continue
to work as proposed.

> IMHO I don't see any point in splitting these words.
>
> Let's go back to the subject. What about a syntax like this:
>
> idxfti @@ to_tsquery('default', 'pizza & (Chicago | [New York])')
>
> I.e. the exact-match string is always atomic. Wouldn't that be doable
> without any logical implications?

I think there are several ways that phrase matching can be done in a
logically consistent way. That is certainly one of them, and it takes the
focus off a single infix operator. TS2 already recognises grouping
operations via parens, and restricting brackets ([,]) to surrounding only
simple expressions (no '&', '|', '!' or '()') shouldn't be too hard.

However, I'd still prefer that proximity searches could be specified more
explicitly by the user. Using the above example:

  pizza & (Chicago | [New York])

becomes

  pizza & (Chicago | New + York)

which is implicitly

  pizza & (Chicago | New +{follows;dist=1} York)

and that is read as: "pizza, and Chicago, or 'new' followed by 'york' at a
distance of 1", where the modifier to the '+' operator could be specified
explicitly by the user if desired.
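To make that a bit more concrete, here is a sketch of how such a query
might look in practice. The syntax is purely hypothetical (the '[...]'
grouping and the '+{...}' modifier are only what is being proposed in this
thread, not anything TS2 accepts today), and the table "docs" with columns
"id", "title" and "idxfti" is invented for the example:

  -- hypothetical syntax; "docs" and its columns are made up for illustration
  SELECT id, title
    FROM docs
   WHERE idxfti @@ to_tsquery('default',
           'pizza & (Chicago | New +{follows;dist=1} York)');

  -- the shorthand form under the same proposal
  SELECT id, title
    FROM docs
   WHERE idxfti @@ to_tsquery('default', 'pizza & (Chicago | [New York])');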
While I understand and agree that "phrase searching" would be the most
common use for proximity + direction operator modifiers, I see things like
the '+' operator and '[]' groupings as special cases of a more generalized
restriction operation (or set thereof) based on the positional information
recorded in (unstripped) indexes; see the rough illustration in the P.S.
below.

Thoughts?

> Best regards,
> Marcus

-- 
Mike Rylander
mrylander@xxxxxxxxx
GPLS -- PINES Development
Database Developer
http://open-ils.org
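P.S. A rough illustration of the positional information I mean. The output
below is approximate and from memory, but to_tsvector() and strip() are the
existing TS2 functions; a stripped vector has no positions left for any
proximity or direction restriction to work against:

  SELECT to_tsvector('default', 'new york pizza');
  -- roughly:  'new':1 'pizza':3 'york':2     (positions retained)

  SELECT strip(to_tsvector('default', 'new york pizza'));
  -- roughly:  'new' 'pizza' 'york'           (positions discarded)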