On 12/12/05, Marcus Engene <mengpg@xxxxxxxxx> wrote:
> > That's a simple case; what about languages such as Norwegian or German?
> > They have compound words, and the ispell dictionary can split them into
> > lexemes. But usually there is more than one variant of separation:
> >
> > forbruksvaremerkelov
> > forbruk vare merke lov
> > forbruk vare merkelov
> > forbruk varemerke lov
> > forbruk varemerkelov
> > forbruksvare merke lov
> > forbruksvare merkelov
> > (Notice: I don't know the translation, it's just an example. When we
> > were working on compound word support we found a word with 24 variants
> > of separation!)
> >
> > So the query 'a + forbruksvaremerkelov' would be awful:
> >
> > a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )
> >
> > Of course, that example is just off the top of my head, but a solution
> > for phrase search should handle such corner cases reasonably.
>
> (Sorry for replying in the wrong place in the thread, I was away on a
> trip and unsubscribed in the meantime.)
>
> I'm a Swede, and Swedish is similar to Norwegian and German. Take this
> example:
>
> lång hårig kvinna
> långhårig kvinna
>
> Words are put together to make a new word with a different meaning. The
> first example means "tall hairy woman" and the second "woman with long
> hair". If I were on, for example, a dating site, I'd want the
> distinction. ;-) If not, I should enter both strings:
>
> ("lång hårig" | långhårig) & kvinna
>
> ...which is perfectly acceptable.

Well, that certainly kills my initial naive implementation plan! :-)
Thank you for the explanation.

[thinking]

Well, if compound words should always be treated as the user has entered
them, then it seems that the current implementation may be doing the wrong
thing with regard to stemming compound words. If compound words are
decomposed into their constituent stems, then you'd get semantically, or at
least contextually, incorrect results, right? (Again, not an expert
here. :-) )

[thinking more...]

So, assuming that compound words should not be fully stemmed, because of
the way they are used to create new words with different meanings, if
step (4) were removed from my earlier plan then everything would continue
to work as proposed.

> IMHO I don't see any point in splitting these words.
>
> Let's go back to the subject. What about a syntax like this:
>
> idxfti @@ to_tsquery('default', 'pizza & (Chicago | [New York])')
>
> I.e. the exact-match string is always atomic. Wouldn't that be doable
> without any logical implications?

I think there are several ways that phrase matching can be done in a
logically consistent way. That is certainly one of them, and it takes the
focus off a single infix operator. TS2 already recognises grouping
operations via parens, and restricting brackets ([,]) to surrounding only
simple expressions (no '&', '|', '!' or '()') shouldn't be too hard.

However, I'd still prefer that proximity searches could be specified more
explicitly by the user. Using the above example:

  pizza & (Chicago | [New York])

becomes

  pizza & (Chicago | New + York)

which is implicitly

  pizza & (Chicago | New +{follows;dist=1} York)

and that is read as: "pizza, and Chicago, or 'new' followed by 'york' at a
distance of 1", where the modifier to the '+' operator could be specified
explicitly by the user if desired.
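To make that a bit more concrete, here is a sketch of how such a query
might look in practice. The syntax is purely hypothetical (the '[...]'
grouping and the '+{...}' modifier are only what is being proposed in this
thread, not anything TS2 accepts today), and the table "docs" with columns
"id", "title" and "idxfti" is invented for the example:

  -- hypothetical syntax; "docs" and its columns are made up for illustration
  SELECT id, title
    FROM docs
   WHERE idxfti @@ to_tsquery('default',
           'pizza & (Chicago | New +{follows;dist=1} York)');

  -- the shorthand form under the same proposal
  SELECT id, title
    FROM docs
   WHERE idxfti @@ to_tsquery('default', 'pizza & (Chicago | [New York])');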
While I understand and agree that "phrase searching" would be the most
common use for proximity + direction operator modifiers, I see things like
the '+' operator and '[]' groupings as special cases of a more generalized
restriction operation (or set thereof) based on the positional information
recorded in (unstripped) indexes; see the rough illustration in the P.S.
below.

Thoughts?

> Best regards,
> Marcus

-- 
Mike Rylander
mrylander@xxxxxxxxx
GPLS -- PINES Development
Database Developer
http://open-ils.org
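P.S. A rough illustration of the positional information I mean. The output
below is approximate and from memory, but to_tsvector() and strip() are the
existing TS2 functions; a stripped vector has no positions left for any
proximity or direction restriction to work against:

  SELECT to_tsvector('default', 'new york pizza');
  -- roughly:  'new':1 'pizza':3 'york':2     (positions retained)

  SELECT strip(to_tsvector('default', 'new york pizza'));
  -- roughly:  'new' 'pizza' 'york'           (positions discarded)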