On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> > The problem as I remember it is pg_trgm not tsearch2 directly, I've sent
> > a self contained test case directly to Teodor which shows the error.
> >
> > 'ERROR: index row requires 8792 bytes, maximum size is 8191'
>
> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> pg_trgm is designed to find similar words and use technique known as
> trigrams. This will work good on small pieces of text such as words or set
> expression. But all big texts (on the same language) will be similar :(.
> So, I didn't take care about guarantee that index tuple's size limitation.
> In principle, it's possible to modify pg_trgm to have such guarantee, but
> index becomes lossy - all tuples gotten from index should be checked by
> table's tuple evaluation.

The problem is that some of the data we are working with is not strictly
"text" but bytea that we've run through encode(bytea, 'escape'), and we've
had to resort to trigrams in an attempt to mimic LIKE for searches. From
our findings, tsearch2 does not match partial words the way a LIKE would;
i.e. col LIKE 'go%' would match "good" and "gopher". pg_trgm will return
those with the limit set appropriately, but tsearch2 does not.

>
> If you want to search similar documents I can recommend to have a look to
> fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty
> close to trigrams and metrics of similarity is the same, but uses another
> signature calculations. And, there are some tips and tricks: removing HTML
> marking, removing punctuation, lowercasing text and so on - it's interesting
> and complex task.

-- 
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/
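
For readers following the thread, here is a rough sketch of the trigram idea discussed above, written in Python rather than as the actual pg_trgm code. It assumes the usual padding convention (two spaces before each word, one after) and a shared-over-total similarity measure. The function names trigrams, similarity and prefix_candidates are made up for this illustration and are not part of pg_trgm.

```python
import re


def trigrams(text):
    """Return the set of trigrams in text.

    Assumption: each lowercased word is padded with two leading spaces
    and one trailing space before being cut into 3-character windows.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both inputs."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def prefix_candidates(prefix, texts):
    """Hypothetical helper: texts whose trigram set contains every
    trigram of the (left-padded) prefix, mimicking LIKE 'prefix%'
    at the word level. No trailing pad, because the word may continue.
    """
    padded = "  " + prefix.lower()
    needed = {padded[i:i + 3] for i in range(len(padded) - 2)}
    return [t for t in texts if needed <= trigrams(t)]


if __name__ == "__main__":
    words = ["good", "gopher", "dog", "ago"]
    print(prefix_candidates("go", words))
    print(similarity("good", "gopher"))
```

The set of distinct trigrams grows with the amount of text, which is one way to see why a very large value can produce an index entry that exceeds the row size limit, and why long texts in the same language end up sharing most of their trigrams.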