Re: [HACKERS] Index greater than 8k

Darcy Buskermolen <darcyb@xxxxxxxxxxxxxxxxx> · Tue, 31 Oct 2006 10:49:37 -0800

On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> > The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
> > a self contained test case directly to  Teodor  which shows the error.
> >
> > 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
>
> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> pg_trgm is designed to find similar words and use technique known as
> trigrams. This will work good on small pieces of text such as words or set
> expression. But all big texts (on the same language) will be similar :(.
> So, I didn't take care about guarantee that index tuple's size limitation.
> In principle, it's possible to modify pg_trgm to have such guarantee, but
> index becomes lossy - all tuples gotten  from index should be checked by
> table's tuple evaluation.

The problem is some of the data we are working with is not strictly "text" but 
bytea that we've run through encode(bytea, 'escape'), and we've had to resort 
to trigrams in an attempt to mimic LIKE for searches.  From our findings 
tsearch2 does not match partial words,  in the same way that a LIKE would. ie 
col LIKE 'go%' would match good, gopher.  pg_tgrm will return those with the 
limit set appropriately, but tsearch2 does not.

>
> If you want to search similar documents I can recommend to have a look to
> fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty
> close to trigrams and metrics of similarity is the same, but uses another
> signature calculations. And, there are some tips and trics: removing HTML
> marking,removing punctuation, lowercasing text and so on - it's interesting
> and complex task.

-- 
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/