Re: [HACKERS] Index greater than 8k


The problem, as I remember it, is pg_trgm, not tsearch2 directly. I've sent a self-contained test case directly to Teodor which shows the error.
'ERROR:  index row requires 8792 bytes, maximum size is 8191'
Uh, I see. But I'm really surprised that you use pg_trgm on big texts. pg_trgm is designed to find similar words and uses a technique known as trigrams. It works well on small pieces of text such as words or set expressions. But all big texts (in the same language) will look similar :(. So I didn't take care to guarantee that an index tuple stays within the size limit. In principle, it's possible to modify pg_trgm to provide such a guarantee, but then the index becomes lossy: every tuple returned from the index has to be rechecked against the table's tuple.
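
To make the trigram idea concrete, here is a minimal Python sketch. It only approximates what pg_trgm does (lowercase, split into alphanumeric words, pad each word with two leading spaces and one trailing space, compare trigram sets by shared/total); the function names `trigrams` and `similarity` are made up for this illustration and are not the module's internal code.

```python
import re


def trigrams(text):
    """Return the set of trigrams of `text`, roughly following pg_trgm's conventions."""
    grams = set()
    # Lowercase, then treat every run of ASCII letters/digits as one word.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
    print(similarity("word", "words"))
```

For a single word the set is tiny, but `len(trigrams(doc))` keeps growing with the amount of distinct text in `doc`, while two long texts in the same language share more and more of the common trigrams.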

If you want to search for similar documents, I can recommend having a look at the fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty close to trigrams and the similarity metric is the same, but it uses a different signature calculation. And there are some tips and tricks: removing HTML markup, removing punctuation, lowercasing the text and so on. It's an interesting and complex task.
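
A rough Python sketch of that kind of pipeline follows. It is not the algorithm from the paper, only an illustration under stated assumptions: the text is normalized (HTML tags and punctuation dropped, lowercased, ASCII only), every fixed-size character window is hashed, and only a sample of the hashes is kept as the document's fingerprint set. The names `normalize`, `fingerprints` and `resemblance`, the window size, the modulus and the use of SHA-256 are all choices made for this example.

```python
import hashlib
import re


def normalize(text):
    """Strip HTML tags and punctuation, lowercase, collapse whitespace (ASCII only)."""
    text = re.sub(r"<[^>]*>", " ", text)       # drop HTML tags
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation
    return " ".join(text.split())              # collapse runs of whitespace


def fingerprints(text, window=8, modulus=4):
    """Hash every `window`-character substring and keep hashes divisible by `modulus`."""
    norm = normalize(text)
    prints = set()
    for i in range(len(norm) - window + 1):
        chunk = norm[i:i + window].encode("utf-8")
        h = int.from_bytes(hashlib.sha256(chunk).digest()[:8], "big")
        if h % modulus == 0:  # sampling keeps the fingerprint set small
            prints.add(h)
    return prints


def resemblance(a, b, window=8, modulus=4):
    """Same metric as for trigrams: shared fingerprints over all fingerprints."""
    fa = fingerprints(a, window, modulus)
    fb = fingerprints(b, window, modulus)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because only a sample of the hashes is stored, the signature per document stays far smaller than the full trigram set, at the price of a coarser comparison; with `modulus=1` every window is kept.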
--
Teodor Sigaev                                   E-mail: teodor@xxxxxxxxx
                                                   WWW: http://www.sigaev.ru/

