Re: [HACKERS] Index greater than 8k

"Gregory S. Williamson" <gsw@xxxxxxxxxxxxxxxx> · Tue, 31 Oct 2006 20:36:55 -0800

I hesitate to mention it, since it's retrograde, uses OIDS, may not handle your locale/encoding correctly, may not scale well for what you need etc., etc.

But we've used fti (in the contrib package) to do fast searches for any bit of text in people's names ... we didn't go with tsearch2 because we were a bit worried about the need to search for fragments of names, and because names don't follow stemming rules and the like very well. Still, it might be a way of handling some of the uglier data. It was a bit of a pain to set up but seems to work well. Of course, users can ask for something commonplace and get back gazillions of rows, but apparently that's OK for the application this is part of. Caveat: only about 32 million rows in this dataset, partitioned into unequal groupings (about 90 total).
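
For anyone unfamiliar with the approach, here is a rough Python sketch of the general idea as I understand it: index every suffix of every word, so that a search for an arbitrary fragment becomes an anchored prefix lookup. This is not the contrib code; the function names, the in-memory dict, and the minimum suffix length of 2 are all assumptions made for illustration.

```python
import bisect
from collections import defaultdict


def word_suffixes(text, min_len=2):
    """Yield every suffix (at least min_len characters long) of every word in text."""
    for word in text.lower().split():
        for start in range(len(word) - min_len + 1):
            yield word[start:]


def build_index(rows, min_len=2):
    """Map each suffix to the set of row ids containing it (stand-in for a side table)."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for suffix in word_suffixes(text, min_len):
            index[suffix].add(row_id)
    return index


def search_fragment(index, fragment):
    """Return ids of rows where some word contains fragment.

    A fragment inside a word is a prefix of one of that word's suffixes,
    so a range scan over the sorted keys is enough (much like a btree
    prefix scan would be in the database).
    """
    fragment = fragment.lower()
    keys = sorted(index)
    hits = set()
    pos = bisect.bisect_left(keys, fragment)
    while pos < len(keys) and keys[pos].startswith(fragment):
        hits |= index[keys[pos]]
        pos += 1
    return hits


if __name__ == "__main__":
    rows = {1: "Williamson", 2: "Drake", 3: "William Sigaev"}
    print(search_fragment(build_index(rows), "liam"))
```

The cost is visible in the sketch: each word contributes roughly one index entry per character, which is part of why I doubt it scales to large text.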

HTH (but doubt it, for reasons that will undoubtedly be made clear ;-)

Greg Williamson
DBA
GlobeXplorer LLC

-----Original Message-----
From:	pgsql-general-owner@xxxxxxxxxxxxxx on behalf of Joshua D. Drake
Sent:	Tue 10/31/2006 7:46 PM
To:	Teodor Sigaev
Cc:	Darcy Buskermolen; PgSQL General; PostgreSQL-development
Subject:	Re: [HACKERS] [GENERAL] Index greater than 8k

Teodor Sigaev wrote:
>> The problem as I remember it is pg_trgm, not tsearch2 directly. I've
>> sent a self-contained test case directly to Teodor which shows the
>> error.
>> 'ERROR:  index row requires 8792 bytes, maximum size is 8191'
> Uh, I see. But I'm really surprised that you use pg_trgm on big text.
> pg_trgm is designed to find similar words and uses a technique known as
> trigrams. This works well on small pieces of text such as words or
> set expressions. But all big texts (in the same language) will be similar
> :(. So, I didn't take care to guarantee the index tuple's size
> limitation. In principle, it's possible to modify pg_trgm to have such a
> guarantee, but the index becomes lossy - all tuples returned from the index
> should be rechecked against the table's tuple.
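
To make the trigram point concrete, here is a minimal Python sketch of trigram extraction and similarity. It only approximates what pg_trgm does (lowercase, pad each word with two leading spaces and one trailing space, compare shared trigrams to total distinct trigrams); the function names are illustrative, and the real module's handling of non-alphanumeric characters and encodings may differ.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, padding each word as described above."""
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
    print(similarity("williamson", "wiliamson"))
    # The trigram set grows with the amount of distinct text, so a long
    # document produces a far larger set than a single word does.
    print(len(trigrams("word")), len(trigrams("many different words in one text")))
```

Two long texts in the same language end up sharing most of the common trigrams, which is the "all big texts will be similar" effect described above.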

We are trying to get something faster than ~ '%foo%';

Which Tsearch2 does not give us :)

Joshua D. Drake

> 
> If you want to search for similar documents I can recommend having a look
> at the fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
> pretty close to trigrams and the similarity metric is the same, but it uses
> a different signature calculation. And there are some tips and tricks:
> removing HTML markup, removing punctuation, lowercasing text and so on -
> it's an interesting and complex task.
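
A hedged Python sketch of what such fingerprinting might look like, assuming it means hashing fixed-length substrings of the normalized text and keeping only a sampled subset of the hashes. The substring length `k`, the sampling `modulus`, the use of CRC32, and all function names are assumptions for illustration, not details taken from the linked report.

```python
import re
import zlib


def normalize(text):
    """Apply the preprocessing tips: strip HTML tags and punctuation, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML markup
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    return " ".join(text.lower().split())  # lowercase, collapse whitespace


def fingerprints(text, k=8, modulus=4):
    """Hash every k-character substring; keep hashes divisible by modulus."""
    norm = normalize(text)
    prints = set()
    for i in range(len(norm) - k + 1):
        value = zlib.crc32(norm[i:i + k].encode("utf-8"))
        if value % modulus == 0:
            prints.add(value)
    return prints


def resemblance(a, b, k=8, modulus=4):
    """Same metric as for trigrams: shared fingerprints over all fingerprints."""
    fa, fb = fingerprints(a, k, modulus), fingerprints(b, k, modulus)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    print(normalize("<b>Hello, World!</b>"))
    print(resemblance("The quick brown fox", "the QUICK brown fox!", k=4, modulus=1))
```

Because only a sample of the hashes is kept, the signature stays much smaller than a full trigram set for a large document, at the price of a coarser similarity estimate.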

-- 

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
