Re: [HACKERS] Index greater than 8k

Oleg Bartunov <oleg@xxxxxxxxxx> · Fri, 3 Nov 2006 11:54:05 +0300 (MSK)

Gregory,

All you described you do with fti is possible with tsearch2. Just need
some think, of course. If you don't need stemming, just don't use it,
if you need something like %txt%, just write simple dictionary, which 
produce any substrings from input word.

Oleg
On Tue, 31 Oct 2006, Gregory S. Williamson wrote:

I hesitate to mention it, since it's retrograde, uses OIDS, may not handle your locale/encoding correctly, may not scale well for what you need etc., etc.

But we've used fti (in the contrib package) to do fast searches for any bit of text in people's names ... we didn't go with tesearch2 because we were a bit worried about the need to search for fragments of names, and that names don't follow stemming rules and the like very well. Still it might be a way of handling some of the uglier data. It was a bit of a pain to set up but seems to work well. Of course, users can ask for something commonplace and get back gazillions of rows, but apparently that's ok for the application this is part of. Caveat: only about 32 million rows in this dataset, partitioned into unequal grouings (about 90 total).

HTH (but doubt it for reasons that undoubtedly be made clear ;-)

Greg Williamson
DBA
GlobeXplorer LLC

-----Original Message-----
From:	pgsql-general-owner@xxxxxxxxxxxxxx on behalf of Joshua D. Drake
Sent:	Tue 10/31/2006 7:46 PM
To:	Teodor Sigaev
Cc:	Darcy Buskermolen; PgSQL General; PostgreSQL-development
Subject:	Re: [HACKERS] [GENERAL] Index greater than 8k

Teodor Sigaev wrote:
The problem as I remember it is pg_tgrm not tsearch2 directly, I've
sent a self contained test case directly to  Teodor  which shows the
error.
'ERROR:  index row requires 8792 bytes, maximum size is 8191'
Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
pg_trgm is designed to find similar words and use technique known as
trigrams. This will  work good on small pieces of text such as words or
set expression. But all big texts (on the same language) will be similar
:(. So, I didn't take care about guarantee that index tuple's size
limitation. In principle, it's possible to modify pg_trgm to have such
guarantee, but index becomes lossy - all tuples gotten  from index
should be checked by table's tuple evaluation.

We are trying to get something faster than ~ '%foo%';

Which Tsearch2 does not give us :)

Joshua D. Drake

If you want to search similar documents I can recommend to have a look
to fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
pretty close to trigrams and metrics of similarity is the same, but uses
another signature calculations. And, there are some tips and trics:
removing HTML marking,removing punctuation, lowercasing text and so on -
it's interesting and complex task.

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@xxxxxxxxxx, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83