Re: index type for indexing long texts

Tom Lane <tgl@xxxxxxxxxxxxx> · Sat, 13 Jan 2007 11:45:09 -0500

Aleksander Kmetec <aleksander.kmetec@xxxxxxxxx> writes:
> I'm looking for a solution for indexing long TEXT columns. We're currently using a HASH index, which can handle most 
> situations, but every now and then we need support for even longer texts.

> One solution would be to create a functional index which would only use the first N chars of mycol, but then we'd have 
> to change several hundred occurences of "mycol = someval" with "(mycol = someval AND firstN(mycol) = firstN(someval))", 
> as well as update some SQL generators...

> That's why I'd be interested to know if there are any index types available which store only the first N chars or use 
> some highly compressed form for storing index data, and then recheck any potential hits against the main table. And if 
> something like that does not exist yet, how difficult would it be to construct such a solution out of many "spare parts" 
> that come with PG?

I think you could do it with GiST.  Look at contrib/btree_gist, and
make a variant version of its text support in which only the first N
characters are stored/compared, then declare all the operators as
RECHECK.  (Actually, I'm not sure that "<" etc would work correctly in
such a scenario, especially in non-C locales; it might be best to
declare the operator class as containing only "=".)

I don't think you can do it in btree or hash because they assume that
the operators are strictly consistent with the support functions, and
in fact apply the operators directly in some code paths.  So you
couldn't use true text equality as the "=" operator, which ruins your
chances of not changing your queries.  But GiST never touches the
operators at all, just the support procs, and you can make the support
procs do anything.  So it should be possible to make the index work with
just the first N characters, and then the RECHECK with true equality
would filter out the wrong matches.

BTW, if you get something that works well, stick it up on pgfoundry;
you're not the first to have asked this ...

			regards, tom lane