On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> Hi,
>
> I have an application which stores large amounts of hex-encoded hash
> strings (nearly 100 GB of them), which means:
>
> - The number of distinct characters (alphabet) is limited to 16
> - Each string is of the same length, 64 characters
> - The strings are essentially random
>
> Creating a B-tree index on this results in the index being larger than
> the table itself, and there are disk space constraints.
>
> I've found the SP-GiST radix tree index and thought it could be a good
> match for the data because of the above constraints. An attempt to
> create it (as in CREATE INDEX ON t USING spgist(field_name)) apparently
> takes more than 12 hours (while a similar B-tree index takes a few
> hours at most), so I've interrupted it because "it probably is not
> going to finish in a reasonable time". Some slides I found on the
> SP-GiST index suggest that neither its build time nor its size is
> suitable for this purpose.
>
> My question is: what would be the most size-efficient index for this
> situation?

Hi Ivan,

If the strings are really random, then a functional index on just the
first 4, 8, or 16 characters could narrow the search space enough that
you do not need to index all 64.

If they are not "good" random values, you could use a hash index on the
strings instead. It will be much smaller, since it currently stores only
a 32-bit hash per entry. It has a number of caveats and is not currently
crash-safe, but it seems like it might work in your environment.

You can also get crash safety by using a functional B-tree index on a
hash function applied to your values.

Rough, untested sketches of all three approaches follow.
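First, the prefix idea. The table and column names (t, field_name) are
taken from your CREATE INDEX statement; the index name and the example
hash value are just placeholders of mine. Keeping 16 of the 64 hex
characters preserves 64 bits of the hash, so at your data volume prefix
collisions should be vanishingly rare, and the full-column equality
check below filters out any that do occur:

    -- Index only the first 16 characters: 16-byte keys instead of 64.
    CREATE INDEX t_prefix_idx ON t (left(field_name, 16));

    -- The query must repeat the indexed expression so the planner can
    -- use the index; the second condition rechecks the full value.
    SELECT *
      FROM t
     WHERE left(field_name, 16) = left('e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 16)
       AND field_name = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855';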
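The hash index variant is a one-liner, and queries need no rewriting:
plain equality lookups on field_name use it as-is. The caveats mentioned
above apply: hash indexes are currently not WAL-logged, so they must be
rebuilt after a crash and cannot be relied on for replication:

    -- 32-bit keys regardless of string length; equality lookups only.
    CREATE INDEX t_hash_idx ON t USING hash (field_name);

    SELECT *
      FROM t
     WHERE field_name = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855';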
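Finally, the crash-safe version: a plain B-tree over a 32-bit hash of
the value. This sketch uses hashtext(), which is PostgreSQL's built-in
but undocumented internal hash function, so treat that choice as my
assumption; a truncated md5() would work the same way if you prefer
something documented. As before, the query has to repeat the expression
and recheck the full value to weed out hash collisions:

    -- A regular, WAL-logged B-tree over 4-byte hash values.
    CREATE INDEX t_hashfn_idx ON t (hashtext(field_name));

    SELECT *
      FROM t
     WHERE hashtext(field_name) = hashtext('e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855')
       AND field_name = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855';

In all three cases the space saving comes from shrinking the index keys
from 64 bytes down to 4-16 bytes.

Regards,
Ken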