Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

George Neuner <gneuner2@xxxxxxxxxxx> · Sun, 15 Dec 2019 16:59:30 -0500

On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@xxxxxxxxx>
wrote:

>On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
>> I am looking for stable hash functions producing 8-byte or 4-byte hashes 
>> from long text values in Postgres 10 or later.
>>
>> There is md5(), the result of which can be cast to uuid. This reliably 
>> produces practically unique, stable 16-byte values. I have usecases where 
>> an 8-byte or even 4-byte hash would be good enough to make collisions 
>> reasonably unlikely. (I can recheck on the full string) - and expression 
>> indexes substantially smaller. I could truncate md5 and cast back and 
>> forth, but that seems like a lot of wasted computation. Are there 
>> suggestions for text hash functions that are
>> - fast
>> - keep collisions to a minimum
>> - stable across major Postgres versions (so expression indexes don't break)
>> - croptographic aspect is not needed (acceptable, but no benefit)
>
>What about a CRC32 function?  It's fast, and an SSE4 instruction has been in 
>Intel CPUs for about 10 years.

On long text CRC will not be as discriminating as a real cryptohash,
but it may be considerably faster to compute depending on the length
of the text.  Given that the OP can check the actual string in the
event of a collision, it may work well enough.

One way to make it more discriminating is to compute a pair of CRCs
using different polynomials.  Unfortunately the SSE4.2 CRC32
instruction uses a fixed polynomial.  You can compute it twice using
different initial values, but the result won't be as good as actually
using different polynomials.

I used to create 4-byte hashes by concatenating two 16-bit CRCs - one
using the X-modem polynomial and the other using the CCITT polynomial.
Since these gave very different results, it worked quite well for my
use case where all the strings were guaranteed < 16KB.

YMMV,
George