On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@xxxxxxxxx> wrote: >On 12/10/19 3:11 PM, Erwin Brandstetter wrote: >> I am looking for stable hash functions producing 8-byte or 4-byte hashes >> from long text values in Postgres 10 or later. >> >> There is md5(), the result of which can be cast to uuid. This reliably >> produces practically unique, stable 16-byte values. I have usecases where >> an 8-byte or even 4-byte hash would be good enough to make collisions >> reasonably unlikely. (I can recheck on the full string) - and expression >> indexes substantially smaller. I could truncate md5 and cast back and >> forth, but that seems like a lot of wasted computation. Are there >> suggestions for text hash functions that are >> - fast >> - keep collisions to a minimum >> - stable across major Postgres versions (so expression indexes don't break) >> - croptographic aspect is not needed (acceptable, but no benefit) > >What about a CRC32 function? It's fast, and an SSE4 instruction has been in >Intel CPUs for about 10 years. On long text CRC will not be as discriminating as a real cryptohash, but it may be considerably faster to compute depending on the length of the text. Given that the OP can check the actual string in the event of a collision, it may work well enough. One way to make it more discriminating is to compute a pair of CRCs using different polynomials. Unfortunately the SSE4.2 CRC32 instruction uses a fixed polynomial. You can compute it twice using different initial values, but the result won't be as good as actually using different polynomials. I used to create 4-byte hashes by concatenating two 16-bit CRCs - one using the X-modem polynomial and the other using the CCITT polynomial. Since these gave very different results, it worked quite well for my use case where all the strings were guaranteed < 16KB. YMMV, George