Question regarding fast-hashing in PGSQL

Stephen Conley <cheetah@xxxxxxxxxx> · Wed, 18 Sep 2019 12:41:28 -0400

Hey there;

I have a weird use case where I am basically taking data from many different sources and merging it into a single table, while trying to avoid duplicates as much as possible.  None of them share any kind of primary key, but I have determined 3 columns that, together, will almost always be unique so I am planning on using those 3 columns as a composite primary key.

Two of those columns are integers, which is great.  The third column is a string, UTF-8, which may be quite long (though probably no longer than 50 characters ... on average probably around 10 - 30 characters).  The strings could be practically anything, and they absolutely will not be unique on their own (these three data points are basically x, y coordinates and then some name...for a given x,y coordinate there may be multiple names, but the likihood of the same name at the same x, y is almost 0)

I really don't want to do a string comparison if possible because this DB will be getting very large, very quickly -- 20 million or so rows anticipated in the near future (i.e. next few weeks), with possible growth up to around 200 million (1+ year later).

My idea was to hash the string to a bigint, because the likelihood of all 3 columns colliding is almost 0, and if a duplicate does crop up, it isn't the end of the world.

However, Postgresql doesn't seem to have any 'native' hashing calls that result in a bigint.  The closest I've found is pgcrypto's 'digest' call -- I could theoretically take an md5 hash, and just use the first 8 bytes of it to make a bigint.

HOWEVER... there is no straight forward way to do this.  The most straight forward way I've seen is md5 -> hex string -> substring -> bigint.  This is ridiculous to me -- I'm basically converting binary to text, then converting text to binary.  However, you can't convert a bytea to a bigint in any fashion that I can tell so I have to eat a bunch of overhead for fun.

What would be the fastest way to do this?  I will be generating potentially a LOT of these keys so I want to do it the least dumb way.   I am using Digital Ocean's hosted PostgreSQL so I can't use my own C code -- but I can use PL/Psql, PL/Perl or any of these extensions:

https://www.digitalocean.com/docs/databases/postgresql/resources/supported-extensions/

If my concerns about string comparisons are unfounded and I'm working way too hard to avoid something that doesn't matter ... feel free to tell me that as well.  Basically, PostgreSQL performance guys, how would you tackle this one?

Thanks,

Stephen