On Mon, Apr 8, 2024 at 10:08 AM Josh Triplett <josh@xxxxxxxxxxxxxxxx> wrote:
- Is there a way to tell postgresql "this column contains cryptographic hashes, so you can do hash joins using any subset of the bits, without having to hash them again"? If not, should there be?
No, and no. (if I understand your question correctly). You could use a functional index, I suppose, but seems premature optimization.
- Is `bit(256)` the right type to use to store 32-byte hash values with no additional overhead?
No, you would want bytea. I would store the value in a TEXT field, unless you really worried about space savings. The hexadecimal value will be far easier to debug and work with, and you can use a simple b-tree index.
- What would be the simplest way, given an input array of hashes (which
I may have to pass in as an array and use `unnest`), to filter out all
the values that already exist, *and* generate a corresponding bitmap
in the same order for present/not-present for the entire array (to
send back to the user)? Filtering seems easy enough, but generating
the bitmap less so.
Something like this:
SELECT array_agg(case when t.bhash is null then 1 else 0 end)
from unnest(array['blakehash1', 'blakehash2', etc...]) as a(x)
left join mytable t on t.bhash = a.x;
- Does it make more sense to store the values as one row per value, or
as one row per group of values?
Hard to answer without knowing more, but I'd lean towards simple and one row per value.
Your proposal (query db, do external work, update db) also sets of lots of concurrency red flags, so be mindful of that.
Cheers,
Greg