Search Postgresql Archives

Re: Storing and comparing columns of cryptographic hashes?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On Mon, Apr 8, 2024 at 10:08 AM Josh Triplett <josh@xxxxxxxxxxxxxxxx> wrote:


- Is there a way to tell postgresql "this column contains cryptographic
  hashes, so you can do hash joins using any subset of the bits, without
  having to hash them again"? If not, should there be?
 
if you know the specific subset of the bits from the hash value ahead of time, you can create an index on a  function to extract the subset of bits ahead of time by putting them in an index.  Consider

create table dd2 (id integer, hash_b bytea);  --create example table
insert into dd2 values (1, '/x1234568'::bytea);  --throw a record at it
create index on dd2 (substr ( hash_b,1,5));  -- create btree index on subset of value
select * from dd2 WHERE substr(hash_b,1,5) = '/x123';  --this query will use the index once there enough data to justify an index scan
 
- Is `bit(256)` the right type to use to store 32-byte hash values with
  no additional overhead?

Keep them in the bytea format,  can easily cast from bytea to hex for debugging purposes.

select encode(hash_b, 'hex') from dd2

The Larger the data size the slower all operations are best to keep them in same data type instead of casting in and out different data types 
 
- What would be the simplest way, given an input array of hashes (which
  I may have to pass in as an array and use `unnest`), to filter out all
  the values that already exist, *and* generate a corresponding bitmap
  in the same order for present/not-present for the entire array (to
  send back to the user)? Filtering seems easy enough, but generating
  the bitmap less so.

If you have unique index on the bytea hash,  you can unnest the array to a table, then insert values with ON CONFLICT   

CREATE unique index on dd2(hash_b);
INSERT into dd2 (hash_b) 
    (SELECT * from unnest(array['/x1234568','/x12345','/x12346', '/x12347', '/x12348' ]::bytea[]))
    ON CONFLICT (hash_b) DO NOTHING
RETURNING hash_b  -- this will  return the rows that got inserted

Then you can compare the inserted vs the inputted array to know what is present vs not present.  
 
- Does it make more sense to store the values as one row per value, or
  as one row per group of values? I know that postgresql can store an
  entire array in one column; could that efficiently support operations
  like "tell me which of these objects don't exist in any array in this
  column" or "for all of these objects, tell me all the group-id values
  for rows containing them"?

Putting them into arrays to keep them grouped together adds additional overhead to unnest for searching or join operations.  If the data is going to be large and have to scan over large data sets with  arrays containing many hash values thats allot of data processing.  Arrays can be indexed but have to use GIN index, which has many drawbacks compared to btree.  None of the above queries are possible with GIN indexes or using array columns without a lot more code.   

Arrays are not data sets if  the design needs to access a specific  hash value for update,delete, append new values, an array probably not the best solution.     
Hope this helps 
Justin 

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux