Re: How to hash a large amount of data within Postgres?

"Peter J. Holzer" <hjp-pgsql@xxxxxx> · Wed, 23 Jun 2021 19:39:25 +0200

On 2021-06-21 15:53:09 +0200, Thorsten Schöning wrote:
> Some years ago I implemented some SQL to read all files, build a table
> of SHA256 hashes and tell me how much data is redundant. The goal was
> to have a look at which files share the same hash with different LOIDs
> and optionally change that, so that all those files are only stored
> once on the end.
> 
> While the approach was pretty naive, because it simply read all files
> into memory to calculate the hashes, I'm somewhat sure it worked in
> the past with Postgres 9.6. The executing server had enough free RAM
> available as well to process the at most ~4 GiB large files one after
> another.
> 
> I tried that SQL today with Postgres 11 on UB 18.04 and it failed:
> 
> > [Code: 0, SQL State: XX000]  FEHLER: invalid memory alloc request size 1898107949
> >   Wobei: PL/pgSQL-Funktion loid_hash_calc(oid,text)[...]
[...]
> I searched regaridng that issue and only found two relevant results:
> Corrupted rows for some reason and simply size restrictions when
> allocating memory. The latter is more likely than the former in my
> case, as the restrictions seems to be 1 GiB and I do have larger
> files.

1 GB is the maximum size of quite a few data structures in PostgreSQL. I
don't use PL/pgSQL, but I wouldn't be surprised if it was the maximum
size of whatever loread() returns (a bytea?). I would be surprised if
this limit was higher in version 9.6 than it is in version 11, however.

> I'm doing the following simply currently, because I didn't find any
> interfaces allowing to forward blocks of data, LOIDs, file descriptors
> or anything like that working with smaller buffers or alike.
> 
> > fd      := lo_open( loid,  INV_READ);
> > size    := lo_lseek(fd, 0, SEEK_END);
> > PERFORM    lo_lseek(fd, 0, SEEK_SET);
> 
> > hashBin := digest(loread(fd, size), algorithm);
> > hashHex := encode(hashBin,          'hex');
> 
> So, is there any way to work around the problem I have currently?

Normally, hash libararies have a way to feed chunks of data into a hash
computations to avoid having to keep the whole thing in RAM.
The pgcrypto extension seems to be lacking such functionality, however.

I would build something similar to a Merkle tree:

Choose a convenient chunk size (a few MB is probably ok), read the large
object in chunks of this size, computing the hash for each. Concatenate
all the hashes and compute the hash of that. Add intermediate levels if
the the concatenated hashes are still too large to fit in memory.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@xxxxxx         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
Attachment:
signature.asc

Description: PGP signature