Re: md5(large_object_id)

"Daniel Verite" <daniel@xxxxxxxxxxxxxxxx> · Tue, 06 Oct 2015 14:32:19 +0200

	Karsten Hilbert wrote:

> On Mon, Oct 05, 2015 at 03:27:26PM +0000, Kevin Grittner wrote:
> 
> > Karsten Hilbert <Karsten.Hilbert@xxxxxxx> wrote:
> > 
> > > I am dealing with radiology studies aka DICOM data) one would
> > > want an md5 function which streams in parts of a large object
> > > piece by piece using md5_update and m5_finalize or some such.
> > It would certainly be possible to write a lo_md5(oid) function to do
> > this, but as far as I'm aware nobody has yet done so.  How are your
> > C skills?
> 
> Next to none, unfortunately :-/

It would be nice to have this API exposed in pgcrypto,
but as a workaround, you could use plperl instead, or a
similarly "easy" PL that provides it.

Here's an example with plperlu, using the core
Digest module. It works with other hash types, not just MD5.
See http://perldoc.perl.org/Digest.html for the exact names
(warning: 'SHA-1' instead of pgcrypto's 'SHA1').

 CREATE OR REPLACE FUNCTION lo_digest(text, oid, int)
 RETURNS bytea AS
$$
  use Digest;
  use strict;
  use Data::Dumper;
  my $ctxt = Digest->new($_[0]);
  my $sz=$_[2];
  elog(ERROR, "Invalid chunk size: $sz") if ($sz<=0);
  my $sth = spi_query("SELECT lo_open($_[1], 262144) as fd");
  my $row = spi_fetchrow($sth);
  spi_cursor_close($sth); 

  if ($row) {
     my $fd = $row->{fd};
     my $bytes;
     my $plan = spi_prepare("SELECT loread($fd, $sz) as chunk");
     do {
       $sth = spi_query_prepared($plan);
       $row = spi_fetchrow($sth);
       $bytes = decode_bytea($row->{chunk});
       $ctxt->add($bytes);
       spi_cursor_close($sth);
     } while (length($bytes)>0);
     spi_exec_query("select lo_close($fd)");
     spi_freeplan($plan);
  }
  return encode_bytea($ctxt->digest);
$$ LANGUAGE plperlu;

Example of use:
# select lo_digest('MD5', 2557608, 2048*256);

	     lo_digest		    
------------------------------------
 \xa8447e145d0f8d9ca7fe7df1bbf06d75

2557608 is the oid and 2048*256 represents 512KB.
The lo blocksize is typically 2048, so a multiple of 2048 is ideal
to optimize reads.

I expect it to work on very large contents while requesting only modest
amounts of memory. But unfortunately is seems pretty slow.

On my desktop core i5-3470, it takes 2.5 seconds to digest
a 100 MB audio file already in cache. An independant Perl script
doing the equiavlent processing on the same data takes 0.4s .

It's not because of Digest::MD5 or loread(), it seems that it's the
pipe in-between with the text->bytea decoding that eats most of the
CPU cycles, that step being necessary because plperl lacks the ability
to consume bytea directly. Sigh...

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general