Re: Hash Function: MD5 or other?

Alex Stapleton <alexs@xxxxxxxxx> · Tue, 14 Jun 2005 14:55:45 +0100

On 14 Jun 2005, at 14:33, Peter Fein wrote:

Alex Stapleton wrote:

On 13 Jun 2005, at 23:49, Peter Fein wrote:

Hi-

I wanted to use a partially unique index (dependent on a flag) on  
a  TEXT
column, but the index row size was too big for btrees.  See the  
thread
"index row size 2728 exceeds btree maximum, 2713" from the   
beginning of
this month for someone with a similar problem.  In it, someone   
suggests
indexing on a hash of the text.  I'm fine with this, as the texts in
question are similar enough to each other to make collisions  
unlikely
and a collision won't really cause any serious problems.

My question is: is the builtin MD5 appropriate for this use or   
should I
be using a function from pl/something?  Figures on collision  
rates  would
be nice as well - the typical chunk of text is probably 1k-8k.

Thanks!

As others have said MD5 isn't the fastest one out there. However no
cryptographically  secure hashes are really that fast. However  
you  can
probably get away with using a CRC hash which is long enough to   
reduce
your chances of collision a lot. However, PostgreSQL doesn't  have a
built in CRC function, which is a bit of a pain unless your   
prepared to
implement one, or use pl/* to do it, which sounds like  overkill. I
suggest you run some benchmarks on MD5 and see if it's  fast  
enough to
meet your current (and perhaps future) needs.

You could of course, just use a hash index on your text field! I   
think
that would probably cope with larger data sets OK. It has the   
advantage
of handling collisions for you as well :) However it means  you  
have to
transfer large amounts of data around, so if network  speed ever  
becomes
a limitation, MD5 hashing (or enabling compression  on your PgSQL
connection) may help.

Let me clarify my problem:

I've got a bunch of texts, stored in column sometext. These texts are
assigned to groups, group_id. Each group may contain rows with  
identical
sometext. Finding the appropriate group when inserting is very  
expensive
and is calculated on the client side. I want to mark *one* of each of
the exact duplicates in a group with a flag, group_representative.
Basically, a partial index on:

(group_id, sometext) WHERE group_representative=TRUE.

The motivation is then to be able to do:

SELECT * FROM mytable WHERE <some other conditions> AND
group_representative=TRUE

instead of:

SELECT * FROM mytable WHERE <some other conditions> DISTINCT ON
(group_id, sometext)

since there are more of these queries than inserts (and they return a
lot more rows than belong to a single group_id). I suspect that  
running
a DISTINCT in this case would be more expensive than updating the  
index
when inserting.

Knowing the specifics of the data I'm putting in sometext, a halfway
decent hash function would make collisions so rare as to make the  
chance
insignificant (and collisions wouldn't break anything anyway). Is this
approach reasonable, or should I use a hash index on (group_id,
sometext) - does this suffer from the same size limitation as  
btrees? I
thought hash indexes were slow...

Hope that's clearer...

From the manual.

Testing has shown PostgreSQL's hash indexes to perform no better than  
B-tree indexes, and the index size and build time for hash indexes is  
much worse. For these reasons, hash index use is presently discouraged.

So yes, they are slower, but not by an enormous amount. (The lack of  
any data to backup the manuals conclusions worries me, but  
nevermind.) It is certainly worth trying.

The only hashing function PG has built in which you can actually  
access is MD5. You might be able to get better performance from that  
if data transmission takes a long time as you will only need to send  
the hash to the database. You could then build a B-Tree index on the  
MD5 hashes, which would certainly do what you want it to as the  
likelihood of a collision is low. So basically it all boils down to  
just how slow MD5 is, and just how fast your network connection to  
your DB server(s) is.

--Pete

--
Peter Fein                 pfein@xxxxxxxxx                  
773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman

---------------------------(end of  
broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

              http://www.postgresql.org/docs/faq