Re: Hash Function: MD5 or other?

Peter Fein <pfein@xxxxxxxxx> · Tue, 14 Jun 2005 08:33:34 -0500

Alex Stapleton wrote:
> 
> On 13 Jun 2005, at 23:49, Peter Fein wrote:
> 
> 
>> Hi-
>>
>> I wanted to use a partially unique index (dependent on a flag) on a  TEXT
>> column, but the index row size was too big for btrees.  See the thread
>> "index row size 2728 exceeds btree maximum, 2713" from the  beginning of
>> this month for someone with a similar problem.  In it, someone  suggests
>> indexing on a hash of the text.  I'm fine with this, as the texts in
>> question are similar enough to each other to make collisions unlikely
>> and a collision won't really cause any serious problems.
>>
>> My question is: is the builtin MD5 appropriate for this use or  should I
>> be using a function from pl/something?  Figures on collision rates  would
>> be nice as well - the typical chunk of text is probably 1k-8k.
>>
>> Thanks!
>>
> 
> As others have said MD5 isn't the fastest one out there. However no 
> cryptographically  secure hashes are really that fast. However you  can
> probably get away with using a CRC hash which is long enough to  reduce
> your chances of collision a lot. However, PostgreSQL doesn't  have a
> built in CRC function, which is a bit of a pain unless your  prepared to
> implement one, or use pl/* to do it, which sounds like  overkill. I
> suggest you run some benchmarks on MD5 and see if it's  fast enough to
> meet your current (and perhaps future) needs.
> 
> You could of course, just use a hash index on your text field! I  think
> that would probably cope with larger data sets OK. It has the  advantage
> of handling collisions for you as well :) However it means  you have to
> transfer large amounts of data around, so if network  speed ever becomes
> a limitation, MD5 hashing (or enabling compression  on your PgSQL
> connection) may help.

Let me clarify my problem:

I've got a bunch of texts, stored in column sometext. These texts are
assigned to groups, group_id. Each group may contain rows with identical
sometext. Finding the appropriate group when inserting is very expensive
and is calculated on the client side. I want to mark *one* of each of
the exact duplicates in a group with a flag, group_representative.
Basically, a partial index on:

(group_id, sometext) WHERE group_representative=TRUE.

The motivation is then to be able to do:

SELECT * FROM mytable WHERE <some other conditions> AND
group_representative=TRUE

instead of:

SELECT * FROM mytable WHERE <some other conditions> DISTINCT ON
(group_id, sometext)

since there are more of these queries than inserts (and they return a
lot more rows than belong to a single group_id). I suspect that running
a DISTINCT in this case would be more expensive than updating the index
when inserting.

Knowing the specifics of the data I'm putting in sometext, a halfway
decent hash function would make collisions so rare as to make the chance
insignificant (and collisions wouldn't break anything anyway). Is this
approach reasonable, or should I use a hash index on (group_id,
sometext) - does this suffer from the same size limitation as btrees? I
thought hash indexes were slow...

Hope that's clearer...

--Pete

-- 
Peter Fein                 pfein@xxxxxxxxx                 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster