Search Postgresql Archives

Re: Hash Function: MD5 or other?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 14 Jun 2005, at 14:33, Peter Fein wrote:

Alex Stapleton wrote:


On 13 Jun 2005, at 23:49, Peter Fein wrote:



Hi-

I wanted to use a partially unique index (dependent on a flag) on a TEXT column, but the index row size was too big for btrees. See the thread "index row size 2728 exceeds btree maximum, 2713" from the beginning of this month for someone with a similar problem. In it, someone suggests
indexing on a hash of the text.  I'm fine with this, as the texts in
question are similar enough to each other to make collisions unlikely
and a collision won't really cause any serious problems.

My question is: is the builtin MD5 appropriate for this use or should I be using a function from pl/something? Figures on collision rates would
be nice as well - the typical chunk of text is probably 1k-8k.

Thanks!



As others have said MD5 isn't the fastest one out there. However no
cryptographically secure hashes are really that fast. However you can probably get away with using a CRC hash which is long enough to reduce
your chances of collision a lot. However, PostgreSQL doesn't  have a
built in CRC function, which is a bit of a pain unless your prepared to
implement one, or use pl/* to do it, which sounds like  overkill. I
suggest you run some benchmarks on MD5 and see if it's fast enough to
meet your current (and perhaps future) needs.

You could of course, just use a hash index on your text field! I think that would probably cope with larger data sets OK. It has the advantage of handling collisions for you as well :) However it means you have to transfer large amounts of data around, so if network speed ever becomes
a limitation, MD5 hashing (or enabling compression  on your PgSQL
connection) may help.


Let me clarify my problem:

I've got a bunch of texts, stored in column sometext. These texts are
assigned to groups, group_id. Each group may contain rows with identical sometext. Finding the appropriate group when inserting is very expensive
and is calculated on the client side. I want to mark *one* of each of
the exact duplicates in a group with a flag, group_representative.
Basically, a partial index on:

(group_id, sometext) WHERE group_representative=TRUE.

The motivation is then to be able to do:

SELECT * FROM mytable WHERE <some other conditions> AND
group_representative=TRUE

instead of:

SELECT * FROM mytable WHERE <some other conditions> DISTINCT ON
(group_id, sometext)

since there are more of these queries than inserts (and they return a
lot more rows than belong to a single group_id). I suspect that running a DISTINCT in this case would be more expensive than updating the index
when inserting.

Knowing the specifics of the data I'm putting in sometext, a halfway
decent hash function would make collisions so rare as to make the chance
insignificant (and collisions wouldn't break anything anyway). Is this
approach reasonable, or should I use a hash index on (group_id,
sometext) - does this suffer from the same size limitation as btrees? I
thought hash indexes were slow...

Hope that's clearer...

From the manual.

Testing has shown PostgreSQL's hash indexes to perform no better than B-tree indexes, and the index size and build time for hash indexes is much worse. For these reasons, hash index use is presently discouraged.

So yes, they are slower, but not by an enormous amount. (The lack of any data to backup the manuals conclusions worries me, but nevermind.) It is certainly worth trying.

The only hashing function PG has built in which you can actually access is MD5. You might be able to get better performance from that if data transmission takes a long time as you will only need to send the hash to the database. You could then build a B-Tree index on the MD5 hashes, which would certainly do what you want it to as the likelihood of a collision is low. So basically it all boils down to just how slow MD5 is, and just how fast your network connection to your DB server(s) is.

--Pete

--
Peter Fein pfein@xxxxxxxxx 773-575-0694

Basically, if you're not a utopianist, you're a schmuck. -J. Feldman

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster




---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

              http://www.postgresql.org/docs/faq

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux