On 14 Jun 2005, at 14:33, Peter Fein wrote:
Alex Stapleton wrote:
On 13 Jun 2005, at 23:49, Peter Fein wrote:
Hi-
I wanted to use a partially unique index (dependent on a flag) on
a TEXT
column, but the index row size was too big for btrees. See the
thread
"index row size 2728 exceeds btree maximum, 2713" from the
beginning of
this month for someone with a similar problem. In it, someone
suggests
indexing on a hash of the text. I'm fine with this, as the texts in
question are similar enough to each other to make collisions
unlikely
and a collision won't really cause any serious problems.
My question is: is the builtin MD5 appropriate for this use or
should I
be using a function from pl/something? Figures on collision
rates would
be nice as well - the typical chunk of text is probably 1k-8k.
Thanks!
As others have said MD5 isn't the fastest one out there. However no
cryptographically secure hashes are really that fast. However
you can
probably get away with using a CRC hash which is long enough to
reduce
your chances of collision a lot. However, PostgreSQL doesn't have a
built in CRC function, which is a bit of a pain unless your
prepared to
implement one, or use pl/* to do it, which sounds like overkill. I
suggest you run some benchmarks on MD5 and see if it's fast
enough to
meet your current (and perhaps future) needs.
You could of course, just use a hash index on your text field! I
think
that would probably cope with larger data sets OK. It has the
advantage
of handling collisions for you as well :) However it means you
have to
transfer large amounts of data around, so if network speed ever
becomes
a limitation, MD5 hashing (or enabling compression on your PgSQL
connection) may help.
Let me clarify my problem:
I've got a bunch of texts, stored in column sometext. These texts are
assigned to groups, group_id. Each group may contain rows with
identical
sometext. Finding the appropriate group when inserting is very
expensive
and is calculated on the client side. I want to mark *one* of each of
the exact duplicates in a group with a flag, group_representative.
Basically, a partial index on:
(group_id, sometext) WHERE group_representative=TRUE.
The motivation is then to be able to do:
SELECT * FROM mytable WHERE <some other conditions> AND
group_representative=TRUE
instead of:
SELECT * FROM mytable WHERE <some other conditions> DISTINCT ON
(group_id, sometext)
since there are more of these queries than inserts (and they return a
lot more rows than belong to a single group_id). I suspect that
running
a DISTINCT in this case would be more expensive than updating the
index
when inserting.
Knowing the specifics of the data I'm putting in sometext, a halfway
decent hash function would make collisions so rare as to make the
chance
insignificant (and collisions wouldn't break anything anyway). Is this
approach reasonable, or should I use a hash index on (group_id,
sometext) - does this suffer from the same size limitation as
btrees? I
thought hash indexes were slow...
Hope that's clearer...
From the manual.
Testing has shown PostgreSQL's hash indexes to perform no better than
B-tree indexes, and the index size and build time for hash indexes is
much worse. For these reasons, hash index use is presently discouraged.
So yes, they are slower, but not by an enormous amount. (The lack of
any data to backup the manuals conclusions worries me, but
nevermind.) It is certainly worth trying.
The only hashing function PG has built in which you can actually
access is MD5. You might be able to get better performance from that
if data transmission takes a long time as you will only need to send
the hash to the database. You could then build a B-Tree index on the
MD5 hashes, which would certainly do what you want it to as the
likelihood of a collision is low. So basically it all boils down to
just how slow MD5 is, and just how fast your network connection to
your DB server(s) is.
--Pete
--
Peter Fein pfein@xxxxxxxxx
773-575-0694
Basically, if you're not a utopianist, you're a schmuck. -J. Feldman
---------------------------(end of
broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faq