Re: cyclical redundancy checksum algorithm(s)?

Volker Hetzer <firstname.lastname@xxxxxxxx> · Thu, 28 Sep 2006 19:33:51 +0200

Karen Hill schrieb:
I just finished reading one of Ralph Kimball's books.  In it he
mentions something called a cyclical redundancy checksum (crc)
function.  A crc function is a hash function that generates a checksum.

I am wondering a few things.  A crc function would be extremely useful
and time saving in determining if a row needs to be updated or not (are
the values the same, if yes don't update, if not update).  In fact
Ralph Kimball states that this is a way to check for changes.  You just
have an extra column for the crc checksum.  When you go to update data,
generate a crc checksum and compare it to the one in the crc column.
If they are same, your data has not changed.

Yet what happens if there is a collision of the checksum for a row?
It's simple:
If a crc indicates a difference, you can rely on the answer. If a crc
indicates equality, you have to do a real comparison.

Therefore a CRC can give you a large speedup, depending on the situation
and CRC size. Small CRCs (like 2 or 3 bit) are very fast to compare but the
false positive percentage is high, necessitating lots of full comparisons.
crc's almost as long as the data sets themselves are almost as slow as
exhaustive comparison but ere almost always correct.

Given a certain average data record size, hash size, number of data
records and relative frequency of comparison versus data change (i.e. crc
calculation) you can figure out how much time you save when using crc's.

But in most circumstances you don't have to bother because...

Is there a crc function in postgresql?  If not what algorithm would I
need to use to create one in pl/pgsql?
...normally a simple 32bit crc is amply sufficient.
Given 2^16 data records, about /one/ pair out of them will share one crc,
and, given one data record, /one/ other out of 2^32 data records will have
the same crc.
They are fast to compute and can be indexed as integer. That's what I did
when I once had to implement a simple disk mirroring application using mysql
for the comparisons.
This compared several million file names (daily) using the method indicated
above and worked flawlessly and fast.

Lots of Greetings!
Volker
--
For email replies, please substitute the obvious.