
Re: cyclical redundancy checksum algorithm(s)?

Karen Hill wrote:
Tom Lane wrote:
"Karen Hill" <karen_hill22@xxxxxxxxx> writes:
Ralph Kimball states that this is a way to check for changes.  You just
have an extra column for the crc checksum.  When you go to update data,
generate a crc checksum and compare it to the one in the crc column.
If they are same, your data has not changed.
You sure that's actually what he said?  A change in CRC proves the data
changed, but lack of a change does not prove it didn't.


On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
Ralph Kimball writes the following:

"Rather than checking each field to see if something has changed, we
instead compute a checksum for the entire row all at once.  A cyclic
redundancy checksum (CRC) algorithm helps us quickly recognize that a
wide messy row has changed without looking at each of its constituent
fields."

On page 360 he writes:

"To quickly determine if rows have changed, we rely on a cyclic
redundancy checksum (CRC) algorithm.   If the CRC is identical for the
extracted record and the most recent row in the master table, then we
ignore the extracted record.  We don't need to check every column to be
certain that the two rows match exactly."

People do sometimes use this logic in connection with much wider
"summary" functions, such as an MD5 hash.  I wouldn't trust it at all
with a 32-bit CRC, and not much with a 64-bit CRC.  Too much risk of
collision.
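As a sketch of that "wider summary function" approach (illustrative Python, not from the thread; the row values and the delimiter choice are invented), one might compute an MD5 digest over a canonical serialization of the row:

```python
import hashlib

def row_checksum(row):
    """MD5 hex digest over a canonical serialization of a row.

    A delimiter that cannot occur in the data is essential; otherwise
    ('ab', 'c') and ('a', 'bc') would serialize to the same string
    and hash identically.
    """
    canonical = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

old_row = ("Karen", "Hill", "2006-09-26")
new_row = ("Karen", "Hill", "2006-09-27")

# Identical rows always hash identically; a changed row almost
# certainly hashes differently with a 128-bit digest.
print(row_checksum(old_row) == row_checksum(old_row))  # True
print(row_checksum(old_row) == row_checksum(new_row))  # False
```

The same idea works with SHA-1 via hashlib.sha1 if 160 bits are preferred.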


An MD5 hash value has 128 bits; an SHA-1 hash value has 160 bits. Roughly speaking, a 32-bit checksum gives you a 1 in 4 billion chance that a change won't be detected; a 64-bit checksum gives you a 1 in 16 billion billion chance; and a 128-bit checksum therefore gives you a 1 in 256 billion billion billion billion chance. Every hash is subject to collisions - there is an indefinitely large number of possible input values (of widely differing lengths) that produce the same output. However, actually seeing two such values is very improbable.
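Those odds are just 1 in 2^n for an n-bit checksum, as a quick sanity check shows (Python):

```python
# Probability that an independent random value collides with a
# given n-bit checksum is 1 in 2**n.
for bits in (32, 64, 128):
    print(f"{bits}-bit checksum: 1 in {2**bits:,}")

# 2**32  is 4,294,967,296               (~4 billion)
# 2**64  is 18,446,744,073,709,551,616  (~16 billion billion, roughly)
# 2**128 is ~3.4e38
```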

And, don't forget, if you update a row with a few changes (so some bytes change but most do not), the chance that the new row produces the same checksum as the old row is very small with a well-designed checksum. Most updates will produce small changes in the data, but big changes in the checksum.
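That property is easy to demonstrate with Python's zlib.crc32 (an illustrative sketch; the sample row is invented). A CRC is in fact guaranteed to detect any single-bit change, and any error burst no wider than the CRC itself:

```python
import zlib

old = b"Karen|Hill|2006-09-26"
new = b"Karen|Hill|2006-09-27"  # one byte differs

# The two 32-bit values differ even though only one input byte changed;
# a well-designed checksum spreads small input changes across all bits.
print(hex(zlib.crc32(old)))
print(hex(zlib.crc32(new)))
print(zlib.crc32(old) != zlib.crc32(new))  # True
```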

The other issue is how long it takes to compute the checksum compared with doing a field-by-field check. There are many facets to that answer, related to caching of the old values and how the comparison with the new values is done.
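A rough way to measure that trade-off in Python (illustrative only; an in-memory tuple comparison is not representative of comparing against rows fetched from a master table, where the checksum's real saving is avoiding the fetch of every column):

```python
import hashlib
import timeit

# A hypothetical "wide messy row" of 50 text fields.
row_a = tuple(f"field-{i}" for i in range(50))
row_b = row_a[:49] + ("changed",)

def checksum_compare():
    digest = lambda r: hashlib.md5("\x1f".join(r).encode()).digest()
    return digest(row_a) == digest(row_b)

def field_by_field():
    return row_a == row_b

t_checksum = timeit.timeit(checksum_compare, number=10_000)
t_fields = timeit.timeit(field_by_field, number=10_000)
print(f"checksum: {t_checksum:.4f}s  field-by-field: {t_fields:.4f}s")
```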

--
Jonathan Leffler                   #include <disclaimer.h>
Email: jleffler@xxxxxxxxxxxxx, jleffler@xxxxxxxxxx
Guardian of DBD::Informix v2005.02 -- http://dbi.perl.org/

