Karen Hill wrote:
Tom Lane wrote:
"Karen Hill" <karen_hill22@xxxxxxxxx> writes:
Ralph Kimball states that this is a way to check for changes. You just
have an extra column for the crc checksum. When you go to update data,
generate a crc checksum and compare it to the one in the crc column.
If they are same, your data has not changed.
You sure that's actually what he said? A change in CRC proves the data
changed, but lack of a change does not prove it didn't.
On page 100 in the book, "The Data Warehouse Toolkit" Second Edition,
Ralph Kimball writes the following:
"Rather than checking each field to see if something has changed, we
instead compute a checksum for the entire row all at once. A cyclic
redundancy checksum (CRC) algorithm helps us quickly recognize that a
wide messy row has changed without looking at each of its constituent
fields."
On page 360 he writes:
"To quickly determine if rows have changed, we rely on a cyclic
redundancy checksum (CRC) algorithm. If the CRC is identical for the
extracted record and the most recent row in the master table, then we
ignore the extracted record. We don't need to check every column to be
certain that the two rows match exactly."
People do sometimes use this logic in connection with much wider
"summary" functions, such as an MD5 hash. I wouldn't trust it at all
with a 32-bit CRC, and not much with a 64-bit CRC. Too much risk of
collision.
An MD5 hash value has 128 bits; an SHA1 hash value has 160 bits.
Roughly speaking, a 32-bit checksum gives you a 1 in 4 billion chance
that a change won't be detected; a 64-bit checksum gives you a 1 in 16
billion billion chance; and a 128-bit checksum therefore gives you a 1
in 256 billion billion billion billion chance. Every hash is subject to
collisions - there are an indefinitely large number of possible input
values (of widely differing lengths) that produce the same output.
However, the chance of actually encountering two such values is very small.
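
For illustration, here is a rough sketch (in Python, with made-up column
values and a made-up serialization format - the real layout would depend
on your tables) of the extra-checksum-column idea: serialize the row
once, hash it, and compare against the value stored from the last load.

    import hashlib
    import zlib

    def row_checksum(row, algo="md5"):
        # Serialize the row's values with a delimiter and hash the result.
        # The delimiter must not occur in the data, or two different rows
        # could serialize identically.
        serialized = "|".join("" if v is None else str(v) for v in row).encode("utf-8")
        if algo == "crc32":
            return format(zlib.crc32(serialized), "08x")    # 32 bits
        return hashlib.md5(serialized).hexdigest()          # 128 bits

    # Hypothetical extracted record and the checksum stored with the
    # most recent row in the master table.
    extracted = ("C1042", "Jane Smith", "42 Elm St", "Springfield")
    stored_checksum = "..."   # whatever was saved in the extra column

    if row_checksum(extracted) == stored_checksum:
        pass   # identical checksum: treat the row as unchanged, skip it
    else:
        pass   # checksum differs: the row definitely changed, apply it
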
And, don't forget, if you update a row with a few changes (so some bytes
change but most do not), the chance that the new row produces the same
checksum as the old row is very small with a well-designed checksum.
Most updates will produce small changes in the data, but big changes in
the checksum.
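
As a quick illustration of that point (again just a sketch with made-up
data), a one-character change to the row gives a completely different
digest:

    import hashlib

    old_row = "C1042|Jane Smith|42 Elm St|Springfield"
    new_row = "C1042|Jane Smith|44 Elm St|Springfield"   # one byte changed

    print(hashlib.md5(old_row.encode()).hexdigest())
    print(hashlib.md5(new_row.encode()).hexdigest())
    # The two 128-bit digests bear no obvious resemblance, even though
    # the inputs differ in a single character.
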
The other issue is: how long does it take to compute the checksum
compared with doing a field-by-field check? There are many facets to
that answer, related to caching of the old values and how the comparison
with the new values is done.
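
One rough way to get a feel for that trade-off (a sketch only; real
numbers depend on row width, data types, and whether the old row is
already in memory):

    import hashlib
    import timeit

    old = tuple(f"value-{i}" for i in range(40))     # a "wide messy row"
    new = old[:39] + ("value-changed",)              # last field differs

    # Stored from a prior load; the checksum approach never needs the
    # old row's fields at comparison time, only this digest.
    old_digest = hashlib.md5("|".join(old).encode()).digest()

    def field_by_field():
        # Needs every field of the old row available for comparison.
        return all(a == b for a, b in zip(old, new))

    def by_checksum():
        # Hashes the new row once and compares against the stored digest.
        return hashlib.md5("|".join(new).encode()).digest() == old_digest

    print(timeit.timeit(field_by_field, number=100_000))
    print(timeit.timeit(by_checksum, number=100_000))
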
--
Jonathan Leffler #include <disclaimer.h>
Email: jleffler@xxxxxxxxxxxxx, jleffler@xxxxxxxxxx
Guardian of DBD::Informix v2005.02 -- http://dbi.perl.org/