Re: Chances of silent errors?

pg@xxxxxxxxxxxxxxxxxxxx (Peter Grandi) · Mon, 21 Jan 2013 22:39:22 +0000

> Coming from the zfs world, I've heard a few talk about the
> chances of "silent errors", meaning the checksum on the drives
> match, but the data being bad because of matching checksum
> (aka collisions). [ ... ]

That's a very narrow definition of "silent errors": they happen
in any case where incorrect data has been written to persistent
storage from memory, and yet no error has been signaled.

A common cause of those is bugs in software or firmware (HBA,
disk, ...) that either read or write the wrong blocks or modify
them in transit.

The classic report on this is from CERN's extensive testing:

  http://w3.hepix.org/storage/hep_pdf/2007/Spring/kelemen-2007-HEPiX-Silent_Corruptions.pdf

As to checksum collisions, that depends a bit on sector size and
on the type/length of the checksum. "Enterprise" drives can
usually be formatted with different sector sizes to accommodate
different checksum sizes. I would also suspect that it is far
more likely that very different blocks on the same disk
legitimately have the same checksum than that a slightly
corrupted block gets the same checksum as the uncorrupted one...
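
To put rough numbers on that suspicion, here is a minimal sketch.
It assumes checksums behave like uniformly random values, which is
only an approximation, and the disk size and checksum width are
hypothetical figures chosen for illustration, not taken from any
manual. The function names are mine.

```python
def expected_colliding_pairs(num_blocks: int, checksum_bits: int) -> float:
    """Expected number of block pairs that share a checksum value,
    assuming checksums are uniformly distributed (an assumption)."""
    pairs = num_blocks * (num_blocks - 1) // 2
    return pairs / 2 ** checksum_bits


def random_corruption_pass_probability(checksum_bits: int) -> float:
    """Chance that a block replaced by random garbage still matches
    the stored checksum, under the same uniformity assumption."""
    return 2.0 ** -checksum_bits


if __name__ == "__main__":
    blocks = 2 ** 31  # hypothetical: 1 TiB of 512-byte sectors
    bits = 16         # hypothetical: a 16-bit per-sector checksum
    print("expected legitimate colliding pairs:",
          expected_colliding_pairs(blocks, bits))
    print("chance random garbage passes the check:",
          random_corruption_pass_probability(bits))
```

With far more blocks than possible checksum values, legitimate
collisions between different blocks are unavoidable, while any one
corrupted block has only a small chance of slipping through. For a
CRC in particular, small corruptions should fare even worse than
the random-garbage figure, since a CRC is designed to catch short
error bursts.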

For some context, see the details in the very informative SAVVIO
product manual here, page 15, the "Miscorrected Data" line:

  http://www.seagate.com/internal-hard-drives/enterprise-hard-drives/hdd/savvio-15k/

or also, page 43, the section "Protection Information".

But note that the URE is the *Unrecovered* Error Rate, that is,
the rate of errors that have been detected but not corrected,
not the *Undetected* Error Rate.
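
As a rough sketch of what a URE figure does tell you, the
arithmetic below estimates how many *detected but unrecovered*
errors to expect when reading a given amount of data. The rate
and capacity are hypothetical placeholders, not quotes from the
manual, errors are assumed independent, and the result says
nothing about undetected errors.

```python
import math


def expected_unrecovered_errors(bytes_read: int, errors_per_bit: float) -> float:
    """Expected count of unrecovered (detected, uncorrected) read
    errors, assuming independent errors at a fixed per-bit rate."""
    return bytes_read * 8 * errors_per_bit


def probability_of_at_least_one(expected_errors: float) -> float:
    """Poisson approximation: P(at least one) = 1 - exp(-expected)."""
    return -math.expm1(-expected_errors)


if __name__ == "__main__":
    capacity = 10 ** 12  # hypothetical: one full read of a 1 TB drive
    rate = 1e-16         # hypothetical: 1 unrecovered error per 10^16 bits
    expected = expected_unrecovered_errors(capacity, rate)
    print("expected unrecovered errors:", expected)
    print("P(at least one):", probability_of_at_least_one(expected))
```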

As someone famously said, as far as he knew his datacenter never
had an undetected error.