Does Ceph log corrected (or caught) silent corruption anywhere? It would
be interesting to know how much of a problem this is in a large-scale
deployment. Something to gather in the league table mentioned at the
London Ceph day?
Just thinking out loud (please shout me down...) - if the FS itself
performs its own ECC, the ATA streaming command set might be of use to
avoid performance degradation due to drive-level recovery altogether.
On 2013-10-16 17:12, Sage Weil wrote:
On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
Hi all,
There has been some confusion the past couple days at the CHEP
conference during conversations about Ceph and protection from bit
flips
or other subtle data corruption. Can someone please summarise the
current state of data integrity protection in Ceph, assuming we have
an
XFS backend filesystem? ie. don't rely on the protection offered by
btrfs. I saw in the docs that wire messages and journal writes are
CRC'd, but nothing explicit about the objects themselves.
- Everything that passes over the wire is checksummed (crc32c). This is
mainly because the TCP checksum is so weak.
- The journal entries have a crc.
- During deep scrub, we read the objects and metadata, calculate a
crc32c, and compare across replicas. This detects missing objects,
bitrot, failing disks, or any other source of inconsistency.
- Ceph does not calculate and store a per-object checksum. Doing so is
difficult because rados allows arbitrary overwrites of parts of an
object.
- Ceph *does* have a new opportunistic checksum feature, which is
currently only enabled in QA. It calculates and stores checksums on
whatever block size you configure (e.g., 64k) if/when we write/overwrite
a complete block, and will verify any complete block read against the
stored crc, if one happens to be available. This can help catch some
but not all sources of corruption.
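To make the deep-scrub idea concrete, here is a toy sketch of comparing
checksums across replicas. Note the assumptions: Ceph uses crc32c, while
this sketch uses the stdlib's zlib.crc32 for simplicity, and the
majority-vote detection here is only an illustration (the actual repair
choice defaults to the primary, as discussed later in this thread).

```python
# Toy illustration of deep-scrub-style comparison: checksum each
# replica's object data and flag any replica that disagrees with the
# majority. (Ceph uses crc32c; zlib.crc32 stands in here.)
from collections import Counter
import zlib

def scrub_compare(replicas):
    """replicas: dict mapping osd id -> object bytes.
    Returns the set of OSDs whose checksum disagrees with the majority."""
    crcs = {osd: zlib.crc32(data) for osd, data in replicas.items()}
    majority_crc, _ = Counter(crcs.values()).most_common(1)[0]
    return {osd for osd, crc in crcs.items() if crc != majority_crc}

# A single bit flip in osd 2's copy is detected:
good = b"object payload " * 1024
flipped = bytearray(good)
flipped[100] ^= 0x01
print(scrub_compare({0: good, 1: good, 2: bytes(flipped)}))  # -> {2}
```

Note that with only 2 replicas a checksum mismatch tells you *that* the
copies differ, not *which* one is wrong - which is exactly the conflict
raised in question 3 below.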
We also have some specific questions:
1. Is an object checksum stored on the OSD somewhere? Is it in
user.ceph._? It wasn't obvious when looking at the code.
No (except for the new/experimental opportunistic crc I mention
above).
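As a minimal model of that opportunistic behaviour: a crc is stored only
when a write covers a complete block, and a full-block read is verified
only when a crc happens to be available. This is a sketch, not Ceph's
implementation; the block size here is 16 bytes for brevity (Ceph's
would be configurable, e.g. 64k), and zlib.crc32 stands in for crc32c.

```python
import zlib

BLOCK = 16  # tiny block size for illustration only

class OpportunisticCrcStore:
    def __init__(self):
        self.data = bytearray()
        self.crcs = {}  # block index -> crc, present only when known

    def write(self, off, buf):
        end = off + len(buf)
        if end > len(self.data):
            self.data.extend(bytes(end - len(self.data)))
        self.data[off:end] = buf
        for b in range(off // BLOCK, (end + BLOCK - 1) // BLOCK):
            if off <= b * BLOCK and end >= (b + 1) * BLOCK:
                # complete block written: compute and store its crc
                self.crcs[b] = zlib.crc32(self.data[b * BLOCK:(b + 1) * BLOCK])
            else:
                # partial overwrite: any stored crc is no longer valid
                self.crcs.pop(b, None)

    def read_block(self, b):
        blk = bytes(self.data[b * BLOCK:(b + 1) * BLOCK])
        if b in self.crcs and zlib.crc32(blk) != self.crcs[b]:
            raise IOError("crc mismatch in block %d" % b)
        return blk

store = OpportunisticCrcStore()
store.write(0, b"A" * BLOCK)   # full block -> crc recorded
store.write(BLOCK, b"B" * 4)   # partial write -> no crc for block 1
store.data[3] ^= 0x01          # simulate a bit flip in block 0
try:
    store.read_block(0)
except IOError as e:
    print(e)                   # crc mismatch in block 0
```

This also shows the "some but not all" caveat: a flip in block 1, which
has no stored crc, would be returned to the reader unverified.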
2. When is the checksum verified? Surely it is checked during the
deep scrub, but what about during an object read?
For non-btrfs, there is no crc to verify. For btrfs, the fs has its own
crc and verifies it.
2b. Can a user read corrupted data if the master replica has a bit
flip but this hasn't yet been found by a deep scrub?
Yes.
3. During deep scrub of an object with 2 replicas, suppose the
checksum is different for the two objects -- which object wins? (I.e.
if you store the checksum locally, this is trivial since the
consistency of objects can be evaluated locally. Without the local
checksum, you can have conflicts.)
In this case we normally choose the primary. The repair has to be
explicitly triggered by the admin, however, and there are some options
to control that choice.
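For reference, scrubbing and repair are driven from the admin CLI; the
PG id below is just an example.

```shell
# Deep-scrub a placement group; inconsistencies show up as "scrub errors"
# in `ceph -s` and in the OSD logs.
ceph pg deep-scrub 2.5

# Repair must then be triggered explicitly; by default the primary's
# copy wins.
ceph pg repair 2.5
```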
4. If the checksum is already stored per object in the OSD, is this
retrievable by librados? We have some applications which also need to
know the checksum of the data and this would be handy if it was
already calculated by Ceph.
It would! It may be that the way to get there is to build an API to
expose the opportunistic checksums, and/or to extend that feature to
maintain full checksums (by re-reading partially overwritten blocks on
write). (Note, however, that even this wouldn't cover xattrs and omap
content; really this is something that "should" be handled by the
backend storage/file system.)
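As a purely hypothetical sketch of what such an API could look like -
librados exposes no such call today, and every name below is invented
for illustration - imagine an ioctx method returning the maintained
full-object checksum. The toy below simulates it by recomputing a crc32
over an in-memory object store.

```python
import zlib

class ToyIoctx:
    """Stand-in for a rados ioctx: objects live in a dict, and a
    hypothetical get_checksum() returns a crc over the full object, as a
    'maintain full checksums on write' scheme might expose."""
    def __init__(self):
        self.objs = {}

    def write_full(self, oid, data):
        self.objs[oid] = bytes(data)

    def get_checksum(self, oid):
        # In the real proposal this would return a checksum maintained
        # at write time rather than recomputed from the data.
        return zlib.crc32(self.objs[oid])

ioctx = ToyIoctx()
ioctx.write_full("myobject", b"hello world")
print(ioctx.get_checksum("myobject"))
```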
sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com