Re: bit correctness and checksumming

Sage Weil <sage@xxxxxxxxxxx> · Wed, 16 Oct 2013 11:18:15 -0700 (PDT)

On Wed, 16 Oct 2013, james@xxxxxxxxxxxx wrote:
> Does Ceph log anywhere corrected(/caught) silent corruption - would be
> interesting to know how much a problem this is, in a large scale deployment.
> Something to gather in the league table mentioned at the London Ceph day?

It is logged, and causes the 'ceph health' check to complain. There are 
not currently any historical counts on how many inconsistencies have been 
found and subsequently repaired, though; this would be interesting to 
collect and report!

> Just thinking out loud (please shout me down...) - if the FS itself performs
> its own ECC, the ATA streaming command set might be of use to avoid any
> performance degradation due to drive-level recovery.

Maybe; I'm not familiar with the ATA streaming command set.

sage

> 
> 
> On 2013-10-16 17:12, Sage Weil wrote:
> > On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
> > > Hi all,
> > > There has been some confusion the past couple days at the CHEP
> > > conference during conversations about Ceph and protection from bit flips
> > > or other subtle data corruption. Can someone please summarise the
> > > current state of data integrity protection in Ceph, assuming we have an
> > > XFS backend filesystem? ie. don't rely on the protection offered by
> > > btrfs. I saw in the docs that wire messages and journal writes are
> > > CRC'd, but nothing explicit about the objects themselves.
> > 
> > - Everything that passes over the wire is checksummed (crc32c).  This is
> > mainly because the TCP checksum is so weak.
> > 
> > - The journal entries have a crc.
> > 
> > - During deep scrub, we read the objects and metadata, calculate a crc32c,
> > and compare across replicas.  This detects missing objects, bitrot,
> > failing disks, or any other source of inconsistency.
> > 
> > - Ceph does not calculate and store a per-object checksum.  Doing so is
> > difficult because rados allows arbitrary overwrites of parts of an object.
> > 
> > - Ceph *does* have a new opportunistic checksum feature, which is
> > currently only enabled in QA.  It calculates and stores checksums on
> > whatever block size you configure (e.g., 64k) if/when we write/overwrite a
> > complete block, and will verify any complete block read against the stored
> > crc, if one happens to be available.  This can help catch some but not all
> > sources of corruption.
> > 
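To make the opportunistic checksum idea concrete, here is a minimal Python
sketch of the behaviour described above. It is not Ceph code: the class
OpportunisticChecksums, its methods, and the in-memory bytearray "object" are
all hypothetical, and zlib.crc32 (plain CRC-32) stands in for crc32c, which is
not in the Python standard library.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # example block size from the text (64k)


class OpportunisticChecksums:
    """Toy object that keeps a CRC only for blocks written in full.

    Hypothetical illustration; zlib.crc32 is a stand-in for crc32c.
    """

    def __init__(self, size, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.data = bytearray(size)
        self.crcs = {}  # block index -> stored CRC (often missing)

    def write(self, offset, buf):
        end = offset + len(buf)
        if end > len(self.data):
            raise ValueError("write past end of object")
        if not buf:
            return
        self.data[offset:end] = buf
        first = offset // self.block_size
        last = (end - 1) // self.block_size
        for blk in range(first, last + 1):
            start = blk * self.block_size
            stop = start + self.block_size
            if offset <= start and stop <= end:
                # Complete block (over)written: store a fresh CRC.
                self.crcs[blk] = zlib.crc32(bytes(self.data[start:stop]))
            else:
                # Partial overwrite: any old CRC is now stale, so drop it.
                self.crcs.pop(blk, None)

    def read_block(self, blk):
        start = blk * self.block_size
        chunk = bytes(self.data[start:start + self.block_size])
        stored = self.crcs.get(blk)
        # Verify only if a CRC happens to be available for this block.
        if stored is not None and zlib.crc32(chunk) != stored:
            raise IOError("checksum mismatch in block %d" % blk)
        return chunk
```

In this sketch a block that has only ever been partially written has no
stored CRC, so corruption there passes through read_block() unnoticed; that
is the "some but not all" caveat.
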
> > > We also have some specific questions:
> > > 
> > > 1. Is an object checksum stored on the OSD somewhere? Is this in
> > > user.ceph._, because it wasn't obvious when looking at the code?
> > 
> > No (except for the new/experimental opportunistic crc I mention above).
> > 
> > > 2. When is the checksum verified? Surely it is checked during the deep
> > > scrub, but what about during an object read?
> > 
> > For non-btrfs backends, there is no crc to verify.  For btrfs, the fs
> > has its own crc and verifies it.
> > 
> > > 2b. Can a user read corrupted data if the master replica has a bit flip
> > > but this hasn't yet been found by a deep scrub?
> > 
> > Yes.
> > 
> > > 3. During deep scrub of an object with 2 replicas, suppose the checksum is
> > > different for the two objects -- which object wins? (I.e. if you store the
> > > checksum locally, this is trivial since the consistency of objects can be
> > > evaluated locally. Without the local checksum, you can have conflicts.)
> > 
> > In this case we normally choose the primary.  The repair has to be
> > explicitly triggered by the admin, however, and there are some options to
> > control that choice.
> > 
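The two-replica case can be sketched in a few lines of Python. This is an
illustration only, not how the OSD implements scrub or repair: deep_scrub()
and repair() are hypothetical names, the replicas are plain byte strings keyed
by made-up OSD names, and zlib.crc32 (plain CRC-32) stands in for crc32c.

```python
import zlib


def deep_scrub(replicas):
    """Return (consistent, digests) for a dict of OSD name -> object bytes."""
    digests = {osd: zlib.crc32(data) for osd, data in replicas.items()}
    consistent = len(set(digests.values())) <= 1
    return consistent, digests


def repair(replicas, authoritative):
    """Overwrite every replica with the copy held by the authoritative OSD."""
    good = replicas[authoritative]
    for osd in replicas:
        replicas[osd] = good
    return replicas


# Hypothetical object with a single-bit flip on the primary, osd.0
# ('o' is 0x6f, 'n' is 0x6e).
replicas = {"osd.0": b"helln world", "osd.1": b"hello world"}
consistent, digests = deep_scrub(replicas)
if not consistent:
    # Without a stored per-object checksum the digests only show that the
    # copies differ, not which one is right. Default choice: the primary.
    repair(replicas, authoritative="osd.0")
```

Here the corrupted primary copy wins, which is why the repair is left to the
admin and why the choice of authoritative copy needs to be controllable.
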
> > > 4. If the checksum is already stored per object in the OSD, is this
> > > retrievable by librados? We have some applications which also need to know
> > > the checksum of the data and this would be handy if it was already
> > > calculated by Ceph.
> > 
> > It would!  It may be that the way to get there is to build an API to
> > expose the opportunistic checksums, and/or to extend that feature to
> > maintain full checksums (by re-reading partially overwritten blocks on
> > write).  (Note, however, that even this wouldn't cover xattrs and omap
> > content; really this is something that "should" be handled by the backend
> > storage/file system.)
> > 
> > sage
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 