It was long ago and Linux was very different ..... With respect to today,
we found quite a few cases of bad RAID cards which had limited ECC checking
on their memory. Stuck bits had serious impacts given our data transit
volumes :-(

While the root causes we found in the past may be less likely today (as we
move towards replicas and away from hardware RAID), keeping background
scrubbing in place, along with a method to identify components which could
potentially be causing corruption through external probing and quality
checks, is very useful.

Tim

> -----Original Message-----
> From: ceph-users-bounces@xxxxxxxxxxxxxx
> [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of james@xxxxxxxxxxxx
> Sent: 16 October 2013 20:06
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: bit correctness and checksumming
>
> Very interesting link. I don't suppose there is any data available
> separating 4K and 512-byte sectored drives?
>
>
> On 2013-10-16 18:43, Tim Bell wrote:
> > At CERN, we have had cases in the past of silent corruptions. It is
> > good to be able to identify the devices causing them and swap them
> > out.
> >
> > It's an old presentation but the concepts are still relevant today ...
> > http://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
> >
> > Tim
> >
> >
> >> -----Original Message-----
> >> From: ceph-users-bounces@xxxxxxxxxxxxxx
> >> [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> >> james@xxxxxxxxxxxx
> >> Sent: 16 October 2013 18:54
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re: bit correctness and checksumming
> >>
> >>
> >> Does Ceph log corrected (or caught) silent corruption anywhere? It
> >> would be interesting to know how much of a problem this is in a
> >> large-scale deployment. Something to gather in the league table
> >> mentioned at the London Ceph day?
> >>
> >> Just thinking out loud (please shout me down...) - if the FS itself
> >> performs its own ECC, the ATA streaming command set might be of use
> >> to avoid performance degradation due to drive-level recovery
> >> altogether.
> >>
> >>
> >> On 2013-10-16 17:12, Sage Weil wrote:
> >> > On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
> >> >> Hi all,
> >> >> There has been some confusion the past couple of days at the CHEP
> >> >> conference during conversations about Ceph and protection from bit
> >> >> flips or other subtle data corruption. Can someone please summarise
> >> >> the current state of data integrity protection in Ceph, assuming we
> >> >> have an XFS backend filesystem? I.e. don't rely on the protection
> >> >> offered by btrfs. I saw in the docs that wire messages and journal
> >> >> writes are CRC'd, but nothing explicit about the objects themselves.
> >> >
> >> > - Everything that passes over the wire is checksummed (crc32c). This
> >> > is mainly because the TCP checksum is so weak.
> >> >
> >> > - The journal entries have a crc.
> >> >
> >> > - During deep scrub, we read the objects and metadata, calculate a
> >> > crc32c, and compare across replicas. This detects missing objects,
> >> > bitrot, failing disks, or any other source of inconsistency.
> >> >
> >> > - Ceph does not calculate and store a per-object checksum. Doing so
> >> > is difficult because rados allows arbitrary overwrites of parts of
> >> > an object.
> >> >
> >> > - Ceph *does* have a new opportunistic checksum feature, which is
> >> > currently only enabled in QA. It calculates and stores checksums on
> >> > whatever block size you configure (e.g., 64k) if/when we
> >> > write/overwrite a complete block, and will verify any complete block
> >> > read against the stored crc, if one happens to be available. This
> >> > can help catch some but not all sources of corruption.
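As a rough illustration of the block-aligned scheme Sage describes: a crc
is recorded only when a write covers a whole configured block, a partial
overwrite discards it, and a read is verified only when it spans a block
whose crc happens to be stored. This is just a sketch, not Ceph's actual
implementation; the names are invented, and zlib.crc32 stands in for the
crc32c (Castagnoli) that Ceph really uses.

    import zlib

    BLOCK_SIZE = 64 * 1024   # the configurable block size, e.g. 64k

    def crc(data):
        # zlib.crc32 is plain CRC-32; Ceph uses crc32c, which would need
        # a third-party module.
        return zlib.crc32(data) & 0xffffffff

    class OpportunisticChecksums:
        """Keep a crc only for blocks that were written in their entirety."""

        def __init__(self):
            self.block_crcs = {}   # block index -> stored crc

        def note_write(self, offset, data):
            if not data:
                return
            end = offset + len(data)
            for b in range(offset // BLOCK_SIZE, (end - 1) // BLOCK_SIZE + 1):
                b_start, b_end = b * BLOCK_SIZE, (b + 1) * BLOCK_SIZE
                if offset <= b_start and end >= b_end:
                    # The write covers this whole block: record its crc.
                    self.block_crcs[b] = crc(data[b_start - offset:b_end - offset])
                else:
                    # Partial overwrite: forget whatever we knew about it.
                    self.block_crcs.pop(b, None)

        def verify_read(self, offset, data):
            end = offset + len(data)
            for b, stored in self.block_crcs.items():
                b_start, b_end = b * BLOCK_SIZE, (b + 1) * BLOCK_SIZE
                if offset <= b_start and end >= b_end:
                    # A complete block with a stored crc: check it.
                    if crc(data[b_start - offset:b_end - offset]) != stored:
                        raise IOError("crc mismatch in block %d" % b)

An aligned 64k write records a crc; a later 4k overwrite inside that block
drops it again, which is why this catches some but not all corruption.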
> >> >> We also have some specific questions:
> >> >>
> >> >> 1. Is an object checksum stored on the OSD somewhere? Is this in
> >> >> user.ceph._, because it wasn't obvious when looking at the code?
> >> >
> >> > No (except for the new/experimental opportunistic crc I mention
> >> > above).
> >> >
> >> >> 2. When is the checksum verified? Surely it is checked during the
> >> >> deep scrub, but what about during an object read?
> >> >
> >> > For non-btrfs, there is no crc to verify. For btrfs, the fs has its
> >> > own crc and verifies it.
> >> >
> >> >> 2b. Can a user read corrupted data if the master replica has a bit
> >> >> flip but this hasn't yet been found by a deep scrub?
> >> >
> >> > Yes.
> >> >
> >> >> 3. During deep scrub of an object with 2 replicas, suppose the
> >> >> checksum is different for the two objects -- which object wins?
> >> >> (I.e. if you store the checksum locally, this is trivial since the
> >> >> consistency of objects can be evaluated locally. Without the local
> >> >> checksum, you can have conflicts.)
> >> >
> >> > In this case we normally choose the primary. The repair has to be
> >> > explicitly triggered by the admin, however, and there are some
> >> > options to control that choice.
> >> >
> >> >> 4. If the checksum is already stored per object in the OSD, is this
> >> >> retrievable by librados? We have some applications which also need
> >> >> to know the checksum of the data, and this would be handy if it was
> >> >> already calculated by Ceph.
> >> >
> >> > It would! It may be that the way to get there is to build an API to
> >> > expose the opportunistic checksums, and/or to extend that feature to
> >> > maintain full checksums (by re-reading partially overwritten blocks
> >> > on write). (Note, however, that even this wouldn't cover xattrs and
> >> > omap content; really this is something that "should" be handled by
> >> > the backend storage/file system.)
> >> >
> >> > sage
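Until such an API exists, an application that needs a checksum of its data
can compute one client-side and stash it next to the object, for example in
an xattr, via the librados Python bindings. A minimal sketch, assuming a
pool named 'data' and made-up object/xattr names, with zlib.crc32 again
standing in for crc32c; note it only verifies what RADOS hands back to the
client, not the replicas on disk, and says nothing about xattr or omap
content:

    import zlib
    import rados   # python-rados bindings shipped with Ceph

    def crc(data):
        return zlib.crc32(data) & 0xffffffff

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')          # pool name is an example
    try:
        payload = b'some application data'

        # Write the object and store the application's own checksum beside it.
        ioctx.write_full('myobject', payload)
        ioctx.set_xattr('myobject', 'app.crc32', str(crc(payload)).encode())

        # Later: read the whole object back and verify it against the xattr.
        size, _mtime = ioctx.stat('myobject')
        data = ioctx.read('myobject', size, 0)
        stored = int(ioctx.get_xattr('myobject', 'app.crc32'))
        if crc(data) != stored:
            raise IOError('checksum mismatch on myobject')
    finally:
        ioctx.close()
        cluster.shutdown()

Reading the whole object in one call obviously only suits modest object
sizes; larger objects would need to be read and checksummed in chunks.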