It was long ago and Linux was very different ..... With respect to today,
we found quite a few cases of bad RAID cards which had limited ECC checking
on their memory. Stuck bits had serious impacts given our data transit
volumes :-(

While the root causes we found in the past may be less likely today (as we
move towards replicas and away from hardware RAID), keeping background
scrubbing in place, along with a method to identify components which could
potentially be causing corruption through external probing and quality
checks, is very useful.

Tim

> -----Original Message-----
> From: ceph-users-bounces@xxxxxxxxxxxxxx
> [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of james@xxxxxxxxxxxx
> Sent: 16 October 2013 20:06
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: bit correctness and checksumming
>
> Very interesting link. I don't suppose there is any data available
> separating 4K and 512-byte sectored drives?
>
>
> On 2013-10-16 18:43, Tim Bell wrote:
> > At CERN, we have had cases in the past of silent corruptions. It is
> > good to be able to identify the devices causing them and swap them
> > out.
> >
> > It's an old presentation but the concepts are still relevant today ...
> > http://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
> >
> > Tim
> >
> >
> >> -----Original Message-----
> >> From: ceph-users-bounces@xxxxxxxxxxxxxx
> >> [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> >> james@xxxxxxxxxxxx
> >> Sent: 16 October 2013 18:54
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: Re: bit correctness and checksumming
> >>
> >>
> >> Does Ceph log corrected (or caught) silent corruption anywhere? It
> >> would be interesting to know how much of a problem this is in a
> >> large-scale deployment. Something to gather in the league table
> >> mentioned at the London Ceph day?
> >>
> >> Just thinking out loud (please shout me down...) - if the FS itself
> >> performs its own ECC, the ATA streaming command set might be of use
> >> to avoid performance degradation due to drive-level recovery
> >> altogether.
> >>
> >>
> >> On 2013-10-16 17:12, Sage Weil wrote:
> >> > On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
> >> >> Hi all,
> >> >> There has been some confusion the past couple of days at the CHEP
> >> >> conference during conversations about Ceph and protection from bit
> >> >> flips or other subtle data corruption. Can someone please summarise
> >> >> the current state of data integrity protection in Ceph, assuming we
> >> >> have an XFS backend filesystem? I.e. don't rely on the protection
> >> >> offered by btrfs. I saw in the docs that wire messages and journal
> >> >> writes are CRC'd, but nothing explicit about the objects themselves.
> >> >
> >> > - Everything that passes over the wire is checksummed (crc32c). This
> >> > is mainly because the TCP checksum is so weak.
> >> >
> >> > - The journal entries have a crc.
> >> >
> >> > - During deep scrub, we read the objects and metadata, calculate a
> >> > crc32c, and compare across replicas. This detects missing objects,
> >> > bitrot, failing disks, or any other source of inconsistency.
> >> >
> >> > - Ceph does not calculate and store a per-object checksum. Doing so
> >> > is difficult because rados allows arbitrary overwrites of parts of
> >> > an object.
> >> >
> >> > - Ceph *does* have a new opportunistic checksum feature, which is
> >> > currently only enabled in QA. It calculates and stores checksums on
> >> > whatever block size you configure (e.g., 64k) if/when we
> >> > write/overwrite a complete block, and will verify any complete block
> >> > read against the stored crc, if one happens to be available. This
> >> > can help catch some but not all sources of corruption.
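As a rough illustration of the block-aligned scheme Sage describes: a crc
is recorded only when a write covers a whole configured block, a partial
overwrite discards it, and a read is verified only when it spans a block
whose crc happens to be stored. This is just a sketch, not Ceph's actual
implementation; the names are invented, and zlib.crc32 stands in for the
crc32c (Castagnoli) that Ceph really uses.

    import zlib

    BLOCK_SIZE = 64 * 1024   # the configurable block size, e.g. 64k

    def crc(data):
        # zlib.crc32 is plain CRC-32; Ceph uses crc32c, which would need
        # a third-party module.
        return zlib.crc32(data) & 0xffffffff

    class OpportunisticChecksums:
        """Keep a crc only for blocks that were written in their entirety."""

        def __init__(self):
            self.block_crcs = {}   # block index -> stored crc

        def note_write(self, offset, data):
            if not data:
                return
            end = offset + len(data)
            for b in range(offset // BLOCK_SIZE, (end - 1) // BLOCK_SIZE + 1):
                b_start, b_end = b * BLOCK_SIZE, (b + 1) * BLOCK_SIZE
                if offset <= b_start and end >= b_end:
                    # The write covers this whole block: record its crc.
                    self.block_crcs[b] = crc(data[b_start - offset:b_end - offset])
                else:
                    # Partial overwrite: forget whatever we knew about it.
                    self.block_crcs.pop(b, None)

        def verify_read(self, offset, data):
            end = offset + len(data)
            for b, stored in self.block_crcs.items():
                b_start, b_end = b * BLOCK_SIZE, (b + 1) * BLOCK_SIZE
                if offset <= b_start and end >= b_end:
                    # A complete block with a stored crc: check it.
                    if crc(data[b_start - offset:b_end - offset]) != stored:
                        raise IOError("crc mismatch in block %d" % b)

An aligned 64k write records a crc; a later 4k overwrite inside that block
drops it again, which is why this catches some but not all corruption.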
> >> >> We also have some specific questions:
> >> >>
> >> >> 1. Is an object checksum stored on the OSD somewhere? Is this in
> >> >> user.ceph._, because it wasn't obvious when looking at the code?
> >> >
> >> > No (except for the new/experimental opportunistic crc I mention
> >> > above).
> >> >
> >> >> 2. When is the checksum verified? Surely it is checked during the
> >> >> deep scrub, but what about during an object read?
> >> >
> >> > For non-btrfs, there is no crc to verify. For btrfs, the fs has its
> >> > own crc and verifies it.
> >> >
> >> >> 2b. Can a user read corrupted data if the master replica has a bit
> >> >> flip but this hasn't yet been found by a deep scrub?
> >> >
> >> > Yes.
> >> >
> >> >> 3. During deep scrub of an object with 2 replicas, suppose the
> >> >> checksum is different for the two objects -- which object wins?
> >> >> (I.e. if you store the checksum locally, this is trivial since the
> >> >> consistency of objects can be evaluated locally. Without the local
> >> >> checksum, you can have conflicts.)
> >> >
> >> > In this case we normally choose the primary. The repair has to be
> >> > explicitly triggered by the admin, however, and there are some
> >> > options to control that choice.
> >> >
> >> >> 4. If the checksum is already stored per object in the OSD, is this
> >> >> retrievable by librados? We have some applications which also need
> >> >> to know the checksum of the data, and this would be handy if it was
> >> >> already calculated by Ceph.
> >> >
> >> > It would! It may be that the way to get there is to build an API to
> >> > expose the opportunistic checksums, and/or to extend that feature to
> >> > maintain full checksums (by re-reading partially overwritten blocks
> >> > on write). (Note, however, that even this wouldn't cover xattrs and
> >> > omap content; really this is something that "should" be handled by
> >> > the backend storage/file system.)
> >> >
> >> > sage
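Until such an API exists, an application that needs a checksum of its data
can compute one client-side and stash it next to the object, for example in
an xattr, via the librados Python bindings. A minimal sketch, assuming a
pool named 'data' and made-up object/xattr names, with zlib.crc32 again
standing in for crc32c; note it only verifies what RADOS hands back to the
client, not the replicas on disk, and says nothing about xattr or omap
content:

    import zlib
    import rados   # python-rados bindings shipped with Ceph

    def crc(data):
        return zlib.crc32(data) & 0xffffffff

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')          # pool name is an example
    try:
        payload = b'some application data'

        # Write the object and store the application's own checksum beside it.
        ioctx.write_full('myobject', payload)
        ioctx.set_xattr('myobject', 'app.crc32', str(crc(payload)).encode())

        # Later: read the whole object back and verify it against the xattr.
        size, _mtime = ioctx.stat('myobject')
        data = ioctx.read('myobject', size, 0)
        stored = int(ioctx.get_xattr('myobject', 'app.crc32'))
        if crc(data) != stored:
            raise IOError('checksum mismatch on myobject')
    finally:
        ioctx.close()
        cluster.shutdown()

Reading the whole object in one call obviously only suits modest object
sizes; larger objects would need to be read and checksummed in chunks.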