-----Original Message-----
From: ceph-users-bounces@xxxxxxxxxxxxxx
[mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
james@xxxxxxxxxxxx
Sent: 16 October 2013 18:54
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: bit correctness and checksumming
Does Ceph log corrected (or caught) silent corruption anywhere? It
would be interesting to know how much of a problem this is in a
large-scale deployment. Something to gather in the league table
mentioned at the London Ceph day?
Just thinking out loud (please shout me down...): if the FS itself
performs its own ECC, the ATA streaming command set might be of use to
avoid the performance degradation caused by drive-level recovery.
On 2013-10-16 17:12, Sage Weil wrote:
> On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
>> Hi all,
>> There has been some confusion the past couple days at the CHEP
>> conference during conversations about Ceph and protection from bit
>> flips or other subtle data corruption. Can someone please summarise
>> the current state of data integrity protection in Ceph, assuming we
>> have an XFS backend filesystem? ie. don't rely on the protection
>> offered by btrfs. I saw in the docs that wire messages and journal
>> writes are CRC'd, but nothing explicit about the objects themselves.
>
> - Everything that passes over the wire is checksummed (crc32c). This
> is mainly because the TCP checksum is so weak.
>
> - The journal entries have a crc.
>
> - During deep scrub, we read the objects and metadata, calculate a
> crc32c, and compare across replicas. This detects missing objects,
> bitrot, failing disks, or any other source of inconsistency.
>
> - Ceph does not calculate and store a per-object checksum. Doing so
> is difficult because rados allows arbitrary overwrites of parts of an
> object.
>
> - Ceph *does* have a new opportunistic checksum feature, which is
> currently only enabled in QA. It calculates and stores checksums on
> whatever block size you configure (e.g., 64k) if/when we
> write/overwrite a complete block, and will verify any complete block
> read against the stored crc, if one happens to be available. This can
> help catch some but not all sources of corruption.
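
The opportunistic scheme described in the last bullet can be sketched
roughly as follows. This is only an illustration, not Ceph's code: the
class and method names are hypothetical, the object is a plain in-memory
buffer, and zlib.crc32 stands in for crc32c, which is not in the Python
standard library.

```python
import zlib


class OpportunisticStore:
    """Toy object store: keeps a checksum only for blocks written whole.

    Hypothetical illustration; zlib.crc32 is a stand-in for crc32c.
    """

    def __init__(self, block_size=64 * 1024):
        self.block_size = block_size
        self.data = bytearray()
        self.crcs = {}  # block index -> crc of that complete block

    def write(self, offset, buf):
        if not buf:
            return
        end = offset + len(buf)
        if end > len(self.data):
            # Zero-fill so the buffer covers the written range.
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = buf
        first = offset // self.block_size
        last = (end - 1) // self.block_size
        for idx in range(first, last + 1):
            start = idx * self.block_size
            stop = start + self.block_size
            if offset <= start and stop <= end:
                # The write covered the whole block: store a fresh crc.
                self.crcs[idx] = zlib.crc32(bytes(self.data[start:stop]))
            else:
                # Partial overwrite: any stored crc is stale, so drop it.
                self.crcs.pop(idx, None)

    def read_block(self, idx):
        start = idx * self.block_size
        block = bytes(self.data[start:start + self.block_size])
        stored = self.crcs.get(idx)
        # Verify only if a crc happens to be available for this block.
        if stored is not None and zlib.crc32(block) != stored:
            raise IOError(f"checksum mismatch in block {idx}")
        return block
```

A block that has only ever been partially written has no stored crc, so
corruption in it passes through read_block unnoticed, which is why this
catches some but not all corruption.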
>
>> We also have some specific questions:
>>
>> 1. Is an object checksum stored on the OSD somewhere? Is this in
>> user.ceph._, because it wasn't obvious when looking at the code?
>
> No (except for the new/experimental opportunistic crc I mention
> above).
>
>> 2. When is the checksum verified? Surely it is checked during the
>> deep scrub, but what about during an object read?
>
> For non-btrfs, no crc to verify. For btrfs, the fs has its own crc
> and verifies it.
>
>> 2b. Can a user read corrupted data if the master replica has a bit
>> flip but this hasn't yet been found by a deep scrub?
>
> Yes.
>
>> 3. During deep scrub of an object with 2 replicas, suppose the
>> checksum is different for the two objects -- which object wins?
>> (I.e. if you store the checksum locally, this is trivial since the
>> consistency of objects can be evaluated locally. Without the local
>> checksum, you can have conflicts.)
>
> In this case we normally choose the primary. The repair has to be
> explicitly triggered by the admin, however, and there are some
> options to control that choice.
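
To make the conflict in question 3 concrete, here is a minimal sketch of
a scrub-style comparison followed by a repair that trusts the primary.
The function names are hypothetical, replicas are plain byte strings with
index 0 as the primary, and zlib.crc32 again stands in for crc32c; it is
not how the OSD implements scrub.

```python
import zlib


def deep_scrub(replicas):
    """Return indexes of replicas whose digest differs from the primary's.

    replicas[0] is treated as the primary; a missing copy is None.
    """
    digests = [None if r is None else zlib.crc32(r) for r in replicas]
    return [i for i, d in enumerate(digests) if d != digests[0]]


def repair_from_primary(replicas):
    """Overwrite every inconsistent replica with the primary's copy."""
    for i in deep_scrub(replicas):
        replicas[i] = replicas[0]
    return replicas


good = b"hello"
flipped = b"hellp"            # one corrupted byte
copies = [flipped, good]      # the primary holds the bad copy
print(deep_scrub(copies))     # the scrub sees that the two copies differ
repair_from_primary(copies)   # ...but the repair keeps the primary's data
```

With only two copies and no locally stored checksum, the comparison shows
that the replicas disagree but not which one is correct, so a repair that
picks the primary can propagate the corrupted copy.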
>
>> 4. If the checksum is already stored per object in the OSD, is this
>> retrievable by librados? We have some applications which also need
>> to know the checksum of the data and this would be handy if it was
>> already calculated by Ceph.
>
> It would! It may be that the way to get there is to build an API to
> expose the opportunistic checksums, and/or to extend that feature to
> maintain full checksums (by re-reading partially overwritten blocks
> on write). (Note, however, that even this wouldn't cover xattrs and
> omap content; really this is something that "should" be handled by
> the backend storage/file system.)
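
The "full checksums" extension mentioned above might look something like
the sketch below: after a partial overwrite, each touched block is read
back in full and its crc recomputed, so every block always has a current
checksum. The function name and the in-memory bytearray/dict layout are
assumptions for illustration, and zlib.crc32 stands in for crc32c.

```python
import zlib


def write_with_full_checksums(data, crcs, offset, buf, block_size=64 * 1024):
    """Apply a write, then re-read and re-checksum every affected block.

    data: bytearray holding the object; crcs: dict of block index -> crc.
    """
    if not buf:
        return
    old_len = len(data)
    end = offset + len(buf)
    if end > old_len:
        # Zero-fill any gap between the old end and the new data.
        data.extend(b"\0" * (end - old_len))
    data[offset:end] = buf
    # Start at the old tail block if the object grew, since its length changed.
    first = min(offset, old_len) // block_size
    last = (end - 1) // block_size
    for idx in range(first, last + 1):
        start = idx * block_size
        # Re-read the whole block (the extra read is the cost of this scheme).
        crcs[idx] = zlib.crc32(bytes(data[start:start + block_size]))
```

The price is an extra read on every partial overwrite, and as noted in
the reply, this still covers only object data, not xattrs or omap content.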
>
> sage
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com