Re: Copying without crc check when peering may lack reliability

On Fri, Aug 31, 2018 at 2:20 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 30 Aug 2018, Gregory Farnum wrote:
>> On Thu, Aug 23, 2018 at 8:38 AM, poi <poiiiicen@xxxxxxxxx> wrote:
>> > Hello!
>> >
>> > Recently, we did a data migration from one CRUSH root to another,
>> > but afterwards we found that some objects were corrupted and that
>> > their copies on the other OSDs were corrupted as well.
>> >
>> > Eventually we found that, for one PG, the migration uses only a
>> > single OSD's data to generate the three new copies, and does not
>> > check the CRC before migrating, effectively assuming the source
>> > data is always correct (which nobody can actually guarantee). We
>> > tried both FileStore and BlueStore, and the results were the same.
>> > Copying from one PG without a CRC check may lack reliability.
>>
>> Exactly what version are you running, and what backends? Are you
>> actually using BlueStore?
>>
>> This is certainly the general case with replicated pools on FileStore,
>> but it shouldn't happen with BlueStore or EC pools at all. We aren't
>> going to implement "voting" on FileStore-backed OSDs though as that
>> would vastly multiply the cost of backfilling. :(
>
> If I'm understanding, assuming you're running bluestore, the checksums on
> each OSD are (internally) correct (so reads succeed), but the actual data
> stored on the different OSDs is different.  And then you trigger a
> rebalance or recovery.  Is that right?
>
> This is something a deep scrub will normally pick up in due time, but in a
> recovery case, we assume the existing replicas are consistent and base any
> new replica off of the primary's copy.
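
(For illustration, a minimal sketch of the situation Sage describes
above. This is plain Python, not Ceph code, with zlib.crc32 standing in
for crc32c: each OSD's local checksum only covers its own bytes, so
both reads succeed even though the replicas disagree, and only a
cross-replica compare like a deep scrub notices.)

import zlib  # zlib.crc32 as a stand-in for Ceph's crc32c

def local_read(stored_data: bytes, stored_crc: int) -> bytes:
    # Per-OSD read path: verifies only the locally stored checksum.
    assert zlib.crc32(stored_data) == stored_crc, "local EIO"
    return stored_data

# Two replicas that diverged; each one is internally consistent.
replica_a = b"object payload, version A"
replica_b = b"object payload, version B (diverged)"
osds = {
    "osd.0": (replica_a, zlib.crc32(replica_a)),
    "osd.1": (replica_b, zlib.crc32(replica_b)),
}

# Both local reads succeed (no EIO), so recovery would copy either one.
reads = {osd: local_read(data, crc) for osd, (data, crc) in osds.items()}

# Only comparing replicas against each other (what deep scrub does)
# detects the divergence.
digests = {osd: zlib.crc32(data) for osd, data in reads.items()}
if len(set(digests.values())) > 1:
    print("replicas inconsistent:", digests)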

But if we're returning EIO on a regular read of the object, that's
obviously doing some checking we don't do in the backfill or recovery
case. It sounds like it compares the hobject's global checksum to what
it reads, but we don't do that check on recovery.

Which, well, we should! There might be some issues recovering if
there's a failure in that path, but such is life.
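
Roughly the shape of the check I mean (a sketch only, not the actual
OSD code; crc32 stands in for crc32c, and recorded_digest plays the
role of the object's recorded whole-object checksum):

import zlib  # crc32 as a stand-in for crc32c

def verify_recovery_source(data: bytes, recorded_digest: int) -> bytes:
    # Recompute the whole-object digest before pushing this copy anywhere.
    computed = zlib.crc32(data)
    if computed != recorded_digest:
        # The real fix would be to fail the push (or try another replica)
        # instead of propagating the bad copy to the new OSDs.
        raise IOError("digest mismatch: computed=%#x recorded=%#x"
                      % (computed, recorded_digest))
    return data

good = b"object payload"
recorded = zlib.crc32(good)
verify_recovery_source(good, recorded)                 # ok, safe to push
try:
    verify_recovery_source(b"object payl0ad", recorded)  # diverged copy
except IOError as e:
    print("recovery source rejected:", e)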

...but oh, I just realized this is 12.2.5 (aka: the broken one). So I
think this is already resolved in the 12.2.7 fixes? David? I created
http://tracker.ceph.com/issues/35542 before I realized, so we can
either prioritize that or close it as fixed.
-Greg

>
> The way to "fix" that might be to do a deep scrub on an object before it
> is recovered.  I'm pretty sure we don't want to incur that kind of
> overhead (it would slow recovery way down).  And in the recovery case
> where you are down one or more replicas, you're always going to have to
> base the new replica off an existing replica.  In theory, if you're
> making multiple replicas, you could source them from different
> existing copies, but in practice that's inefficient and doesn't align
> with how the OSD implements recovery (everything is push/pull from the
> primary).
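
(To make the cost concrete, here is a toy sketch of the "source from
different copies" / voting idea. It is not anything the OSD implements:
every surviving replica has to be read and digested just to choose a
source, which is exactly the extra backfill overhead in question.)

import zlib
from collections import Counter

def pick_source_by_vote(replicas):
    # Read and digest every surviving copy, then pick a copy that
    # matches the majority digest.
    digests = {osd: zlib.crc32(data) for osd, data in replicas.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return next(osd for osd, d in digests.items() if d == majority_digest)

replicas = {
    "osd.0": b"object payload (diverged)",  # primary holds the bad copy
    "osd.1": b"object payload",
    "osd.2": b"object payload",
}
print("recovery source:", pick_source_by_vote(replicas))  # osd.1 or osd.2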
>
> sage
>
>
>
>> -Greg
>>
>> >
>> > Is there any way to ensure the correctness of the data during
>> > migration? We could do a deep scrub before migrating, but the
>> > cost is too high. I think adding a CRC check for objects before
>> > copying, at peering time, might work.
>> >
>> > Regards
>> >
>> > Poi
>>
>>


