Oh my bad, 12.2.6 was the one that was broken that way. 12.2.5 just had
a pg log+EC issue.

On Tue, Sep 4, 2018 at 11:28 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Fri, Aug 31, 2018 at 2:20 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Thu, 30 Aug 2018, Gregory Farnum wrote:
>>> On Thu, Aug 23, 2018 at 8:38 AM, poi <poiiiicen@xxxxxxxxx> wrote:
>>> > Hello!
>>> >
>>> > Recently, we did data migration from one crush root to another, but
>>> > after that, we found some objects were wrong and their copies on other
>>> > OSDs were also wrong.
>>> >
>>> > Finally, we found that for one pg, the data migration uses only one
>>> > OSD's data to generate three new copies, and does not check the crc
>>> > before migration, as if assuming the data is always correct (but
>>> > actually nobody can promise that). We tried both filestore and
>>> > bluestore, and the results were the same. Copying from one pg without
>>> > a crc check may lack reliability.
>>>
>>> Exactly what version are you running, and what backends? Are you
>>> actually using BlueStore?
>>>
>>> This is certainly the general case with replicated pools on FileStore,
>>> but it shouldn't happen with BlueStore or EC pools at all. We aren't
>>> going to implement "voting" on FileStore-backed OSDs though, as that
>>> would vastly multiply the cost of backfilling. :(
>>
>> If I'm understanding, assuming you're running bluestore, the checksums on
>> each OSD are (internally) correct (so reads succeed), but the actual data
>> stored on the different OSDs is different. And then you trigger a
>> rebalance or recovery. Is that right?
>>
>> This is something a deep scrub will normally pick up in due time, but in a
>> recovery case, we assume the existing replicas are consistent and base any
>> new replica off of the primary's copy.
>
> But if we're returning EIO on a regular read of the object, we're
> obviously doing some checking we don't do in the backfill or recovery
> case. It sounds like we compare the hobject's global checksum to what
> we read, but we don't check that on recovery.
>
> Which, well, we should! There might be some issues recovering from it if
> there's a failure in that path, but such is life.
>
> ...but oh, I just realized this is 12.2.5 (aka: the broken one). So I
> think this is already resolved in the 12.2.7 fixes? David? I created
> http://tracker.ceph.com/issues/35542 before I realized, so we can
> either prioritize that or close it as fixed.
> -Greg
>
>>
>> The way to "fix" that might be to do a deep scrub on an object before it
>> is recovered. I'm pretty sure we don't want to incur that kind of
>> overhead (it would slow recovery way down). And in the recovery case
>> where you are down one or more replicas, you're always going to have to
>> base the new replica off an existing replica. In theory, if you're
>> making multiple replicas, you could source them from different
>> existing copies, but in practice that's inefficient and doesn't align
>> with how the OSD implements recovery (everything is push/pull from the
>> primary).
>>
>> sage
>>
>>
>>
>>> -Greg
>>>
>>> >
>>> > Is there any way to ensure the correctness of the data during data
>>> > migration? We could do a deep scrub before migration, but the cost
>>> > is too high. I think adding a crc check for objects before copying,
>>> > when peering, might work.
>>> >
>>> > Regards
>>> >
>>> > Poi
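
To make the check being discussed a bit more concrete, here is a minimal
standalone sketch: recompute an object's whole-object digest and compare
it to the digest recorded with its metadata before trusting that copy as a
recovery source (the same comparison that turns a bad client read into
EIO). Everything below is invented for illustration -- the struct, the
function names, and the bitwise crc32c -- it is not Ceph's actual recovery
path or on-disk format.

#include <cstdint>
#include <cstddef>
#include <iostream>
#include <vector>

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78), the same
// checksum family Ceph uses for its per-object data digest.
static uint32_t crc32c(const std::vector<uint8_t>& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (uint8_t byte : data) {
        crc ^= byte;
        for (int k = 0; k < 8; ++k)
            crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : (crc >> 1);
    }
    return ~crc;
}

// Stand-in for one replica of an object plus the digest recorded in its
// metadata when the data was last written (purely illustrative layout).
struct ObjectCopy {
    std::vector<uint8_t> data;
    uint32_t stored_digest;
};

// Recompute the digest and refuse a mismatching copy as a recovery source.
static bool usable_as_recovery_source(const ObjectCopy& copy) {
    return crc32c(copy.data) == copy.stored_digest;
}

int main() {
    ObjectCopy good{{'c', 'e', 'p', 'h'}, 0};
    good.stored_digest = crc32c(good.data);   // consistent replica

    ObjectCopy bad = good;
    bad.data[0] = 'x';                        // silently corrupted payload

    std::cout << "good copy usable: " << usable_as_recovery_source(good) << "\n";  // 1
    std::cout << "bad copy usable:  " << usable_as_recovery_source(bad) << "\n";   // 0
    return 0;
}

On a mismatch, the primary could skip that copy and pick another source
(or surface the error) rather than silently pushing bad data to the new
OSDs -- at the cost of reading and checksumming every object it recovers,
which is the overhead trade-off Sage describes above.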