Re: new scrub and repair discussion

On Thu, May 26, 2016 at 1:37 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> On Fri, May 20, 2016 at 4:30 AM, kefu chai <tchaikov@xxxxxxxxx> wrote:
>> On Fri, May 20, 2016 at 1:55 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>>> How would this work for an ec pool?  Maybe the osd argument should be
>>> a set of valid peers?
>>
>> maybe in the case of ec pool, we should ignore the osd argument.
>>
>> <quote from="http://pad.ceph.com/p/scrub_repair">
>> david points out that for EC pools, don't allow incorrect shards to be
>> selected as correct
>> - admin might decide to delete an object if it can't exist
>> - similar to unfound object case
>> - repair should have a use-your-best-judgement flag -- mandatory for
>> ec (if not deleting the object)
>> - for ec, if we want to read a shard, need to specify the shard id as
>> well since an osd might have two shards
>>   - request would indicate the shard
>> </quote>
>>
>> because the EC subread does not return the payload if the size or the
>> digest fails to match; an EIO is returned instead. on the primary
>> side, ECBackend::handle_sub_read_reply() will send more subread ops to
>> the shard(s) not yet used if it is unable to reconstruct the
>> requested extent from the shards already returned. if we want to
>> explicitly exclude the "bad" shards from being used, maybe the
>> simplest way is to remove them before calling repair-copy, and we
>> could offer an API for removing a shard. but I doubt that we need to
>> do this, as the chance of a corrupted shard whose checksum still
>> matches the digest stored in its object-info xattr is very small. and
>> maybe we fix a corrupted shard in an EC pool by reading the impacted
>> object and then overwriting the original copy with the one
>> reconstructed from the good shards.
>
> I think it would be simpler to just allow the repair_write call to
> specify a set of bad shards.  For replicated pools, we simply choose
> one which is not in that set.  For EC pools, we use that information
> to avoid bad shards.

for the replicated pool, we will choose a random replica from the acting
set, as long as it is not listed in the blacklist, as the authoritative
copy for the repair. see the sketch below.
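
to make this concrete, here is a minimal sketch of the replica
selection. the names (pick_repair_source(), bad_osds) are hypothetical,
not the actual repair_write interface; it only illustrates the "choose
any replica not in the bad set" rule:

  #include <optional>
  #include <random>
  #include <set>
  #include <vector>

  // hypothetical helper: choose a random member of the acting set
  // that the caller has not blacklisted as a bad copy.
  std::optional<int> pick_repair_source(
      const std::vector<int>& acting,   // acting set (osd ids)
      const std::set<int>& bad_osds)    // blacklist passed to repair_write
  {
    std::vector<int> candidates;
    for (int osd : acting) {
      if (bad_osds.count(osd) == 0)
        candidates.push_back(osd);
    }
    if (candidates.empty())
      return std::nullopt;  // every replica is blacklisted: cannot repair
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<size_t> pick(0, candidates.size() - 1);
    return candidates[pick(rng)];
  }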

for the EC pool, the OSD does not return shards with a wrong digest at
all when handling subread requests, so we should ignore the digest
mismatch error when reading shards for the repair. because:
 - we mark a shard inconsistent if its digest does not match the one
stored in the shard's hash_info (please note that each shard has the
digests of all shards of that object).
 - the user could put even a shard reported as consistent by
list-inconsistent-obj into the blacklist. it's a little bit scary
though.
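
the digest check itself could look like the following sketch. HashInfo
here is a simplified stand-in for the per-shard hash_info xattr, and
crc32_of() is a plain CRC-32 used in place of whatever digest the real
code computes; both are illustrative assumptions, not the actual
ECBackend code:

  #include <cstdint>
  #include <map>
  #include <vector>

  // simplified model of hash_info: each shard carries the digests of
  // all shards of the object, so any surviving copy can be used to
  // check any shard.
  struct HashInfo {
    std::vector<uint32_t> crcs;  // one digest per shard id
  };

  // stand-in digest for the sketch (bitwise CRC-32).
  uint32_t crc32_of(const std::vector<uint8_t>& data) {
    uint32_t crc = 0xffffffffu;
    for (uint8_t b : data) {
      crc ^= b;
      for (int i = 0; i < 8; ++i)
        crc = (crc >> 1) ^ (0xedb88320u & (0u - (crc & 1u)));
    }
    return ~crc;
  }

  // mark each shard consistent or inconsistent against hash_info.
  // during a repair read we would skip acting on a mismatch (i.e.
  // ignore the digest error) instead of replying with EIO.
  std::map<int, bool> check_shards(
      const std::map<int, std::vector<uint8_t>>& shards,  // id -> payload
      const HashInfo& hinfo)
  {
    std::map<int, bool> result;
    for (const auto& [shard_id, data] : shards)
      result[shard_id] = (crc32_of(data) == hinfo.crcs.at(shard_id));
    return result;
  }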

-- 
Regards
Kefu Chai