Re: new scrub and repair discussion

On Fri, May 20, 2016 at 4:30 AM, kefu chai <tchaikov@xxxxxxxxx> wrote:
> On Fri, May 20, 2016 at 1:55 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> How would this work for an ec pool?  Maybe the osd argument should be
>> a set of valid peers?
>
> maybe in the case of an EC pool, we should ignore the osd argument.
>
> <quote from="http://pad.ceph.com/p/scrub_repair">
> david points out that for EC pools, don't allow incorrect shards to be
> selected as correct
> - admin might decide to delete an object if it can't exist
> - similar to unfound object case
> - repair should have a use-your-best-judgement flag -- mandatory for
> ec (if not deleting the object)
> - for ec, if we want to read a shard, need to specify the shard id as
> well since an osd might have two shards
>   - request would indicate the shard
> </quote>
>
> Because the EC subread does not return the payload when the size or
> the digest check fails (an EIO is returned instead), on the primary
> side ECBackend::handle_sub_read_reply() will send more subread ops to
> the shard(s) not yet used if it cannot reconstruct the requested
> extent from the shards already returned. If we want to explicitly
> exclude the "bad" shards from being used, maybe the simplest way is
> to remove them before calling repair-copy, and we could offer an API
> for removing a shard. But I doubt we need to do this, as the chance
> that a corrupted shard still has a checksum matching the digest
> stored in its object-info xattr is very small. And maybe we can fix a
> corrupted shard in an EC pool by reading the impacted object and then
> overwriting the original copy with the data reconstructed from the
> good shards.
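>
> for instance, something along these lines (just a sketch; it uses the
> aio_operate_scrub() and writefull repair ops proposed in the pad and
> in the older mail below, and ioctx/oid/shard/completion are only
> illustrative names):
>
>     // read the object; the EC backend reconstructs it from good shards
>     uint64_t size = 0;
>     time_t mtime;
>     int r = ioctx.stat(oid, &size, &mtime);
>     if (r < 0)
>       return r;
>     librados::bufferlist bl;
>     r = ioctx.read(oid, bl, size, 0);
>     if (r < 0)
>       return r;
>     // rewriting the whole object re-encodes and replaces the bad shard
>     librados::ObjectWriteOperation repair_op;
>     repair_op.write_full(bl);
>     r = ioctx.aio_operate_scrub(oid, shard, completion, &repair_op);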

I think it would be simpler to just allow the repair_write call to
specify a set of bad shards.  For replicated pools, we simply choose
one which is not in that set.  For EC pools, we use that information
to avoid bad shards.
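Something along these lines, say (just a sketch, not a final
signature; it is your repair_copy() with the single osd argument
replaced by a set of shards to avoid):

    /**
     * Rewrite the object, avoiding the listed bad shards/replicas
     *
     * @param oid the object to repair
     * @param version the expected version, to guard against a racing write
     * @param what OR'ed repair_copy_t flags (DATA | OMAP | ATTR)
     * @param bad_shards shards/replicas known to be corrupted; the primary
     *        picks a source (replicated) or reconstructs (EC) from the rest
     * @param epoch the epoch identifying the scrub interval
     */
    int repair_copy(const std::string& oid, uint64_t version, uint32_t what,
                    const std::set<pg_shard_t>& bad_shards, uint32_t epoch);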

>
> <quote from="http://pad.ceph.com/p/scrub_repair">
> repair (need a flag to aio_operate  write variant to allow overwriting
> an unfound object) (needs to bypass snapshotting) (allow to write to a
> clone?) (require x cap bit?)
>  delete
>  writefull ...
>  omap_set_...
>  setattrs ...
> </quote>
>
> Do we still need the REPAIR_WRITE flag for overwriting an unfound
> object? I removed an object from an OSD's store directory, and
> repair-copy did fix it for me. Or am I misunderstanding this line...
>

It may not be necessary for this mechanism to repair unfound objects;
I merged a new version of that system in the last cycle.  I guess it
depends on what you'd consider most convenient.
-Sam

>> -Sam
>>
>> On Thu, May 19, 2016 at 6:09 AM, kefu chai <tchaikov@xxxxxxxxx> wrote:
>>> hi cephers,
>>>
>>> I'd like to keep you guys posted on the progress of the scrub/repair
>>> feature, and I would also appreciate your valuable comments and
>>> suggestions on it! Right now I am working on the repair-write API for
>>> the scrub/repair feature.
>>>
>>> the API looks like:
>>>
>>>     /**
>>>      * Rewrite the object with the replica hosted by the specified OSD
>>>      *
>>>      * @param osd the OSD from which we will copy the data
>>>      * @param version the expected version of the object to be rewritten
>>>      * @param what flags indicating what we will copy
>>>      * @param epoch the epoch identifying the scrub interval
>>>      */
>>>     int repair_copy(const std::string& oid, uint64_t version,
>>>                     uint32_t what, int32_t osd, uint32_t epoch);
>>>
>>> in which,
>>> - `version` is the version of the object you expect to be repairing,
>>> to guard against a racing write;
>>> - `what` is an OR'ed combination of the flags in the following enum;
>>> - `epoch`, like in the other scrub/repair APIs, is the epoch
>>> identifying the scrub interval.
>>>
>>> struct repair_copy_t {
>>>   enum {
>>>     DATA = 1 << 0,
>>>     OMAP = 1 << 1,
>>>     ATTR = 1 << 2,
>>>   };
>>> };
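>>>
>>> for example, to re-copy the data and the xattrs (but not the omap) of
>>> an object from osd.2, the call would look something like this (just a
>>> sketch: whether repair_copy() ends up as an IoCtx method is not
>>> settled, and oi_version/scrub_epoch stand for the values reported by
>>> the inconsistency-query APIs):
>>>
>>>     uint32_t what = repair_copy_t::DATA | repair_copy_t::ATTR;
>>>     int r = ioctx.repair_copy("foo", oi_version, what, /*osd=*/2,
>>>                               scrub_epoch);
>>>     if (r < 0) {
>>>       // e.g. the object was modified in the meantime, or the scrub
>>>       // interval expired; re-scrub and retry
>>>     }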
>>>
>>> A new REPAIR_COPY OSD op is introduced to enable the OSD side to copy
>>> the shard/replica from the specified source OSD to the acting set; the
>>> machinery of copy_from is reused to implement this feature. After
>>> rewriting the object, its version is increased, so that possibly
>>> corrupted copies on down OSDs will get fixed naturally.
>>>
>>> for the code, see
>>> - https://github.com/ceph/ceph/pull/9203
>>>
>>> for the draft design, see
>>> - http://tracker.ceph.com/issues/13508
>>> - http://pad.ceph.com/p/scrub_repair
>>>
>>> the API for fixing snapset will be added later.
>>>
>>> On Wed, Nov 11, 2015 at 11:43 PM, kefu chai <tchaikov@xxxxxxxxx> wrote:
>>>> On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@xxxxxxxxx> wrote:
>>>>> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@xxxxxxxxx>:
>>>>>> currently, scrub and repair are pretty primitive. there are several
>>>>>> improvements which need to be made:
>>>>>>
>>>>>> - the user should be able to initiate a scrub of a PG or an object
>>>>>>     - int scrub(pg_t, AioCompletion*)
>>>>>>     - int scrub(const string& pool, const string& nspace, const
>>>>>> string& locator, const string& oid, AioCompletion*)
>>>>>> - we need a way to query the result of the most recent scrub on a pg.
>>>>>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>>>>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>>>>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>>>>> paged<inconsistent_t>*)
>>>>>> - the user should be able to query the content of the replica/shard
>>>>>> objects in the event of an inconsistency.
>>>>>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>>>>> ObjectReadOperation *op, bool allow_inconsistent)
>>>>>> - the user should be able to perform the following fixes using a new
>>>>>> aio_operate_scrub(
>>>>>>                                           const std::string& oid,
>>>>>>                                           shard_id_t shard,
>>>>>>                                           AioCompletion *c,
>>>>>>                                           ObjectWriteOperation *op)
>>>>>>     - specify which replica to use for repairing a content inconsistency
>>>>>>     - delete an object if it can't exist
>>>>>>     - write_full
>>>>>>     - omap_set
>>>>>>     - setattrs
>>>>>> - the user should be able to repair snapset and object_info_t
>>>>>>     - ObjectWriteOperation::repair_snapset(...)
>>>>>>         - set/remove any property/attributes, for example,
>>>>>>             - to reset snapset.clone_overlap
>>>>>>             - to set snapset.clone_size
>>>>>>             - to reset the digests in object_info_t,
>>>>>> - repair will create a new version so that possibly corrupted copies
>>>>>> on down OSDs will get fixed naturally.
>>>>>>
>>>>>
>>>>> I think this exposes too many things to the user. Usually a user
>>>>> doesn't have knowledge like this. If we make it too complicated,
>>>>> no one will use it in the end.
>>>>
>>>> Well, I tend to agree with you to some degree. This is a set of very
>>>> low-level APIs exposed to the user, but we will accompany them with
>>>> some ready-to-use policies to repair the typical inconsistencies,
>>>> like the sample code attached at the end of this mail. The point here
>>>> is that we will not burden the OSD daemon with all of this complicated
>>>> logic to fix and repair things; instead the magic happens outside of
>>>> ceph-osd in a more flexible way. And advanced users who want to
>>>> explore ways of fixing the inconsistencies on their own won't be
>>>> disappointed either.
>>>>
>>>>>
>>>>>> so librados will offer enough information and facilities, with which a
>>>>>> smart librados client/script will be able to fix the inconsistencies
>>>>>> found in the scrub.
>>>>>>
>>>>>> As an example, suppose a deep scrub finds a data inconsistency where
>>>>>> the 3 replicas fail to agree with each other; we would probably like
>>>>>> to hold an election to pick the authoritative copy. The following
>>>>>> pseudo code explains how we would implement this using the new rados
>>>>>> APIs for scrub and repair.
>>>>>>
>>>>>>      import operator
>>>>>>      from collections import defaultdict
>>>>>>
>>>>>>      # something is not necessarily better than nothing
>>>>>>      rados.aio_scrub(pg, completion)
>>>>>>      completion.wait_for_complete()
>>>>>>      for pool in rados.get_inconsistent_pools():
>>>>>>           for pg in rados.get_inconsistent_pgs(pool):
>>>>>>                # rados.get_inconsistent() throws if "epoch" expires
>>>>>>
>>>>>>                for oid, inconsistent in rados.get_inconsistent(pg, epoch).items():
>>>>>>                     if inconsistent.is_data_digest_mismatch():
>>>>>>                          votes = defaultdict(int)
>>>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>>>                               votes[shard_info.object_info.data_digest] += 1
>>>>>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>>>>>                          auth_copy = None
>>>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>>>                               if shard_info.object_info.data_digest == digest:
>>>>>>                                    auth_copy = osd
>>>>>>                                    break
>>>>>>                          repair_op = librados.ObjectWriteOperation()
>>>>>>                          repair_op.repair_pick(auth_copy, inconsistent.ver, epoch)
>>>>>>                          rados.aio_operate_scrub(oid, repair_op)
>>>>>>
>>>>>> this plan was also discussed in the infernalis CDS. see
>>>>>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>>>>
>>>>
>>>>
>>>> --
>>>> Regards
>>>> Kefu Chai
>>>
>>>
>>>
>>> --
>>> Regards
>>> Kefu Chai
>
>
>
> --
> Regards
> Kefu Chai


