Re: new scrub and repair discussion

How would this work for an ec pool?  Maybe the osd argument should be
a set of valid peers?
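E.g., something like this (purely illustrative):

    int repair_copy(const std::string& oid, uint64_t version,
                    uint32_t what, const std::set<int32_t>& peers,
                    uint32_t epoch);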
-Sam

On Thu, May 19, 2016 at 6:09 AM, kefu chai <tchaikov@xxxxxxxxx> wrote:
> hi cephers,
>
> I'd like to keep you guys posted on the progress of the scrub/repair
> feature, and I'd also love to get your valuable comments/suggestions
> on it! Right now, I am working on the repair-write API for the
> scrub/repair feature.
>
> the API looks like:
>
>     /**
>      * Rewrite the object with the replica hosted by the specified OSD
>      *
>      * @param oid the object to repair
>      * @param version the expected version of the object being rewritten
>      * @param what flags indicating what we will copy
>      * @param osd the OSD from which we will copy the data
>      * @param epoch the epoch indicating the scrub interval
>      */
>     int repair_copy(const std::string& oid, uint64_t version,
>                     uint32_t what, int32_t osd, uint32_t epoch);
>
> in which,
> - `version` is the version of the object you expect to be repairing,
> to guard against a racing write;
> - `what` is an OR'ed combination of the flags in the following enum;
> - `epoch`, like in the other scrub/repair APIs, is the epoch
> indicating the scrub interval.
>
> struct repair_copy_t {
>   enum {
>     DATA = 1 << 0,
>     OMAP = 1 << 1,
>     ATTR = 1 << 2,
>   };
> };
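>
> as a usage sketch (assuming the call lands on librados::IoCtx, and
> that `version` and `epoch` come from an earlier inconsistency query;
> the error handling here is illustrative only, not part of the
> proposal):
>
>     // repair "foo" by copying its data and omap, but not its xattrs,
>     // from the replica hosted on osd.2
>     uint32_t what = repair_copy_t::DATA | repair_copy_t::OMAP;
>     int r = ioctx.repair_copy("foo", version, what, 2, epoch);
>     if (r < 0) {
>       // e.g. a racing write bumped the object past "version", so the
>       // repair was rejected; re-scrub and retry with fresh values
>     }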
>
> a new REPAIR_COPY OSD op is introduced to enable the OSD side to copy
> the shard/replica from the specified source OSD to the acting set, and
> the machinery of copy_from is reused to implement this feature. after
> rewriting the object, its version is increased, so that possibly
> corrupted copies on down OSDs will get fixed naturally.
>
> for the code, see
> - https://github.com/ceph/ceph/pull/9203
>
> for the draft design, see
> - http://tracker.ceph.com/issues/13508
> - http://pad.ceph.com/p/scrub_repair
>
> the API for fixing snapset will be added later.
>
> On Wed, Nov 11, 2015 at 11:43 PM, kefu chai <tchaikov@xxxxxxxxx> wrote:
>> On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@xxxxxxxxx> wrote:
>>> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@xxxxxxxxx>:
>>>> currently, scrub and repair are pretty primitive. there are several
>>>> improvements which need to be made:
>>>>
>>>> - the user should be able to initiate a scrub of a PG or an object
>>>>     - int scrub(pg_t, AioCompletion*)
>>>>     - int scrub(const string& pool, const string& nspace,
>>>>                 const string& locator, const string& oid,
>>>>                 AioCompletion*)
>>>> - we need a way to query the result of the most recent scrub on a pg.
>>>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>>>                            paged<inconsistent_t>*)
>>>> - the user should be able to query the content of the replica/shard
>>>> objects in the event of an inconsistency.
>>>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>>>                        ObjectReadOperation *op, bool allow_inconsistent)
>>>> - the user should be able to perform the following fixes using a new
>>>>   aio_operate_scrub(const std::string& oid,
>>>>                     shard_id_t shard,
>>>>                     AioCompletion *c,
>>>>                     ObjectWriteOperation *op)
>>>>     - specify which replica to use for repairing a content inconsistency
>>>>     - delete an object if it shouldn't exist
>>>>     - write_full
>>>>     - omap_set
>>>>     - setattrs
>>>> - the user should be able to repair snapset and object_info_t
>>>>     - ObjectWriteOperation::repair_snapset(...)
>>>>         - set/remove any properties/attributes, for example,
>>>>             - to reset snapset.clone_overlap
>>>>             - to set snapset.clone_size
>>>>             - to reset the digests in object_info_t,
>>>> - repair will create a new version so that possibly corrupted copies
>>>> on down OSDs will get fixed naturally.
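>>>>
>>>> as a hedged sketch of driving the snapset repair (assuming
>>>> aio_operate_scrub lands on IoCtx; the repair_snapset setter names
>>>> below are placeholders, not a committed API):
>>>>
>>>>     librados::ObjectWriteOperation op;
>>>>     // placeholder setters: rebuild the snapset fields that the
>>>>     // scrub reported as inconsistent
>>>>     op.repair_snapset_reset_clone_overlap(clone_id);
>>>>     op.repair_snapset_set_clone_size(clone_id, clone_size);
>>>>     // shard picks which copy of the object the op is applied to
>>>>     ioctx.aio_operate_scrub(oid, shard, completion, &op);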
>>>>
>>>
>>> I think this exposes too many things to the user. Usually a user
>>> doesn't have knowledge like this. If we make it too complicated,
>>> no one will use it in the end.
>>
>> well, i tend to agree with you to some degree. this is a set of very
>> low-level APIs exposed to the user, but we will accompany them with
>> some ready-to-use policies to repair the typical inconsistencies, like
>> the sample code attached at the end of this mail. but the point here
>> is that we will not burden the OSD daemon with all of the complicated
>> logic to fix and repair things, and instead let the magic happen
>> outside of ceph-osd in a more flexible way. advanced users who want to
>> explore the possibilities of fixing the inconsistencies in their own
>> way won't be disappointed either.
>>
>>>
>>>> so librados will offer enough information and facilities, with which a
>>>> smart librados client/script will be able to fix the inconsistencies
>>>> found in the scrub.
>>>>
>>>> as an example, suppose we run into a data inconsistency where the 3
>>>> replicas fail to agree with each other after performing a deep
>>>> scrub. we'd probably like to hold an election to pick the auth copy.
>>>> the following pseudo code explains how we would implement this using
>>>> the new rados APIs for scrub and repair.
>>>>
>>>>      import operator
>>>>      from collections import defaultdict
>>>>
>>>>      # something is not necessarily better than nothing
>>>>      rados.aio_scrub(pg, completion)
>>>>      completion.wait_for_complete()
>>>>      for pool in rados.get_inconsistent_pools():
>>>>           for pg in rados.get_inconsistent_pgs(pool):
>>>>                # rados.get_inconsistent() throws if "epoch" expires
>>>>                for oid, inconsistent in rados.get_inconsistent(pg, epoch).items():
>>>>                     if inconsistent.is_data_digest_mismatch():
>>>>                          # tally one vote per shard for each data
>>>>                          # digest seen
>>>>                          votes = defaultdict(int)
>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>                               votes[shard_info.object_info.data_digest] += 1
>>>>                          # the digest with the most votes wins
>>>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>>>                          auth_copy = None
>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>                               if shard_info.object_info.data_digest == digest:
>>>>                                    auth_copy = osd
>>>>                                    break
>>>>                          repair_op = librados.ObjectWriteOperation()
>>>>                          repair_op.repair_pick(auth_copy, inconsistent.ver, epoch)
>>>>                          rados.aio_operate_scrub(oid, repair_op)
>>>>
>>>> this plan was also discussed in the infernalis CDS. see
>>>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>>
>>
>>
>> --
>> Regards
>> Kefu Chai
>
>
>
> --
> Regards
> Kefu Chai


