hi cephers,

I'd like to keep you guys posted on the progress of the scrub/repair
feature, and would also like to get your valuable comments/suggestions on it!

Now, I am working on the repair-write API for the scrub/repair feature.
the API looks like:

/**
 * Rewrite the object with the replica hosted by the specified OSD
 *
 * @param osd from which OSD we will copy the data
 * @param version the version of the rewritten object
 * @param what the flags indicating what we will copy
 */
int repair_copy(const std::string& oid,
                uint64_t version,
                uint32_t what,
                int32_t osd,
                uint32_t epoch);

in which,

- `version` is the version of the object you expect to be repairing, in
  case of a racing write;
- `what` is an OR'ed combination of the flags in the enum below;
- `epoch`, like the other scrub/repair APIs, is the epoch indicating the
  scrub interval, and is passed in by the caller.

struct repair_copy_t {
  enum {
    DATA = 1 << 0,
    OMAP = 1 << 1,
    ATTR = 1 << 2,
  };
};

a new REPAIR_COPY OSD op is introduced to enable the OSD side to copy the
shard/replica from the specified source OSD to the acting set, and the
machinery of copy_from is reused to implement this feature. after rewriting
the object, its version is increased, so that possibly corrupted copies on
down OSDs will get fixed naturally.

for the code, see
- https://github.com/ceph/ceph/pull/9203

for the draft design, see
- http://tracker.ceph.com/issues/13508
- http://pad.ceph.com/p/scrub_repair

the API for fixing snapset will be added later.
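to make the intended usage a bit more concrete, here is a rough sketch of
how a repair tool could drive the new call once it has decided which OSD
holds the good copy. this is only a sketch against the proposed API: the
class the call ends up on (IoCtx below) and the exact binding are
assumptions on my side, not the final interface.

#include <rados/librados.hpp>

// sketch only: assumes repair_copy() is exposed on librados::IoCtx, and
// that the scrub result already told us the object version, the scrub
// epoch, and which OSD holds the authoritative copy.
int repair_object(librados::IoCtx& ioctx,
                  const std::string& oid,
                  uint64_t version,  // version we expect, to catch a racing write
                  int32_t good_osd,  // OSD to copy the shard/replica from
                  uint32_t epoch)    // scrub interval the report belongs to
{
  // repair everything we know how to copy: data, omap, and xattrs
  uint32_t what = repair_copy_t::DATA |
                  repair_copy_t::OMAP |
                  repair_copy_t::ATTR;

  // REPAIR_COPY rewrites the object in the acting set from good_osd's
  // copy; the rewrite bumps the object version, so stale copies on down
  // OSDs get fixed by the normal recovery path later on.
  return ioctx.repair_copy(oid, version, what, good_osd, epoch);
}

in the pseudo code quoted further down, this call would take roughly the
place of repair_pick() once the auth copy has been elected.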
On Wed, Nov 11, 2015 at 11:43 PM, kefu chai <tchaikov@xxxxxxxxx> wrote:
> On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@xxxxxxxxx> wrote:
>> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@xxxxxxxxx>:
>>> currently, scrub and repair are pretty primitive. there are several
>>> improvements which need to be made:
>>>
>>> - user should be able to initialize scrub of a PG or an object
>>>   - int scrub(pg_t, AioCompletion*)
>>>   - int scrub(const string& pool, const string& nspace, const
>>>     string& locator, const string& oid, AioCompletion*)
>>> - we need a way to query the result of the most recent scrub on a pg.
>>>   - int get_inconsistent_pools(set<uint64_t>* pools);
>>>   - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>>   - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>>     paged<inconsistent_t>*)
>>> - the user should be able to query the content of the replica/shard
>>>   objects in the event of an inconsistency.
>>>   - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>>     ObjectReadOperation *op, bool allow_inconsistent)
>>> - the user should be able to perform the following fixes using a new
>>>   aio_operate_scrub(
>>>       const std::string& oid,
>>>       shard_id_t shard,
>>>       AioCompletion *c,
>>>       ObjectWriteOperation *op)
>>>   - specify which replica to use for repairing a content inconsistency
>>>   - delete an object if it can't exist
>>>   - write_full
>>>   - omap_set
>>>   - setattrs
>>> - the user should be able to repair snapset and object_info_t
>>>   - ObjectWriteOperation::repair_snapset(...)
>>>   - set/remove any property/attributes, for example,
>>>     - to reset snapset.clone_overlap
>>>     - to set snapset.clone_size
>>>     - to reset the digests in object_info_t,
>>> - repair will create a new version so that possibly corrupted copies
>>>   on down OSDs will get fixed naturally.
>>>
>>
>> I think this exposes too many things to the user. Usually a user
>> doesn't have knowledge like this. If we make it too complicated,
>> no one will use it in the end.
>
> well, i tend to agree with you to some degree. this is a set of very
> low level APIs exposed to the user, but we will accompany them with
> some ready-to-use policies to repair the typical inconsistencies, like
> the sample code attached at the end of this mail. but the point here is
> that we will not burden the OSD daemon with all of this complicated
> logic to fix and repair things, and will instead let the magic happen
> outside of ceph-osd in a more flexible way. advanced users who want to
> explore the possibilities of fixing the inconsistencies in their own
> way won't be disappointed either.
>
>>
>>> so librados will offer enough information and facilities, with which a
>>> smart librados client/script will be able to fix the inconsistencies
>>> found in the scrub.
>>>
>>> as an example, if we run into a data inconsistency where the 3
>>> replicas fail to agree with each other after performing a deep
>>> scrub, we'd probably like to have an election to get the auth copy.
>>> the following pseudo code explains how we will implement this using
>>> the new rados APIs for scrub and repair.
>>>
>>> # something is not necessarily better than nothing
>>> rados.aio_scrub(pg, completion)
>>> completion.wait_for_complete()
>>> for pool in rados.get_inconsistent_pools():
>>>     for pg in rados.get_inconsistent_pgs(pool):
>>>         # rados.get_inconsistent_pgs() throws if "epoch" expires
>>>         for oid, inconsistent in rados.get_inconsistent_pgs(pg,
>>>                                                             epoch).items():
>>>             if inconsistent.is_data_digest_mismatch():
>>>                 votes = defaultdict(int)
>>>                 for osd, shard_info in inconsistent.shards.items():
>>>                     votes[shard_info.object_info.data_digest] += 1
>>>                 digest, _ = max(votes.items(),
>>>                                 key=operator.itemgetter(1))
>>>                 auth_copy = None
>>>                 for osd, shard_info in inconsistent.shards.items():
>>>                     if shard_info.object_info.data_digest == digest:
>>>                         auth_copy = osd
>>>                         break
>>>                 repair_op = librados.ObjectWriteOperation()
>>>                 repair_op.repair_pick(auth_copy, inconsistent.ver, epoch)
>>>                 rados.aio_operate_scrub(oid, repair_op)
>>>
>>> this plan was also discussed in the infernalis CDS. see
>>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>
> --
> Regards
> Kefu Chai

--
Regards
Kefu Chai