Re: new scrub and repair discussion

On Fri, May 20, 2016 at 1:55 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> How would this work for an ec pool?  Maybe the osd argument should be
> a set of valid peers?

Maybe in the case of an EC pool, we should ignore the osd argument.

<quote from="http://pad.ceph.com/p/scrub_repair">
david points out that for EC pools, don't allow incorrect shards to be
selected as correct
- admin might decide to delete an object if it can't exist
- similar to unfound object case
- repair should have a use-your-best-judgement flag -- mandatory for
ec (if not deleting the object)
- for ec, if we want to read a shard, need to specify the shard id as
well since an osd might have two shards
  - request would indicate the shard
</quote>

Because the EC sub-read does not return the payload if the size or the
digest fails to match (an EIO is returned instead), on the primary
side, ECBackend::handle_sub_read_reply() will send more sub-read ops to
the shard(s) not yet used if it is unable to reconstruct the requested
extent from the shards already returned. If we want to explicitly
exclude the "bad" shards from being used, maybe the simplest way is to
remove them before calling repair-copy, and we could offer an API for
removing a shard. But I doubt that we need to do this, as the chance of
a corrupted shard whose checksum still matches the digest stored in its
object-info xattr is very small. Maybe we can instead fix a corrupted
shard in an EC pool by reading the impacted object and then overwriting
the original copy with the one reconstructed from the good shards, as
sketched below.
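
A minimal sketch of that read-then-rewrite idea, using only existing
librados C++ calls (the function name is made up, and this is only my
assumption of how such a repair could work: it races with in-flight
client writes and assumes the object fits in memory):

    #include <rados/librados.hpp>

    int rewrite_from_good_shards(librados::IoCtx& ioctx,
                                 const std::string& oid)
    {
      uint64_t size;
      time_t mtime;
      int r = ioctx.stat(oid, &size, &mtime);
      if (r < 0)
        return r;
      // the EC backend reconstructs the payload from the good shards,
      // since the corrupted shard fails its digest check and returns
      // EIO to the primary
      librados::bufferlist bl;
      r = ioctx.read(oid, bl, size, 0);
      if (r < 0)
        return r;
      // rewriting bumps the object version and replaces every shard,
      // including the corrupted one
      return ioctx.write_full(oid, bl);
    }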

<quote from="http://pad.ceph.com/p/scrub_repair">
repair (need a flag to aio_operate  write variant to allow overwriting
an unfound object) (needs to bypass snapshotting) (allow to write to a
clone?) (require x cap bit?)
 delete
 writefull ...
 omap_set_...
 setattrs ...
</quote>

Do we still need the REPAIR_WRITE flag for overwriting an unfound
object? I removed an object from the OSD's store directory, and
repair-copy did fix it for me. Or maybe I am misunderstanding this
line...

> -Sam
>
> On Thu, May 19, 2016 at 6:09 AM, kefu chai <tchaikov@xxxxxxxxx> wrote:
>> hi cephers,
>>
>> I'd like to keep you guys posted on the progress of the scrub/repair
>> feature, and I'd also like to get your valuable comments/suggestions
>> on it! Right now, I am working on the repair-write API for the
>> scrub/repair feature.
>>
>> the API looks like:
>>
>>     /**
>>      * Rewrite the object with the replica hosted by the specified OSD
>>      *
>>      * @param osd the OSD from which we will copy the data
>>      * @param version the expected version of the object being rewritten
>>      * @param what the flags indicating what we will copy
>>      * @param epoch the epoch indicating the scrub interval
>>      */
>>     int repair_copy(const std::string& oid, uint64_t version,
>>                     uint32_t what, int32_t osd, uint32_t epoch);
>>
>> in which,
>> - `version` is the version of the object you expect to be repairing,
>> in case of a racing write;
>> - `what` is an OR'ed combination of the flags in the following enum;
>> - `epoch`, like in the other scrub/repair APIs, is the epoch
>> indicating the scrub interval.
>>
>> struct repair_copy_t {
>>   enum {
>>     DATA = 1 << 0,
>>     OMAP = 1 << 1,
>>     ATTR = 1 << 2,
>>   };
>> };
>>
>> a new REPAIR_COPY OSD op is introduced to enable the OSD side to copy
>> the shard/replica from the specified source OSD to the acting set,
>> and the machinery of copy_from is reused to implement this feature.
>> after rewriting the object, its version is increased, so that
>> possibly corrupt copies on down OSDs will get fixed naturally.
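>>
>> To make the intended usage concrete, here is a rough sketch (only my
>> assumption of how it would be called: it assumes repair_copy() lands
>> on librados::IoCtx like the other proposed scrub/repair calls, and
>> the helper function is made up for illustration):
>>
>>     #include <rados/librados.hpp>
>>
>>     // repair the data and xattrs of `oid` from the copy on
>>     // `good_osd`, leaving the omap untouched
>>     int repair_from(librados::IoCtx& ioctx, const std::string& oid,
>>                     uint64_t version, int32_t good_osd,
>>                     uint32_t epoch)
>>     {
>>       uint32_t what = repair_copy_t::DATA | repair_copy_t::ATTR;
>>       // expected to fail if the object was updated past `version`
>>       // by a racing write, or if `epoch` no longer matches the
>>       // current scrub interval
>>       return ioctx.repair_copy(oid, version, what, good_osd, epoch);
>>     }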
>>
>> for the code, see
>> - https://github.com/ceph/ceph/pull/9203
>>
>> for the draft design, see
>> - http://tracker.ceph.com/issues/13508
>> - http://pad.ceph.com/p/scrub_repair
>>
>> the API for fixing snapset will be added later.
>>
>> On Wed, Nov 11, 2015 at 11:43 PM, kefu chai <tchaikov@xxxxxxxxx> wrote:
>>> On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@xxxxxxxxx> wrote:
>>>> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@xxxxxxxxx>:
>>>>> currently, scrub and repair are pretty primitive. there are several
>>>>> improvements which need to be made:
>>>>>
>>>>> - the user should be able to initiate a scrub of a PG or an object
>>>>>     - int scrub(pg_t, AioCompletion*)
>>>>>     - int scrub(const string& pool, const string& nspace, const
>>>>> string& locator, const string& oid, AioCompletion*)
>>>>> - we need a way to query the result of the most recent scrub on a pg.
>>>>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>>>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>>>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>>>> paged<inconsistent_t>*)
>>>>> - the user should be able to query the content of the replica/shard
>>>>> objects in the event of an inconsistency (see the sketch after this
>>>>> list)
>>>>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>>>> ObjectReadOperation *op, bool allow_inconsistent)
>>>>> - the user should be able to perform the following fixes using a new
>>>>> aio_operate_scrub(
>>>>>                                           const std::string& oid,
>>>>>                                           shard_id_t shard,
>>>>>                                           AioCompletion *c,
>>>>>                                           ObjectWriteOperation *op)
>>>>>     - specify which replica to use for repairing a content inconsistency
>>>>>     - delete an object if it can't exist
>>>>>     - write_full
>>>>>     - omap_set
>>>>>     - setattrs
>>>>> - the user should be able to repair snapset and object_info_t
>>>>>     - ObjectWriteOperation::repair_snapset(...)
>>>>>         - set/remove any property/attributes, for example,
>>>>>             - to reset snapset.clone_overlap
>>>>>             - to set snapset.clone_size
>>>>>             - to reset the digests in object_info_t,
>>>>> - repair will create a new version so that possibly corrupted copies
>>>>> on down OSDs will get fixed naturally.
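>>>>>
>>>>> as a concrete example of querying a replica's content, reading what
>>>>> one shard actually holds via operate_on_shard() might look roughly
>>>>> like this (just a sketch; only the operate_on_shard() signature
>>>>> above comes from the proposal, and it assumes epoch_t/pg_shard_t
>>>>> get exposed through librados -- everything else is made up):
>>>>>
>>>>>      // fetch the first `len` bytes of this shard's copy so the
>>>>>      // client can inspect what the replica actually stores
>>>>>      int inspect_shard(librados::IoCtx& ioctx, epoch_t interval,
>>>>>                        pg_shard_t pg_shard, uint64_t len)
>>>>>      {
>>>>>        librados::ObjectReadOperation read_op;
>>>>>        librados::bufferlist bl;
>>>>>        int rval = 0;
>>>>>        read_op.read(0, len, &bl, &rval);
>>>>>        // allow_inconsistent=true: read the copy even though the
>>>>>        // object is flagged as inconsistent
>>>>>        return ioctx.operate_on_shard(interval, pg_shard, &read_op,
>>>>>                                      true);
>>>>>      }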
>>>>>
>>>>
>>>> I think this exposes too many things to the user. Usually a user
>>>> doesn't have knowledge like this. If we make it too complicated,
>>>> no one will use it in the end.
>>>
>>> Well, I tend to agree with you to some degree. This is a set of very
>>> low-level APIs exposed to the user, but we will accompany them with
>>> some ready-to-use policies to repair the typical inconsistencies,
>>> like the sample code attached at the end of this mail. The point here
>>> is that we will not burden the OSD daemon with all of the complicated
>>> logic to fix and repair things, and will instead let the magic happen
>>> outside of ceph-osd in a more flexible way. And advanced users who
>>> want to explore ways of fixing the inconsistencies on their own won't
>>> be disappointed either.
>>>
>>>>
>>>>> so librados will offer enough information and facilities with
>>>>> which a smart librados client/script will be able to fix the
>>>>> inconsistencies found by scrub.
>>>>>
>>>>> as an example, suppose we run into a data inconsistency where the 3
>>>>> replicas fail to agree with each other after performing a deep
>>>>> scrub, and we'd like to hold an election to pick the auth copy. the
>>>>> following pseudo code shows how we could implement this using the
>>>>> new rados APIs for scrub and repair.
>>>>>
>>>>>      # something is not necessarily better than nothing
>>>>>      import operator
>>>>>      from collections import defaultdict
>>>>>
>>>>>      rados.aio_scrub(pg, completion)
>>>>>      completion.wait_for_complete()
>>>>>      for pool in rados.get_inconsistent_pools():
>>>>>           for pg in rados.get_inconsistent_pgs(pool):
>>>>>                # rados.get_inconsistent() throws if "epoch" expires
>>>>>                for oid, inconsistent in rados.get_inconsistent(pg, epoch).items():
>>>>>                     if inconsistent.is_data_digest_mismatch():
>>>>>                          # each shard votes with the data digest
>>>>>                          # recorded in its object info
>>>>>                          votes = defaultdict(int)
>>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>>                               votes[shard_info.object_info.data_digest] += 1
>>>>>                          # the digest with the most votes wins
>>>>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>>>>                          # pick an OSD holding the winning digest
>>>>>                          # as the authoritative copy
>>>>>                          auth_copy = None
>>>>>                          for osd, shard_info in inconsistent.shards.items():
>>>>>                               if shard_info.object_info.data_digest == digest:
>>>>>                                    auth_copy = osd
>>>>>                                    break
>>>>>                          repair_op = librados.ObjectWriteOperation()
>>>>>                          repair_op.repair_pick(auth_copy, inconsistent.ver, epoch)
>>>>>                          rados.aio_operate_scrub(oid, repair_op)
>>>>>
>>>>> this plan was also discussed in the infernalis CDS. see
>>>>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>>>
>>>
>>>
>>> --
>>> Regards
>>> Kefu Chai
>>
>>
>>
>> --
>> Regards
>> Kefu Chai



-- 
Regards
Kefu Chai