Re: new scrub and repair discussion

On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@xxxxxxxxx> wrote:
> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@xxxxxxxxx>:
>> currently, scrub and repair are pretty primitive. there are several
>> improvements which need to be made:
>>
>> - the user should be able to initiate a scrub of a PG or an object
>>     - int scrub(pg_t, AioCompletion*)
>>     - int scrub(const string& pool, const string& nspace, const
>> string& locator, const string& oid, AioCompletion*)
>> - we need a way to query the result of the most recent scrub on a pg.
>>     - int get_inconsistent_pools(set<uint64_t>* pools);
>>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>> paged<inconsistent_t>*)
>> - the user should be able to query the content of the replica/shard
>> objects in the event of an inconsistency.
>>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>> ObjectReadOperation *op, bool allow_inconsistent)
>> - the user should be able to perform the following fixes using a new
>> aio_operate_scrub(
>>                                           const std::string& oid,
>>                                           shard_id_t shard,
>>                                           AioCompletion *c,
>>                                           ObjectWriteOperation *op)
>>     - specify which replica to use for repairing a content inconsistency
>>     - delete an object if it can't exist
>>     - write_full
>>     - omap_set
>>     - setattrs
>> - the user should be able to repair snapset and object_info_t
>>     - ObjectWriteOperation::repair_snapset(...)
>>         - set/remove any property/attributes, for example,
>>             - to reset snapset.clone_overlap
>>             - to set snapset.clone_size
>>             - to reset the digests in object_info_t,
>> - repair will create a new version so that possibly corrupted copies
>> on down OSDs will get fixed naturally.
>>
>
> I think this exposes too many things to the user. Usually a user
> doesn't have knowledge like this. If we make it too complicated,
> no one will use it in the end.

well, i tend to agree with you to some degree. this is a set of very
low-level APIs exposed to the user, but we will accompany them with some
ready-to-use policies to repair the typical inconsistencies, like the
sample code attached at the end of this mail. the point here is that we
will not burden the OSD daemon with all of this complicated logic to
fix and repair things. instead, we let the magic happen outside of
ceph-osd in a more flexible way. and advanced users who want to explore
their own ways of fixing inconsistencies won't be disappointed either.
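
to make the "policy lives outside the OSD" idea concrete, here is a
minimal, self-contained sketch of the voting step on its own. it does not
call librados at all: the function name pick_auth_copy and the plain dict
of shard digests are hypothetical stand-ins for whatever the final API
returns, and refusing to pick a winner on a tie is my own assumption, not
something the proposal specifies.

```python
from collections import Counter


def pick_auth_copy(shard_digests):
    """Pick the shard to repair from by plurality vote on data digests.

    shard_digests is a hypothetical dict mapping an OSD id to the data
    digest that OSD reported for the object. It is NOT a librados type.
    Returns (osd, digest), or None if there is no clear winner.
    """
    if not shard_digests:
        return None

    # Count how many shards reported each digest.
    votes = Counter(shard_digests.values())
    ranked = votes.most_common(2)

    # Assumption: if the two most common digests are tied, do not guess.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None

    digest = ranked[0][0]
    # Use the lowest-numbered OSD holding the winning digest as the source.
    for osd in sorted(shard_digests):
        if shard_digests[osd] == digest:
            return osd, digest


if __name__ == "__main__":
    # Two replicas agree, one differs: an OSD with the majority digest wins.
    print(pick_auth_copy({0: "d1", 1: "d1", 2: "d2"}))
    # All three disagree: no winner, leave it to the operator.
    print(pick_auth_copy({0: "d1", 1: "d2", 2: "d3"}))
```

a client script could run a function like this over each inconsistent
object and only issue a repair when it returns a winner, so the choice of
policy stays in the script rather than in ceph-osd.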

>
>> so librados will offer enough information and facilities, with which a
>> smart librados client/script will be able to fix the inconsistencies
>> found in the scrub.
>>
>> as an example, suppose we run into a data inconsistency where the 3
>> replicas fail to agree with each other after a deep scrub. we'd
>> probably like to hold an election to pick the auth copy. the
>> following pseudo code shows how we would implement this using the
>> new rados APIs for scrub and repair.
>>
>>      # something is not necessarily better than nothing
>>      rados.aio_scrub(pg, completion)
>>      completion.wait_for_complete()
>>      for pool in rados.get_inconsistent_pools():
>>           for pg in rados.get_inconsistent_pgs(pool):
>>                # rados.get_inconsistent() throws if "epoch" expires
>>
>>                for oid, inconsistent in rados.get_inconsistent(pg,
>> epoch).items():
>>                     if inconsistent.is_data_digest_mismatch():
>>                          votes = defaultdict(int)
>>                          for osd, shard_info in inconsistent.shards.items():
>>                               votes[shard_info.object_info.data_digest] += 1
>>                          digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>                          auth_copy = None
>>                          for osd, shard_info in inconsistent.shards.items():
>>                               if shard_info.object_info.data_digest == digest:
>>                                    auth_copy = osd
>>                                    break
>>                          repair_op = librados.ObjectWriteOperation()
>>                          repair_op.repair_pick(auth_copy,
>> inconsistent.ver, epoch)
>>                          rados.aio_operate_scrub(oid, repair_op)
>>
>> this plan was also discussed in the infernalis CDS. see
>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Regards
Kefu Chai


