On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@xxxxxxxxx> wrote:
> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@xxxxxxxxx>:
>> currently, scrub and repair are pretty primitive. there are several
>> improvements which need to be made:
>>
>> - the user should be able to initiate a scrub of a PG or an object
>>   - int scrub(pg_t, AioCompletion*)
>>   - int scrub(const string& pool, const string& nspace, const
>>     string& locator, const string& oid, AioCompletion*)
>> - we need a way to query the result of the most recent scrub on a pg
>>   - int get_inconsistent_pools(set<uint64_t>* pools);
>>   - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>   - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>     paged<inconsistent_t>*)
>> - the user should be able to query the content of the replica/shard
>>   objects in the event of an inconsistency
>>   - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>     ObjectReadOperation *op, bool allow_inconsistent)
>> - the user should be able to perform the following fixes using a new
>>   aio_operate_scrub(
>>       const std::string& oid,
>>       shard_id_t shard,
>>       AioCompletion *c,
>>       ObjectWriteOperation *op)
>>   - specify which replica to use for repairing a content inconsistency
>>   - delete an object if it can't exist
>>   - write_full
>>   - omap_set
>>   - setattrs
>> - the user should be able to repair snapset and object_info_t
>>   - ObjectWriteOperation::repair_snapset(...)
>>   - set/remove any property/attributes, for example,
>>     - to reset snapset.clone_overlap
>>     - to set snapset.clone_size
>>     - to reset the digests in object_info_t
>> - repair will create a new version so that possibly corrupted copies
>>   on down OSDs will get fixed naturally.
>>
>
> I think this exposes too many things to the user. Usually a user
> doesn't have knowledge like this. If we make it too complicated,
> no one will use it in the end.

Well, I tend to agree with you to some degree.
This is a set of very low-level APIs exposed to the user, but we will
accompany them with some ready-to-use policies to repair the typical
inconsistencies, like the sample code attached at the end of this
mail. The point here is that we will not burden the OSD daemon with
all of this complicated logic to fix and repair things, and instead
let the magic happen outside of ceph-osd in a more flexible way. And
advanced users who want to explore ways to fix the inconsistencies on
their own won't be disappointed either.

>
>> so librados will offer enough information and facilities, with which a
>> smart librados client/script will be able to fix the inconsistencies
>> found in the scrub.
>>
>> as an example, if we run into a data inconsistency where the 3
>> replicas failed to agree with each other after performing a deep
>> scrub, probably we'd like to have an election to get the auth copy.
>> the following pseudo code explains how we would implement this using
>> the new rados APIs for scrub and repair.
>>
>> # something is not necessarily better than nothing
>> rados.aio_scrub(pg, completion)
>> completion.wait_for_complete()
>> for pool in rados.get_inconsistent_pools():
>>     for pg in rados.get_inconsistent_pgs(pool):
>>         # rados.get_inconsistent() throws if "epoch" expires
>>         for oid, inconsistent in rados.get_inconsistent(pg, epoch).items():
>>             if inconsistent.is_data_digest_mismatch():
>>                 votes = defaultdict(int)
>>                 for osd, shard_info in inconsistent.shards.items():
>>                     votes[shard_info.object_info.data_digest] += 1
>>                 digest, _ = max(votes.items(), key=operator.itemgetter(1))
>>                 auth_copy = None
>>                 for osd, shard_info in inconsistent.shards.items():
>>                     if shard_info.object_info.data_digest == digest:
>>                         auth_copy = osd
>>                         break
>>                 repair_op = librados.ObjectWriteOperation()
>>                 repair_op.repair_pick(auth_copy, inconsistent.ver, epoch)
>>                 rados.aio_operate_scrub(oid, repair_op)
>>
>> this plan was also discussed in the infernalis CDS.
>> see
>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair

--
Regards
Kefu Chai
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
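
[Editor's note: the digest election from the quoted pseudo code can be
sketched as runnable Python. The scrub/repair librados calls are only
proposed APIs and don't exist yet, so this sketch mocks the per-shard
data as a plain {osd_id: data_digest} dict; pick_auth_copy and the
dict layout are illustrative stand-ins, not part of the proposal.]

```python
# Hedged sketch: majority vote over per-shard data digests to pick an
# authoritative copy, as in the pseudo code above. Shard metadata is
# mocked as {osd_id: data_digest}; the real inconsistent_t/shard_info
# structures are part of the proposed (not yet existing) API.
from collections import Counter


def pick_auth_copy(shards):
    """Return the OSD id whose shard carries the digest that the
    majority of shards agree on, or None if there are no shards."""
    if not shards:
        return None
    # count how many shards report each digest
    votes = Counter(shards.values())
    winning_digest, _ = votes.most_common(1)[0]
    # first OSD whose shard carries the winning digest
    for osd, digest in shards.items():
        if digest == winning_digest:
            return osd


# Example: osd.2 disagrees with osd.0 and osd.1, so osd.0 is chosen.
print(pick_auth_copy({0: 0xdeadbeef, 1: 0xdeadbeef, 2: 0x0badf00d}))
```

With an auth copy chosen, the repair itself would then go through the
proposed repair_pick/aio_operate_scrub calls as in the pseudo code.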