How would this work for an ec pool?  Maybe the osd argument should be a
set of valid peers?
-Sam

On Thu, May 19, 2016 at 6:09 AM, kefu chai <tchaikov@xxxxxxxxx> wrote:
> hi cephers,
>
> I'd like to keep you guys posted on the progress of the scrub/repair
> feature, and I'd also like to get your valuable comments/suggestions
> on it! Now, I am working on the repair-write API for the scrub/repair
> feature.
>
> the API looks like:
>
> /**
>  * Rewrite the object with the replica hosted by the specified osd
>  *
>  * @param osd from which OSD we will copy the data
>  * @param version the version of the rewritten object
>  * @param what the flags indicating what we will copy
>  */
> int repair_copy(const std::string& oid, uint64_t version,
>                 uint32_t what, int32_t osd, uint32_t epoch);
>
> in which,
> - `version` is the version of the object you expect to be repairing,
>   in case of a racing write;
> - `what` is an OR'ed combination of the flags in the enum below;
> - `epoch`, like in the other scrub/repair APIs, indicates the scrub
>   interval.
>
> struct repair_copy_t {
>   enum {
>     DATA = 1 << 0,
>     OMAP = 1 << 1,
>     ATTR = 1 << 2,
>   };
> };
>
> a new REPAIR_COPY OSD op is introduced to let the OSD side copy the
> shard/replica from the specified source OSD to the acting set, and
> the machinery of copy_from is reused to implement this. after
> rewriting the object, its version is increased, so that possibly
> corrupt copies on down OSDs will get fixed naturally.
>
> for the code, see
> - https://github.com/ceph/ceph/pull/9203
>
> for the draft design, see
> - http://tracker.ceph.com/issues/13508
> - http://pad.ceph.com/p/scrub_repair
>
> the API for fixing snapset will be added later.
>
> On Wed, Nov 11, 2015 at 11:43 PM, kefu chai <tchaikov@xxxxxxxxx> wrote:
>> On Wed, Nov 11, 2015 at 10:43 PM, 王志强 <wonzhq@xxxxxxxxx> wrote:
>>> 2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@xxxxxxxxx>:
>>>> currently, scrub and repair are pretty primitive. there are
>>>> several improvements which need to be made:
>>>>
>>>> - the user should be able to initiate a scrub of a PG or an object
>>>>   - int scrub(pg_t, AioCompletion*)
>>>>   - int scrub(const string& pool, const string& nspace,
>>>>               const string& locator, const string& oid,
>>>>               AioCompletion*)
>>>> - we need a way to query the result of the most recent scrub on a pg
>>>>   - int get_inconsistent_pools(set<uint64_t>* pools);
>>>>   - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>>>>   - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>>>>                          paged<inconsistent_t>*)
>>>> - the user should be able to query the content of the replica/shard
>>>>   objects in the event of an inconsistency
>>>>   - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>>>>                      ObjectReadOperation *op, bool allow_inconsistent)
>>>> - the user should be able to perform the following fixes using a new
>>>>   aio_operate_scrub(
>>>>       const std::string& oid,
>>>>       shard_id_t shard,
>>>>       AioCompletion *c,
>>>>       ObjectWriteOperation *op)
>>>>   - specify which replica to use for repairing a content inconsistency
>>>>   - delete an object if it should not exist
>>>>   - write_full
>>>>   - omap_set
>>>>   - setattrs
>>>> - the user should be able to repair snapset and object_info_t
>>>>   - ObjectWriteOperation::repair_snapset(...)
>>>>   - set/remove any property/attributes, for example,
>>>>     - to reset snapset.clone_overlap
>>>>     - to set snapset.clone_size
>>>>     - to reset the digests in object_info_t
>>>> - repair will create a new version so that possibly corrupted copies
>>>>   on down OSDs will get fixed naturally.
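To make the proposed write path in the list above a little more concrete, here
is a small, purely hypothetical pseudo-code sketch in the same spirit as the
election example later in this mail. It takes the aio_operate_scrub() signature
from the list at face value; the function name, the completion handling and the
remove() repair are only illustrative, not an existing librados scrub/repair
API.

    # hypothetical pseudo code: aio_operate_scrub() is the call proposed in
    # the list above, not an existing librados API; the completion handling
    # mirrors the style of the election example later in this mail.
    def remove_unexpected_object(rados, oid, shard):
        # "delete an object if it should not exist" on one shard
        repair_op = librados.ObjectWriteOperation()
        repair_op.remove()
        completion = librados.AioCompletion()
        rados.aio_operate_scrub(oid, shard, completion, repair_op)
        completion.wait_for_complete()
        return completion.get_return_value()

The same pattern would apply to the other fixes in the list (write_full,
omap_set, setattrs, repair_snapset): build an ObjectWriteOperation describing
the desired end state of the shard and submit it through aio_operate_scrub().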
>>>>
>>>
>>> I think this exposes too many things to the user. Usually a user
>>> doesn't have knowledge like this. If we make it too complicated,
>>> no one will use it in the end.
>>
>> well, i tend to agree with you to some degree. this is a set of very
>> low level APIs exposed to the user, but we will accompany them with
>> some ready-to-use policies for repairing the typical inconsistencies,
>> like the sample code attached at the end of this mail. but the point
>> here is that we will not burden the OSD daemon with all of this
>> complicated logic to fix and repair things; we let the magic happen
>> outside of ceph-osd in a more flexible way. and advanced users who
>> want to explore fixing the inconsistencies in their own way won't be
>> disappointed either.
>>
>>>
>>>> so librados will offer enough information and facilities, with which
>>>> a smart librados client/script will be able to fix the
>>>> inconsistencies found in the scrub.
>>>>
>>>> as an example, if we run into a data inconsistency where the 3
>>>> replicas fail to agree with each other after a deep scrub, we'd
>>>> probably like to hold an election to pick the auth copy. the
>>>> following pseudo code shows how this could be implemented using the
>>>> new rados APIs for scrub and repair.
>>>>
>>>> # something is not necessarily better than nothing
>>>> from collections import defaultdict
>>>> import operator
>>>>
>>>> rados.aio_scrub(pg, completion)
>>>> completion.wait_for_complete()
>>>> for pool in rados.get_inconsistent_pools():
>>>>     for pg in rados.get_inconsistent_pgs(pool):
>>>>         # rados.get_inconsistent() throws if "epoch" expires
>>>>         for oid, inconsistent in rados.get_inconsistent(pg,
>>>>                                                         epoch).items():
>>>>             if inconsistent.is_data_digest_mismatch():
>>>>                 votes = defaultdict(int)
>>>>                 for osd, shard_info in inconsistent.shards.items():
>>>>                     votes[shard_info.object_info.data_digest] += 1
>>>>                 digest, _ = max(votes.items(),
>>>>                                 key=operator.itemgetter(1))
>>>>                 auth_copy = None
>>>>                 for osd, shard_info in inconsistent.shards.items():
>>>>                     if shard_info.object_info.data_digest == digest:
>>>>                         auth_copy = osd
>>>>                         break
>>>>                 repair_op = librados.ObjectWriteOperation()
>>>>                 repair_op.repair_pick(auth_copy,
>>>>                                       inconsistent.ver, epoch)
>>>>                 rados.aio_operate_scrub(oid, repair_op)
>>>>
>>>> this plan was also discussed in the infernalis CDS. see
>>>> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
>>
>>
>> --
>> Regards
>> Kefu Chai
>
>
> --
> Regards
> Kefu Chai
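For completeness, here is a similarly hypothetical sketch of how a repair
script might invoke the repair_copy() call proposed at the top of this thread
once it has chosen an authoritative source OSD, for example with the digest
election above. The Python-level names simply mirror the proposed C++
signature and the repair_copy_t flags; none of this exists in librados today,
and Sam's opening question is precisely about whether the single osd argument
would have to become a set of valid peers for an EC pool.

    # hypothetical pseudo code mirroring the proposed C++ call
    #   int repair_copy(oid, version, what, osd, epoch)
    # and the repair_copy_t flags; not an existing librados API.
    REPAIR_COPY_DATA = 1 << 0
    REPAIR_COPY_OMAP = 1 << 1
    REPAIR_COPY_ATTR = 1 << 2

    def repair_from_auth_copy(rados, oid, inconsistent, auth_osd, epoch):
        # rewrite the object in the acting set from the replica on auth_osd
        what = REPAIR_COPY_DATA | REPAIR_COPY_OMAP | REPAIR_COPY_ATTR
        # inconsistent.ver guards against a racing write: the repair is
        # rejected if the object has been written since the scrub results
        # were read; epoch likewise guards against an expired scrub interval.
        return rados.repair_copy(oid, inconsistent.ver, what, auth_osd, epoch)

Because the rewrite bumps the object's version, copies on OSDs that happen to
be down at repair time are then brought up to date by normal recovery, which
is the "fixed naturally" behaviour described in the May mail.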