2015-11-11 19:44 GMT+08:00 kefu chai <tchaikov@xxxxxxxxx>:
> currently, scrub and repair are pretty primitive. there are several
> improvements which need to be made:
>
> - user should be able to initialize scrub of a PG or an object
>     - int scrub(pg_t, AioCompletion*)
>     - int scrub(const string& pool, const string& nspace,
>                 const string& locator, const string& oid,
>                 AioCompletion*)
> - we need a way to query the result of the most recent scrub on a pg.
>     - int get_inconsistent_pools(set<uint64_t>* pools);
>     - int get_inconsistent_pgs(uint64_t pool, paged<pg_t>* pgs);
>     - int get_inconsistent(pg_t pgid, epoch_t* cur_interval,
>                            paged<inconsistent_t>*)
> - the user should be able to query the content of the replica/shard
>   objects in the event of an inconsistency.
>     - operate_on_shard(epoch_t interval, pg_shard_t pg_shard,
>                        ObjectReadOperation *op, bool allow_inconsistent)
> - the user should be able to perform following fixes using a new
>   aio_operate_scrub(
>       const std::string& oid,
>       shard_id_t shard,
>       AioCompletion *c,
>       ObjectWriteOperation *op)
>     - specify which replica to use for repairing a content inconsistency
>     - delete an object if it can't exist
>     - write_full
>     - omap_set
>     - setattrs
> - the user should be able to repair snapset and object_info_t
>     - ObjectWriteOperation::repair_snapset(...)
>         - set/remove any property/attributes, for example,
>             - to reset snapset.clone_overlap
>             - to set snapset.clone_size
>             - to reset the digests in object_info_t,
> - repair will create a new version so that possibly corrupted copies
>   on down OSDs will get fixed naturally.
>

I think this exposes too much to the user. Users usually don't have
this kind of knowledge. If we make it too complicated, no one will end
up using it.

> so librados will offer enough information and facilities, with which a
> smart librados client/script will be able to fix the inconsistencies
> found in the scrub.
>
> as an example, if we run into a data inconsistency where the 3
> replicas failed to agree with each other after performing a deep
> scrub. probably we'd like to have an election to get the auth copy.
> following pseudo code explains how we will implement this using the
> new rados APIs for scrub and repair.
>
>     # something is not necessarily better than nothing
>     rados.aio_scrub(pg, completion)
>     completion.wait_for_complete()
>     for pool in rados.get_inconsistent_pools():
>         for pg in rados.get_inconsistent_pgs(pool):
>             # rados.get_inconsistent_pgs() throws if "epoch" expires
>
>             for oid, inconsistent in rados.get_inconsistent_pgs(pg, epoch).items():
>                 if inconsistent.is_data_digest_mismatch():
>                     votes = defaultdict(int)
>                     for osd, shard_info in inconsistent.shards.items():
>                         votes[shard_info.object_info.data_digest] += 1
>                     digest, _ = max(votes.items(), key=operator.itemgetter(1))
>                     auth_copy = None
>                     for osd, shard_info in inconsistent.shards.items():
>                         if shard_info.object_info.data_digest == digest:
>                             auth_copy = osd
>                             break
>                     repair_op = librados.ObjectWriteOperation()
>                     repair_op.repair_pick(auth_copy, inconsistent.ver, epoch)
>                     rados.aio_operate_scrub(oid, repair_op)
>
> this plan was also discussed in the infernalis CDS. see
> http://tracker.ceph.com/projects/ceph/wiki/Osd_-_Scrub_and_Repair.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
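
For illustration only, the election step from the quoted pseudocode can be sketched as standalone Python that does not touch librados at all. The function name pick_auth_copy and its plain-dict input (OSD id mapped to reported data digest) are hypothetical and are not part of the proposed API. This sketch also assumes a strict majority is required, which is stricter than the max() in the pseudocode: a 1-1-1 split returns None instead of picking an arbitrary replica.

```python
from collections import Counter
from typing import Mapping, Optional


def pick_auth_copy(shard_digests: Mapping[int, int]) -> Optional[int]:
    """Return the id of an OSD whose digest is held by a strict majority.

    shard_digests is a hypothetical stand-in for inconsistent.shards:
    it maps an OSD id to the data digest that replica reported.
    Returns None when there are no shards or no digest has a majority.
    """
    if not shard_digests:
        return None

    # Count how many replicas reported each digest.
    votes = Counter(shard_digests.values())
    digest, count = votes.most_common(1)[0]

    # Assumption: demand a strict majority so that a full disagreement
    # (for example three different digests) is not "repaired" by chance.
    if count * 2 <= len(shard_digests):
        return None

    # Any replica carrying the winning digest can serve as the auth copy;
    # take the first one in iteration order.
    for osd, shard_digest in shard_digests.items():
        if shard_digest == digest:
            return osd
    return None


if __name__ == "__main__":
    # Two of three replicas agree, so one of them is chosen.
    print(pick_auth_copy({0: 0xAA, 1: 0xBB, 2: 0xAA}))
    # All three disagree, so nothing is chosen.
    print(pick_auth_copy({0: 0x01, 1: 0x02, 2: 0x03}))
```

Returning None on a tie leaves the decision to the caller, who may want to fall back to another signal (such as object_info_t) or to the operator rather than repair from a guess.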