By the way, according to our test, since the modified range is recorded neither in the cache tier nor in the base tier, I think proxying the "list-snaps" op down to the base tier might not work either. Is that right?

On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
> Sorry, I didn't "reply all"....:-)
>
>
> ---------- Forwarded message ----------
> From: Xuehan Xu <xxhdx1985126@xxxxxxxxx>
> Date: 9 August 2017 at 12:14
> Subject: Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"
> To: dillaman@xxxxxxxxxx
>
>
> Um, no, I don't think they are related.
>
> My problem is this:
>
> At first, we tried to use "rbd export-diff" to do incremental backups on a Jewel-version Ceph cluster with cache tiering enabled. However, when we compared the original rbd image with the backup image, we found that they were different. After a series of debugging sessions, we found the cause: a WRITE op's "modified_range" is not subtracted from the clone overlap of the target object's HEAD version when that HEAD version is in the cache tier and its most recent clone object is not, which leads to a miscalculation in the "calc_snap_set_diff" function.
>
> For example, we ran the following test. We first created a snapshot of an rbd image "test.2.img" whose size is only 4MB, which means it contains only one object. Then we sent a series of AioWrites to "test.2.img" to promote its HEAD object into the cache tier, while leaving its clone object in the base tier only. At that point we used "ceph-objectstore-tool" to dump the object we wrote to, and the result was as follows:
>
> {
>   "id": { "oid": "rbd_data.2aae62ae8944a.0000000000000000", "key": "", "snapid": -2,
>           "hash": 2375431681, "max": 0, "pool": 4, "namespace": "", "max": 0 },
>   "info": {
>     "oid": { "oid": "rbd_data.2aae62ae8944a.0000000000000000", "key": "", "snapid": -2,
>              "hash": 2375431681, "max": 0, "pool": 4, "namespace": "" },
>     "version": "4536'2728", "prior_version": "4536'2727",
>     "last_reqid": "client.174858.0:10", "user_version": 14706,
>     "size": 68,
>     "mtime": "2017-08-09 11:30:18.263983", "local_mtime": "2017-08-09 11:30:18.264310",
>     "lost": 0, "flags": 4, "snaps": [],
>     "truncate_seq": 0, "truncate_size": 0,
>     "data_digest": 4294967295, "omap_digest": 4294967295,
>     "watchers": {}
>   },
>   "stat": { "size": 68, "blksize": 4096, "blocks": 16, "nlink": 1 },
>   "SnapSet": {
>     "snap_context": { "seq": 28, "snaps": [ 28 ] },
>     "head_exists": 1,
>     "clones": [ { "snap": 28, "size": 68, "overlap": "[0~64]" } ]
>   }
> }
>
> In this result, the overlap for clone "28" is [0~64]. So we then sent an AioWrite req to this object, writing 4 bytes of random data at offset 32, after which we expected the overlap to shrink to [0~32,36~28] (i.e. bytes 0-31 and 36-63).
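>
> As a quick sanity check on that arithmetic, the little standalone program below models the subtraction we expected to happen to the clone overlap. It uses a trivial map-based stand-in rather than Ceph's interval_set, so it is illustrative only:
>
> #include <cstdint>
> #include <iostream>
> #include <map>
>
> using Intervals = std::map<uint64_t, uint64_t>;  // offset -> length
>
> // Remove the written range [off, off+len) from every interval in 'set',
> // i.e. the effect we expected the write to have on the clone overlap.
> Intervals subtract(const Intervals& set, uint64_t off, uint64_t len) {
>   Intervals out;
>   const uint64_t end = off + len;
>   for (auto [s, l] : set) {
>     const uint64_t e = s + l;
>     if (e <= off || s >= end) {       // interval untouched by the write
>       out[s] = l;
>       continue;
>     }
>     if (s < off) out[s] = off - s;    // piece left of the written range
>     if (e > end) out[end] = e - end;  // piece right of the written range
>   }
>   return out;
> }
>
> int main() {
>   Intervals overlap{{0, 64}};                  // clone 28's overlap "[0~64]"
>   Intervals after = subtract(overlap, 32, 4);  // the 4-byte write at offset 32
>   for (auto [off, len] : after)
>     std::cout << off << "~" << len << " ";     // prints: 0~32 36~28
>   std::cout << "\n";
> }
>
> That is, after the write the clone should only share bytes 0-31 and 36-63 with the HEAD.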
> However, the actual result was as follows:
>
> {
>   "id": { "oid": "rbd_data.2aae62ae8944a.0000000000000000", "key": "", "snapid": -2,
>           "hash": 2375431681, "max": 0, "pool": 4, "namespace": "", "max": 0 },
>   "info": {
>     "oid": { "oid": "rbd_data.2aae62ae8944a.0000000000000000", "key": "", "snapid": -2,
>              "hash": 2375431681, "max": 0, "pool": 4, "namespace": "" },
>     "version": "4546'2730", "prior_version": "4538'2729",
>     "last_reqid": "client.155555.0:10", "user_version": 14708,
>     "size": 4096,
>     "mtime": "2017-08-09 11:36:20.717910", "local_mtime": "2017-08-09 11:36:20.719039",
>     "lost": 0, "flags": 4, "snaps": [],
>     "truncate_seq": 0, "truncate_size": 0,
>     "data_digest": 4294967295, "omap_digest": 4294967295,
>     "watchers": {}
>   },
>   "stat": { "size": 4096, "blksize": 4096, "blocks": 16, "nlink": 1 },
>   "SnapSet": {
>     "snap_context": { "seq": 28, "snaps": [ 28 ] },
>     "head_exists": 1,
>     "clones": [ { "snap": 28, "size": 68, "overlap": "[0~64]" } ]
>   }
> }
>
> Obviously, the overlap didn't change at all. If we do an export-diff under these circumstances, the result will be wrong. Meanwhile, a "ceph-objectstore-tool" dump in the base tier showed that the overlap recorded there didn't change either:
>
> {
>   "id": { "oid": "rbd_data.2aae62ae8944a.0000000000000000", "key": "", "snapid": -2,
>           "hash": 2375431681, "max": 0, "pool": 3, "namespace": "", "max": 0 },
>   "info": {
>     "oid": { "oid": "rbd_data.2aae62ae8944a.0000000000000000", "key": "", "snapid": -2,
>              "hash": 2375431681, "max": 0, "pool": 3, "namespace": "" },
>     "version": "4536'14459", "prior_version": "4536'14458",
>     "last_reqid": "client.174834.0:10", "user_version": 14648,
>     "size": 68,
>     "mtime": "2017-08-09 11:30:01.412634", "local_mtime": "2017-08-09 11:30:01.413614",
>     "lost": 0, "flags": 36, "snaps": [],
>     "truncate_seq": 0, "truncate_size": 0,
>     "data_digest": 4294967295, "omap_digest": 4294967295,
>     "watchers": {}
>   },
>   "stat": { "size": 68, "blksize": 4096, "blocks": 16, "nlink": 1 },
>   "SnapSet": {
>     "snap_context": { "seq": 28, "snaps": [ 28 ] },
>     "head_exists": 1,
>     "clones": [ { "snap": 28, "size": 68, "overlap": "[0~64]" } ]
>   }
> }
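>
> To make concrete how that stale overlap breaks export-diff, here is a small standalone sketch of the idea that the diff is derived from the clone overlap: everything in the HEAD that the overlap does not cover is treated as changed since the snapshot. This is only a simplified model of what calc_snap_set_diff ends up computing, not the real code, and it uses the 68-byte sizes from the first dump to keep the numbers small:
>
> #include <cstdint>
> #include <iostream>
> #include <map>
>
> using Intervals = std::map<uint64_t, uint64_t>;  // offset -> length
>
> // Report everything in [0, head_size) that 'overlap' does NOT cover.
> Intervals diff_from_overlap(uint64_t head_size, const Intervals& overlap) {
>   Intervals diff;
>   uint64_t pos = 0;
>   for (auto [off, len] : overlap) {
>     if (off > pos)
>       diff[pos] = off - pos;      // gap between overlap chunks has changed
>     pos = off + len;
>   }
>   if (pos < head_size)
>     diff[pos] = head_size - pos;  // tail past the last overlap chunk
>   return diff;
> }
>
> static void print(const Intervals& s) {
>   for (auto [off, len] : s) std::cout << off << "~" << len << " ";
>   std::cout << "\n";
> }
>
> int main() {
>   // Correct bookkeeping: overlap [0~32,36~28] -> the 4-byte write at
>   // offset 32 shows up in the diff, along with the tail 64~4.
>   print(diff_from_overlap(68, {{0, 32}, {36, 28}}));  // 32~4 64~4
>   // Stale bookkeeping (what we observed): overlap stays [0~64] -> the
>   // diff is only 64~4, and the write at offset 32 is silently dropped.
>   print(diff_from_overlap(68, {{0, 64}}));            // 64~4
> }
>
> With the stale overlap, the 4-byte write at offset 32 never shows up in the diff, which matches the mismatch we saw between the original image and the backup image.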
>
> Then we turned to the source code to find the reason. We found that, in the ReplicatedPG::make_writeable method, the write op's modified range is subtracted from the clone overlap only after two condition checks pass: "ctx->new_snapset.clones.size() > 0" and "is_present_clone(last_clone_oid)", as the code below shows.
>
> // update most recent clone_overlap and usage stats
> if (ctx->new_snapset.clones.size() > 0) {
>   /* we need to check whether the most recent clone exists; if it's been evicted,
>    * it's not included in the stats */
>   hobject_t last_clone_oid = soid;
>   last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>   if (is_present_clone(last_clone_oid)) {
>     interval_set<uint64_t> &newest_overlap =
>       ctx->new_snapset.clone_overlap.rbegin()->second;
>     ctx->modified_ranges.intersection_of(newest_overlap);
>     // modified_ranges is still in use by the clone
>     add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>     newest_overlap.subtract(ctx->modified_ranges);
>   }
> }
>
> We thought that the second check, "is_present_clone(last_clone_oid)", is not a reasonable basis for deciding whether to subtract the write op's modified range from the clone overlap, so we changed the code above to the following, which moves the subtraction out of that check, and submitted a PR for it: https://github.com/ceph/ceph/pull/16790:
>
> // update most recent clone_overlap and usage stats
> if (ctx->new_snapset.clones.size() > 0) {
>   /* we need to check whether the most recent clone exists; if it's been evicted,
>    * it's not included in the stats */
>   hobject_t last_clone_oid = soid;
>   last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>   interval_set<uint64_t> &newest_overlap =
>     ctx->new_snapset.clone_overlap.rbegin()->second;
>   ctx->modified_ranges.intersection_of(newest_overlap);
>   newest_overlap.subtract(ctx->modified_ranges);
>
>   if (is_present_clone(last_clone_oid)) {
>     // modified_ranges is still in use by the clone
>     add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>   }
> }
>
> In our test, this change solved the problem successfully; however, we can't confirm that it won't cause any new problems. So here we are, discussing how to solve this problem:-)
>
> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>
>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>> OK, I'll try that. Thank you:-)
>>>
>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>>> Could you just proxy the "list snaps" op from the cache tier down to the base tier and combine the cache tier + base tier results? Reading the associated ticket, it seems kludgy to me to attempt to work around this within librbd by just refusing to provide intra-object diffs if cache tiering is in use.
>>>>
>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>> Hi, everyone.
>>>>>
>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I just did another test: I did some writes to an object "rbd_data.1ebc6238e1f29.0000000000000000" to promote its HEAD object into the cache tier, after which I specifically wrote 4 bytes of random data at offset 0x40.
>>>>> Then I used "ceph-objectstore-tool" to dump its HEAD version in the base tier; the result is as follows (before I promoted it to the cache tier there were three snaps, the latest of which is 26):
>>>>>
>>>>> {
>>>>>   "id": { "oid": "rbd_data.1ebc6238e1f29.0000000000000000", "key": "", "snapid": -2,
>>>>>           "hash": 1655893237, "max": 0, "pool": 3, "namespace": "", "max": 0 },
>>>>>   "info": {
>>>>>     "oid": { "oid": "rbd_data.1ebc6238e1f29.0000000000000000", "key": "", "snapid": -2,
>>>>>              "hash": 1655893237, "max": 0, "pool": 3, "namespace": "" },
>>>>>     "version": "4219'16423", "prior_version": "3978'16310",
>>>>>     "last_reqid": "osd.70.4213:2359", "user_version": 17205,
>>>>>     "size": 4194304,
>>>>>     "mtime": "2017-08-03 22:07:34.656122", "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>     "lost": 0, "flags": 52, "snaps": [],
>>>>>     "truncate_seq": 0, "truncate_size": 0,
>>>>>     "data_digest": 2822203961, "omap_digest": 4294967295,
>>>>>     "watchers": {}
>>>>>   },
>>>>>   "stat": { "size": 4194304, "blksize": 4096, "blocks": 8200, "nlink": 1 },
>>>>>   "SnapSet": {
>>>>>     "snap_context": { "seq": 26, "snaps": [ 26, 25, 16 ] },
>>>>>     "head_exists": 1,
>>>>>     "clones": [
>>>>>       { "snap": 16, "size": 4194304, "overlap": "[4~4194300]" },
>>>>>       { "snap": 25, "size": 4194304, "overlap": "[]" },
>>>>>       { "snap": 26, "size": 4194304, "overlap": "[]" }
>>>>>     ]
>>>>>   }
>>>>> }
>>>>>
>>>>> As we can see, its clone_overlap for snap 26 is empty, which, combined with the previous test described in http://tracker.ceph.com/issues/20896, means that the write's "modified range" is recorded neither in the cache tier nor in the base tier.
>>>>>
>>>>> I think maybe we really should move the clone overlap modification out of the IF block guarded by the "is_present_clone" check. For now, I can't see any other way to fix this problem.
>>>>>
>>>>> Am I right about this?
>>>>>
>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>>> I mean I think it's the "is_present_clone" condition check that prevents the clone overlap from recording the modified range of client write operations when the target "HEAD" object exists without its most recent clone object. If I'm right, simply moving the clone overlap modification out of the "is_present_clone" condition block is enough to solve this case, just like the PR "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause other problems.
>>>>>>
>>>>>> In our test, this fix solved the problem successfully; however, we can't confirm yet that it won't cause new problems.
>>>>>>
>>>>>> So if anyone sees this and knows the answer, please help us. Thank you:-)
>>>>>>
>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>>>> Hi, grep:-)
>>>>>>>
>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>
>>>>>>> I agree with you that "clone overlap is supposed to be tracking which data is the same on disk".
>>>>>>>
>>>>>>> My thought is that "ObjectContext::new_snapset.clones" is already an indicator of whether there are clone objects on disk. So, in the cache-tier scenario, although a clone oid may not correspond to a "present clone" in the cache tier, as long as "ObjectContext::new_snapset.clones" is not empty there must be such a clone object in the base tier. And as long as "ObjectContext::new_snapset.clones" has a strict one-to-one correspondence with "ObjectContext::new_snapset.clone_overlap", passing the condition check "if (ctx->new_snapset.clones.size() > 0)" is enough to conclude that the clone object exists.
>>>>>>>
>>>>>>> So, if I'm right, passing the condition check "if (ctx->new_snapset.clones.size() > 0)" is already enough for us to do "newest_overlap.subtract(ctx->modified_ranges)"; it doesn't also have to pass "is_present_clone".
>>>>>>>
>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>
>>>>>>> Please help us, thank you:-)
>>>>
>>>> --
>>>> Jason
>>
>> --
>> Jason