Is clone_overlap used only for the purpose of calculating the changed
subsets between clones? Or is it also used for other purposes?

On 10 August 2017 at 16:11, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
> By the way, I'm doing these tests on version 10.2.5.
>
> On 10 August 2017 at 15:06, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>> Hi, Jason.
>>
>> I did a test, and it turned out that, after flushing the object out of
>> the cache tier, the clone overlap in the base tier changed to empty, as
>> shown below. I think this may be because the flush operation just marks
>> the whole range of the object being flushed as modified, so if the
>> object's size hasn't changed, the overlap becomes empty. Is this right?
>>
>> Thank you:-)
>>
>> {
>>     "id": {
>>         "oid": "test.obj",
>>         "key": "",
>>         "snapid": -2,
>>         "hash": 3575411564,
>>         "max": 0,
>>         "pool": 10,
>>         "namespace": "",
>>         "max": 0
>>     },
>>     "info": {
>>         "oid": {
>>             "oid": "test.obj",
>>             "key": "",
>>             "snapid": -2,
>>             "hash": 3575411564,
>>             "max": 0,
>>             "pool": 10,
>>             "namespace": ""
>>         },
>>         "version": "4876'9",
>>         "prior_version": "4854'8",
>>         "last_reqid": "osd.35.4869:1",
>>         "user_version": 16,
>>         "size": 4194303,
>>         "mtime": "2017-08-10 14:54:56.087387",
>>         "local_mtime": "2017-08-10 14:59:15.252755",
>>         "lost": 0,
>>         "flags": 52,
>>         "snaps": [],
>>         "truncate_seq": 0,
>>         "truncate_size": 0,
>>         "data_digest": 2827420887,
>>         "omap_digest": 4294967295,
>>         "watchers": {}
>>     },
>>     "stat": {
>>         "size": 4194303,
>>         "blksize": 4096,
>>         "blocks": 8200,
>>         "nlink": 1
>>     },
>>     "SnapSet": {
>>         "snap_context": {
>>             "seq": 3,
>>             "snaps": [
>>                 3
>>             ]
>>         },
>>         "head_exists": 1,
>>         "clones": [
>>             {
>>                 "snap": 3,
>>                 "size": 4194303,
>>                 "overlap": "[]"
>>             }
>>         ]
>>     }
>> }
>>
>> On 9 August 2017 at 23:26, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>> If you flush the object out of the cache tier so that its changes are
>>> recorded in the base tier, is the overlap correctly recorded in the
>>> base tier?
>>>
>>> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>> By the way, according to our test, since the modified range is not
>>>> recorded either in the cache tier or in the base tier, I think
>>>> proxying the "list-snaps" op down to the base tier might not work
>>>> either. Is that right?
>>>>
>>>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>> Sorry, I didn't "reply all"....:-)
>>>>>
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Xuehan Xu <xxhdx1985126@xxxxxxxxx>
>>>>> Date: 9 August 2017 at 12:14
>>>>> Subject: Re: About the problem "export_diff relies on clone_overlap,
>>>>> which is lost when cache tier is enabled"
>>>>> To: dillaman@xxxxxxxxxx
>>>>>
>>>>>
>>>>> Um, no, I don't think they are related.
>>>>>
>>>>> My problem is this:
>>>>>
>>>>> At first, we tried to use "rbd export-diff" to do incremental backups
>>>>> on a Jewel-version ceph cluster with cache tiering enabled. However,
>>>>> when we compared the original rbd image and the backup rbd image, we
>>>>> found that they were different. After a series of debugging, we found
>>>>> that this is because a WRITE op's "modified_range" is not subtracted
>>>>> from the clone overlap of its target object's HEAD version when that
>>>>> object's HEAD version is in the cache tier and its most recent clone
>>>>> object is not, which leads to a miscalculation in the
>>>>> "calc_snap_set_diff" function.
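>>>>>
>>>>> On a write to the HEAD, the newest clone's overlap is supposed to
>>>>> shrink by the write's modified range. As a minimal standalone
>>>>> illustration of that bookkeeping (a simplified stand-in for Ceph's
>>>>> interval_set<uint64_t>, not the real class):
>>>>>
>>>>> #include <cstdint>
>>>>> #include <iostream>
>>>>> #include <map>
>>>>>
>>>>> // Extents kept as offset -> length, sorted and non-overlapping.
>>>>> using Intervals = std::map<uint64_t, uint64_t>;
>>>>>
>>>>> // Remove [off, off+len) from every extent in 's', as the newest
>>>>> // clone's overlap should shrink when the HEAD is written.
>>>>> static void subtract(Intervals &s, uint64_t off, uint64_t len) {
>>>>>   Intervals out;
>>>>>   const uint64_t cut_end = off + len;
>>>>>   for (auto [o, l] : s) {
>>>>>     const uint64_t end = o + l;
>>>>>     if (cut_end <= o || off >= end) {  // no intersection: keep as-is
>>>>>       out[o] = l;
>>>>>       continue;
>>>>>     }
>>>>>     if (off > o)
>>>>>       out[o] = off - o;              // left remainder survives
>>>>>     if (cut_end < end)
>>>>>       out[cut_end] = end - cut_end;  // right remainder survives
>>>>>   }
>>>>>   s = std::move(out);
>>>>> }
>>>>>
>>>>> int main() {
>>>>>   Intervals overlap = {{0, 4194304}};  // fresh clone: whole 4MB shared
>>>>>   subtract(overlap, 64, 4);            // a 4-byte write at offset 64
>>>>>   for (auto [o, l] : overlap)
>>>>>     std::cout << "[" << o << "~" << l << "] ";
>>>>>   std::cout << "\n";  // prints: [0~64] [68~4194236]
>>>>> }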
>>>>>
>>>>> For example, we did the following test: we first created a snap for
>>>>> an rbd image "test.2.img" whose size is only 4MB, which means it only
>>>>> contains one object; then we sent a series of AioWrites to
>>>>> "test.2.img" to promote its HEAD object into the cache tier, while
>>>>> leaving its clone object in the base tier only; at that point, we
>>>>> used "ceph-objectstore-tool" to dump the object we wrote to, and the
>>>>> result was as follows:
>>>>>
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 4,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 4,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4536'2728",
>>>>>         "prior_version": "4536'2727",
>>>>>         "last_reqid": "client.174858.0:10",
>>>>>         "user_version": 14706,
>>>>>         "size": 68,
>>>>>         "mtime": "2017-08-09 11:30:18.263983",
>>>>>         "local_mtime": "2017-08-09 11:30:18.264310",
>>>>>         "lost": 0,
>>>>>         "flags": 4,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 68,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> In this result, we found that the overlap for clone "28" is [0~64].
>>>>> So we specifically sent an AioWrite req to this object, writing 4
>>>>> bytes of random data at offset 32, after which we thought the overlap
>>>>> should change to [0~32,36~28] (in offset~length notation). However,
>>>>> the result was as follows:
>>>>>
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 4,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 4,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4546'2730",
>>>>>         "prior_version": "4538'2729",
>>>>>         "last_reqid": "client.155555.0:10",
>>>>>         "user_version": 14708,
>>>>>         "size": 4096,
>>>>>         "mtime": "2017-08-09 11:36:20.717910",
>>>>>         "local_mtime": "2017-08-09 11:36:20.719039",
>>>>>         "lost": 0,
>>>>>         "flags": 4,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 4096,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> It's obvious that the overlap didn't change at all. If we do an
>>>>> export-diff under these circumstances, the result will be wrong.
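>>>>>
>>>>> To spell out why the diff goes wrong, here is a minimal standalone
>>>>> model of the read side (this is only a sketch of the idea behind
>>>>> "calc_snap_set_diff", not librbd's real signature or implementation):
>>>>> the extents reported as changed since the clone are essentially the
>>>>> complement of the overlap within the object, so the stale [0~64]
>>>>> hides the 4 bytes we wrote at offset 32:
>>>>>
>>>>> #include <cstdint>
>>>>> #include <iostream>
>>>>> #include <map>
>>>>>
>>>>> // Extents kept as offset -> length, sorted and non-overlapping.
>>>>> using Intervals = std::map<uint64_t, uint64_t>;
>>>>>
>>>>> // Complement of 'overlap' within [0, size): the extents a snap diff
>>>>> // would report as modified since the clone was taken.
>>>>> static Intervals dirty_extents(const Intervals &overlap, uint64_t size) {
>>>>>   Intervals dirty;
>>>>>   uint64_t pos = 0;
>>>>>   for (auto [off, len] : overlap) {
>>>>>     if (off > pos)
>>>>>       dirty[pos] = off - pos;  // gap before this shared extent is dirty
>>>>>     pos = off + len;
>>>>>   }
>>>>>   if (pos < size)
>>>>>     dirty[pos] = size - pos;   // tail beyond the last shared extent
>>>>>   return dirty;
>>>>> }
>>>>>
>>>>> int main() {
>>>>>   Intervals stale = {{0, 64}};  // overlap from the dump, never shrunk
>>>>>   for (auto [off, len] : dirty_extents(stale, 4096))
>>>>>     std::cout << off << "~" << len << " ";
>>>>>   std::cout << "\n";  // prints "64~4032" only: the write at 32 is lost
>>>>> }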
>>>>>
>>>>> Meanwhile, in the base tier, the "ceph-objectstore-tool" dump also
>>>>> showed that the recorded overlap didn't change either:
>>>>>
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 3,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 3,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4536'14459",
>>>>>         "prior_version": "4536'14458",
>>>>>         "last_reqid": "client.174834.0:10",
>>>>>         "user_version": 14648,
>>>>>         "size": 68,
>>>>>         "mtime": "2017-08-09 11:30:01.412634",
>>>>>         "local_mtime": "2017-08-09 11:30:01.413614",
>>>>>         "lost": 0,
>>>>>         "flags": 36,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 68,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> Then we turned to the source code to find the reason for this. We
>>>>> found that the reason should be that, in the
>>>>> ReplicatedPG::make_writeable method, when determining whether the
>>>>> write op's modified range should be subtracted from the clone
>>>>> overlap, two condition checks have to pass:
>>>>> "ctx->new_snapset.clones.size() > 0" and
>>>>> "is_present_clone(last_clone_oid)", as the code below shows.
>>>>>
>>>>>   // update most recent clone_overlap and usage stats
>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>>      * been evicted, it's not included in the stats */
>>>>>     hobject_t last_clone_oid = soid;
>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>       interval_set<uint64_t> &newest_overlap =
>>>>>         ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>       ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>       // modified_ranges is still in use by the clone
>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>       newest_overlap.subtract(ctx->modified_ranges);
>>>>>     }
>>>>>   }
>>>>>
>>>>> We thought that the latter condition check,
>>>>> "is_present_clone(last_clone_oid)", might not be a reasonable basis
>>>>> for deciding whether to subtract the write op's modified range from
>>>>> the clone overlap, so we changed the code above to the following,
>>>>> which moves the subtraction out of that condition check, and
>>>>> submitted a PR for it (https://github.com/ceph/ceph/pull/16790):
>>>>>
>>>>>   // update most recent clone_overlap and usage stats
>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>>      * been evicted, it's not included in the stats */
>>>>>     hobject_t last_clone_oid = soid;
>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>     interval_set<uint64_t> &newest_overlap =
>>>>>       ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>     ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>     newest_overlap.subtract(ctx->modified_ranges);
>>>>>
>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>       // modified_ranges is still in use by the clone
>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>     }
>>>>>   }
>>>>>
>>>>> In our test, this change solved the problem successfully; however, we
>>>>> can't confirm that it won't cause any new problems. So here we are,
>>>>> discussing how to solve this problem:-)
>>>>>
>>>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>>>>>
>>>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>>>> OK, I'll try that. Thank you:-)
>>>>>>>
>>>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>>>>>>> Could you just proxy the "list snaps" op from the cache tier down
>>>>>>>> to the base tier and combine the cache tier + base tier results?
>>>>>>>> Reading the associated ticket, it seems kludgy to me to attempt to
>>>>>>>> work around this within librbd by just refusing to provide
>>>>>>>> intra-object diffs if cache tiering is in use.
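>>>>>>>>
>>>>>>>> Roughly along these lines (just an illustrative sketch with
>>>>>>>> made-up types -- not the actual OSD list-snaps plumbing or its
>>>>>>>> response structures): a clone that exists only in the base tier
>>>>>>>> would still contribute its entry to the combined view, while the
>>>>>>>> cache tier's entry wins for any clone present in both tiers:
>>>>>>>>
>>>>>>>> #include <cstdint>
>>>>>>>> #include <map>
>>>>>>>> #include <utility>
>>>>>>>> #include <vector>
>>>>>>>>
>>>>>>>> // Hypothetical per-clone record, for illustration only.
>>>>>>>> struct CloneInfo {
>>>>>>>>   std::vector<std::pair<uint64_t, uint64_t>> overlap; // offset~length
>>>>>>>>   uint64_t size = 0;
>>>>>>>> };
>>>>>>>>
>>>>>>>> // Hypothetical list-snaps result, keyed by clone (snap) id.
>>>>>>>> using SnapsResult = std::map<uint64_t, CloneInfo>;
>>>>>>>>
>>>>>>>> // Start from the base tier's view; let the cache tier override
>>>>>>>> // any clone it actually holds.
>>>>>>>> SnapsResult combine_list_snaps(const SnapsResult &cache,
>>>>>>>>                                const SnapsResult &base) {
>>>>>>>>   SnapsResult out = base;
>>>>>>>>   for (const auto &entry : cache)
>>>>>>>>     out[entry.first] = entry.second;
>>>>>>>>   return out;
>>>>>>>> }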
>>>>>>>>
>>>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>>>>>> Hi, everyone.
>>>>>>>>>
>>>>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896",
>>>>>>>>> I just did another test: I did some writes to an object
>>>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD"
>>>>>>>>> object to the cache tier, after which I specifically wrote 4 bytes
>>>>>>>>> of random data at its offset 0x40. Then I used
>>>>>>>>> "ceph-objectstore-tool" to dump its "HEAD" version in the base
>>>>>>>>> tier; the result is as follows (before I raised it to the cache
>>>>>>>>> tier, there were three snaps, the latest of which is 26):
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>     "id": {
>>>>>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>         "key": "",
>>>>>>>>>         "snapid": -2,
>>>>>>>>>         "hash": 1655893237,
>>>>>>>>>         "max": 0,
>>>>>>>>>         "pool": 3,
>>>>>>>>>         "namespace": "",
>>>>>>>>>         "max": 0
>>>>>>>>>     },
>>>>>>>>>     "info": {
>>>>>>>>>         "oid": {
>>>>>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>             "key": "",
>>>>>>>>>             "snapid": -2,
>>>>>>>>>             "hash": 1655893237,
>>>>>>>>>             "max": 0,
>>>>>>>>>             "pool": 3,
>>>>>>>>>             "namespace": ""
>>>>>>>>>         },
>>>>>>>>>         "version": "4219'16423",
>>>>>>>>>         "prior_version": "3978'16310",
>>>>>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>>>>>         "user_version": 17205,
>>>>>>>>>         "size": 4194304,
>>>>>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>>>>>         "lost": 0,
>>>>>>>>>         "flags": 52,
>>>>>>>>>         "snaps": [],
>>>>>>>>>         "truncate_seq": 0,
>>>>>>>>>         "truncate_size": 0,
>>>>>>>>>         "data_digest": 2822203961,
>>>>>>>>>         "omap_digest": 4294967295,
>>>>>>>>>         "watchers": {}
>>>>>>>>>     },
>>>>>>>>>     "stat": {
>>>>>>>>>         "size": 4194304,
>>>>>>>>>         "blksize": 4096,
>>>>>>>>>         "blocks": 8200,
>>>>>>>>>         "nlink": 1
>>>>>>>>>     },
>>>>>>>>>     "SnapSet": {
>>>>>>>>>         "snap_context": {
>>>>>>>>>             "seq": 26,
>>>>>>>>>             "snaps": [
>>>>>>>>>                 26,
>>>>>>>>>                 25,
>>>>>>>>>                 16
>>>>>>>>>             ]
>>>>>>>>>         },
>>>>>>>>>         "head_exists": 1,
>>>>>>>>>         "clones": [
>>>>>>>>>             {
>>>>>>>>>                 "snap": 16,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[4~4194300]"
>>>>>>>>>             },
>>>>>>>>>             {
>>>>>>>>>                 "snap": 25,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>             },
>>>>>>>>>             {
>>>>>>>>>                 "snap": 26,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>             }
>>>>>>>>>         ]
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> As we can see, its clone_overlap for snap 26 is empty, which,
>>>>>>>>> combined with the previous test described in
>>>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes'
>>>>>>>>> "modified range" is recorded neither in the cache tier nor in the
>>>>>>>>> base tier.
>>>>>>>>>
>>>>>>>>> I think maybe we really should move the clone overlap modification
>>>>>>>>> out of the IF block guarded by the "is_present_clone" condition
>>>>>>>>> check. For now, I can't see any other way to fix this problem.
>>>>>>>>>
>>>>>>>>> Am I right about this?
>>>>>>>>>
>>>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>>>>>>> I mean, I think it's the condition check "is_present_clone" that
>>>>>>>>>> prevents the clone overlap from recording a client write
>>>>>>>>>> operation's modified range when the target "HEAD" object exists
>>>>>>>>>> without its most recent clone object. If I'm right, just moving
>>>>>>>>>> the clone overlap modification out of the "is_present_clone"
>>>>>>>>>> condition check block, as in the PR
>>>>>>>>>> "https://github.com/ceph/ceph/pull/16790", is enough to solve
>>>>>>>>>> this case, and this fix wouldn't cause other problems.
>>>>>>>>>>
>>>>>>>>>> In our test, this fix solved the problem successfully; however,
>>>>>>>>>> we can't confirm that it won't cause new problems yet.
>>>>>>>>>>
>>>>>>>>>> So if anyone sees this and knows the answer, please help us.
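>>>>>>>>>>
>>>>>>>>>> To boil the reasoning down, here is a tiny standalone model of
>>>>>>>>>> the two code paths (the struct fields are hypothetical stand-ins
>>>>>>>>>> for the ctx/snapset state in make_writeable, not the real
>>>>>>>>>> ReplicatedPG members): with the HEAD promoted to the cache tier
>>>>>>>>>> and the newest clone left in the base tier, the guarded version
>>>>>>>>>> never shrinks the overlap, while the reordered version still
>>>>>>>>>> does.
>>>>>>>>>>
>>>>>>>>>> #include <iostream>
>>>>>>>>>>
>>>>>>>>>> struct TierState {
>>>>>>>>>>   bool have_clones;        // ctx->new_snapset.clones.size() > 0
>>>>>>>>>>   bool clone_in_this_tier; // is_present_clone(last_clone_oid)
>>>>>>>>>> };
>>>>>>>>>>
>>>>>>>>>> // Pre-PR behaviour: the subtraction runs only inside both checks.
>>>>>>>>>> bool overlap_shrinks_old(TierState s) {
>>>>>>>>>>   return s.have_clones && s.clone_in_this_tier;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> // PR 16790 behaviour: the subtraction runs whenever clones
>>>>>>>>>> // exist; only the usage-stats accounting stays behind
>>>>>>>>>> // is_present_clone().
>>>>>>>>>> bool overlap_shrinks_new(TierState s) {
>>>>>>>>>>   return s.have_clones;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> int main() {
>>>>>>>>>>   // Cache tier: HEAD present, newest clone evicted to base tier.
>>>>>>>>>>   TierState cache{true, false};
>>>>>>>>>>   std::cout << "old: " << overlap_shrinks_old(cache)   // 0: stale
>>>>>>>>>>             << ", new: " << overlap_shrinks_new(cache) // 1: updated
>>>>>>>>>>             << "\n";
>>>>>>>>>> }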
>>>>>>>>>>
>>>>>>>>>> Thank you:-)
>>>>>>>>>>
>>>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>>>>>>>> Hi, Greg:-)
>>>>>>>>>>>
>>>>>>>>>>> I finally got what you meant in https://github.com/ceph/ceph/pull/16790.
>>>>>>>>>>>
>>>>>>>>>>> I agree with you that "clone overlap is supposed to be tracking
>>>>>>>>>>> which data is the same on disk".
>>>>>>>>>>>
>>>>>>>>>>> My thought is that "ObjectContext::new_snapset.clones" is
>>>>>>>>>>> already an indicator of whether there are clone objects on
>>>>>>>>>>> disk. So, in the "cache tier" scenario, although a clone oid may
>>>>>>>>>>> not correspond to a "present clone" in the cache tier, as long
>>>>>>>>>>> as "ObjectContext::new_snapset.clones" is not empty, there must
>>>>>>>>>>> be such a clone object in the base tier. And, as long as
>>>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict one-to-one
>>>>>>>>>>> correspondence with "ObjectContext::new_snapset.clone_overlap",
>>>>>>>>>>> passing the condition check
>>>>>>>>>>> "if (ctx->new_snapset.clones.size() > 0)" is enough to conclude
>>>>>>>>>>> that the clone object exists.
>>>>>>>>>>>
>>>>>>>>>>> So, if I'm right, passing the condition check
>>>>>>>>>>> "if (ctx->new_snapset.clones.size() > 0)" is already enough for
>>>>>>>>>>> us to do "newest_overlap.subtract(ctx->modified_ranges)"; it
>>>>>>>>>>> doesn't also have to pass "is_present_clone".
>>>>>>>>>>>
>>>>>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>>>>>
>>>>>>>>>>> Please help us, thank you:-)
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jason
>>>>>>
>>>>>> --
>>>>>> Jason
>>>
>>> --
>>> Jason