I cannot speak to the design principles behind this since I didn't write it. Whatever the solution is, the results should be consistent regardless of the current promotion/eviction/clean/dirty status of an object between the base and cache tier. For example, if the cache tier always performs a full object write when it writes back to the base tier, the base tier clone overlap would show zero overlap, so it would be inconsistent for the cache tier to show an accurate clone overlap up until the point the object is evicted. On Thu, Aug 10, 2017 at 11:11 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: > Yes, that's right. However, as you say, that doesn't seem to be very > efficient, and there strict time window constrains to our backup > tasks. > > On the other hand, I still don't quite understand why we can't just > simply move the clone overlap subtraction out of the > "is_present_clone" condition check in the PrimaryLogPG::make_writeable > method. I've searched and read all the source codes that are related > to the clone overlap, it seems that clone overlap is only used for > calculating the difference between clones or object subsets that needs > to be recovered or backfilled. If this is true, wouldn't it be > reasonable to keep the overlap accurate no matter whether the clone > objects are in the cache tier or the base tier? Why leave it > incomplete when clone objects are not in the cache tier? > > Could you show us some more design principles behind this, please? > Maybe we can figure out a better solution if we can fully understand > the design intentions:-) > > Thanks for helping us:-) > > On 10 August 2017 at 21:11, Jason Dillaman <jdillama@xxxxxxxxxx> wrote: >> Yes, it's quite possible that for simplicity sake the cache tier OSD >> writes back the full object to the base tier even if only a single >> byte has changed. In that case, while not optimal behavior, at least >> if you combine the base + cache tier snapsets together, RBD should be >> able to export enough data to reconstruct the image. >> >> On Thu, Aug 10, 2017 at 3:06 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>> Hi, Jason. >>> >>> I did a test, it turned out that, after flushing the object out of the >>> cache tier, the clone overlap in base tier changed to empty, as is >>> shown below. I think this maybe because that the flush operation just >>> mark the whole range of the object being flushed as modified, so if >>> the object's size hasn't changed, the overlap becomes empty. Is this >>> right? >>> >>> Thank you:-) >>> >>> { >>> "id": { >>> "oid": "test.obj", >>> "key": "", >>> "snapid": -2, >>> "hash": 3575411564, >>> "max": 0, >>> "pool": 10, >>> "namespace": "", >>> "max": 0 >>> }, >>> "info": { >>> "oid": { >>> "oid": "test.obj", >>> "key": "", >>> "snapid": -2, >>> "hash": 3575411564, >>> "max": 0, >>> "pool": 10, >>> "namespace": "" >>> }, >>> "version": "4876'9", >>> "prior_version": "4854'8", >>> "last_reqid": "osd.35.4869:1", >>> "user_version": 16, >>> "size": 4194303, >>> "mtime": "2017-08-10 14:54:56.087387", >>> "local_mtime": "2017-08-10 14:59:15.252755", >>> "lost": 0, >>> "flags": 52, >>> "snaps": [], >>> "truncate_seq": 0, >>> "truncate_size": 0, >>> "data_digest": 2827420887, >>> "omap_digest": 4294967295, >>> "watchers": {} >>> }, >>> "stat": { >>> "size": 4194303, >>> "blksize": 4096, >>> "blocks": 8200, >>> "nlink": 1 >>> }, >>> "SnapSet": { >>> "snap_context": { >>> "seq": 3, >>> "snaps": [ >>> 3 >>> ] >>> }, >>> "head_exists": 1, >>> "clones": [ >>> { >>> "snap": 3, >>> "size": 4194303, >>> "overlap": "[]" >>> } >>> ] >>> } >>> } >>> >>> On 9 August 2017 at 23:26, Jason Dillaman <jdillama@xxxxxxxxxx> wrote: >>>> If you flush the object out of the cache tier so that its changes are >>>> recorded in the base tier, is the overlap correctly recorded in the >>>> base tier? >>>> >>>> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>>> By the way, according to our test, since the modified range is not >>>>> recorded either in the cache tier or in the base tier, I think >>>>> proxying the "list-snaps" down to the base tier might not work as >>>>> well, is that right? >>>>> >>>>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>>>> Sorry, I didn't "reply all"....:-) >>>>>> >>>>>> >>>>>> ---------- Forwarded message ---------- >>>>>> From: Xuehan Xu <xxhdx1985126@xxxxxxxxx> >>>>>> Date: 9 August 2017 at 12:14 >>>>>> Subject: Re: About the problem "export_diff relies on clone_overlap, >>>>>> which is lost when cache tier is enabled" >>>>>> To: dillaman@xxxxxxxxxx >>>>>> >>>>>> >>>>>> Um, no, I don't think they are related. >>>>>> >>>>>> My problem is this: >>>>>> >>>>>> At first , we tried to use "rbd export-diff" to do incremental backup >>>>>> on Jewel verion ceph cluster which is cache-tier enabled. However, >>>>>> when we compare the original rbd image and the backup rbd image, we >>>>>> find that they are different. After a series of debugging, we found >>>>>> that this is because WRITE ops' "modified_range" is not substracted >>>>>> from the clone overlap of its targeting object's HEAD version when >>>>>> that object's HEAD verion is in cache iter and its most recent clone >>>>>> object is not, which led to the miscalculation of the >>>>>> "calc_snap_set_diff" function. >>>>>> >>>>>> For example, we did such a test: we first made created a snap for an >>>>>> rbd image "test.2.img" whose size is only 4MB which means it only >>>>>> contains one object; then, we sent a series of AioWrites to >>>>>> "test.2.img" to promote its HEAD object into cache tier, while leaving >>>>>> its clone object in the base tier only; at that time, we used >>>>>> "ceph-objectstore-tool" to dump the object we wrote to, and the result >>>>>> was as follows: >>>>>> >>>>>> { >>>>>> "id": { >>>>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>>>> "key": "", >>>>>> "snapid": -2, >>>>>> "hash": 2375431681, >>>>>> "max": 0, >>>>>> "pool": 4, >>>>>> "namespace": "", >>>>>> "max": 0 >>>>>> }, >>>>>> "info": { >>>>>> "oid": { >>>>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>>>> "key": "", >>>>>> "snapid": -2, >>>>>> "hash": 2375431681, >>>>>> "max": 0, >>>>>> "pool": 4, >>>>>> "namespace": "" >>>>>> }, >>>>>> "version": "4536'2728", >>>>>> "prior_version": "4536'2727", >>>>>> "last_reqid": "client.174858.0:10", >>>>>> "user_version": 14706, >>>>>> "size": 68, >>>>>> "mtime": "2017-08-09 11:30:18.263983", >>>>>> "local_mtime": "2017-08-09 11:30:18.264310", >>>>>> "lost": 0, >>>>>> "flags": 4, >>>>>> "snaps": [], >>>>>> "truncate_seq": 0, >>>>>> "truncate_size": 0, >>>>>> "data_digest": 4294967295, >>>>>> "omap_digest": 4294967295, >>>>>> "watchers": {} >>>>>> }, >>>>>> "stat": { >>>>>> "size": 68, >>>>>> "blksize": 4096, >>>>>> "blocks": 16, >>>>>> "nlink": 1 >>>>>> }, >>>>>> "SnapSet": { >>>>>> "snap_context": { >>>>>> "seq": 28, >>>>>> "snaps": [ >>>>>> 28 >>>>>> ] >>>>>> }, >>>>>> "head_exists": 1, >>>>>> "clones": [ >>>>>> { >>>>>> "snap": 28, >>>>>> "size": 68, >>>>>> "overlap": "[0~64]" >>>>>> } >>>>>> ] >>>>>> } >>>>>> } >>>>>> >>>>>> In this result, we found that the overlap for clone "28" is [0~64]. So >>>>>> we specifically sent a AioWrite req to this object to write to the >>>>>> offset 32 with 4 bytes of ramdon data, which we thought the overlap >>>>>> should change to [0~32, 36~64]. However, the result is as follows: >>>>>> >>>>>> { >>>>>> "id": { >>>>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>>>> "key": "", >>>>>> "snapid": -2, >>>>>> "hash": 2375431681, >>>>>> "max": 0, >>>>>> "pool": 4, >>>>>> "namespace": "", >>>>>> "max": 0 >>>>>> }, >>>>>> "info": { >>>>>> "oid": { >>>>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>>>> "key": "", >>>>>> "snapid": -2, >>>>>> "hash": 2375431681, >>>>>> "max": 0, >>>>>> "pool": 4, >>>>>> "namespace": "" >>>>>> }, >>>>>> "version": "4546'2730", >>>>>> "prior_version": "4538'2729", >>>>>> "last_reqid": "client.155555.0:10", >>>>>> "user_version": 14708, >>>>>> "size": 4096, >>>>>> "mtime": "2017-08-09 11:36:20.717910", >>>>>> "local_mtime": "2017-08-09 11:36:20.719039", >>>>>> "lost": 0, >>>>>> "flags": 4, >>>>>> "snaps": [], >>>>>> "truncate_seq": 0, >>>>>> "truncate_size": 0, >>>>>> "data_digest": 4294967295, >>>>>> "omap_digest": 4294967295, >>>>>> "watchers": {} >>>>>> }, >>>>>> "stat": { >>>>>> "size": 4096, >>>>>> "blksize": 4096, >>>>>> "blocks": 16, >>>>>> "nlink": 1 >>>>>> }, >>>>>> "SnapSet": { >>>>>> "snap_context": { >>>>>> "seq": 28, >>>>>> "snaps": [ >>>>>> 28 >>>>>> ] >>>>>> }, >>>>>> "head_exists": 1, >>>>>> "clones": [ >>>>>> { >>>>>> "snap": 28, >>>>>> "size": 68, >>>>>> "overlap": "[0~64]" >>>>>> } >>>>>> ] >>>>>> } >>>>>> } >>>>>> >>>>>> It's obvious that it didn't change at all. If we do export-diff under >>>>>> this circumstance, the result would be wrong. Meanwhile, in the base >>>>>> tier, the "ceph-objectstore-tool" dump's result also showed that the >>>>>> overlap recorded in the base tier didn't change also: >>>>>> { >>>>>> "id": { >>>>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>>>> "key": "", >>>>>> "snapid": -2, >>>>>> "hash": 2375431681, >>>>>> "max": 0, >>>>>> "pool": 3, >>>>>> "namespace": "", >>>>>> "max": 0 >>>>>> }, >>>>>> "info": { >>>>>> "oid": { >>>>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>>>> "key": "", >>>>>> "snapid": -2, >>>>>> "hash": 2375431681, >>>>>> "max": 0, >>>>>> "pool": 3, >>>>>> "namespace": "" >>>>>> }, >>>>>> "version": "4536'14459", >>>>>> "prior_version": "4536'14458", >>>>>> "last_reqid": "client.174834.0:10", >>>>>> "user_version": 14648, >>>>>> "size": 68, >>>>>> "mtime": "2017-08-09 11:30:01.412634", >>>>>> "local_mtime": "2017-08-09 11:30:01.413614", >>>>>> "lost": 0, >>>>>> "flags": 36, >>>>>> "snaps": [], >>>>>> "truncate_seq": 0, >>>>>> "truncate_size": 0, >>>>>> "data_digest": 4294967295, >>>>>> "omap_digest": 4294967295, >>>>>> "watchers": {} >>>>>> }, >>>>>> "stat": { >>>>>> "size": 68, >>>>>> "blksize": 4096, >>>>>> "blocks": 16, >>>>>> "nlink": 1 >>>>>> }, >>>>>> "SnapSet": { >>>>>> "snap_context": { >>>>>> "seq": 28, >>>>>> "snaps": [ >>>>>> 28 >>>>>> ] >>>>>> }, >>>>>> "head_exists": 1, >>>>>> "clones": [ >>>>>> { >>>>>> "snap": 28, >>>>>> "size": 68, >>>>>> "overlap": "[0~64]" >>>>>> } >>>>>> ] >>>>>> } >>>>>> } >>>>>> >>>>>> Then we turn to the source code to find the reason for this. And we >>>>>> found that, the reason should be that, in the >>>>>> ReplicatedPG::make_writeable method, when determining whether the >>>>>> write op's modified range should be subtracted from the clone overlap, >>>>>> it has pass two condition check: "ctx->new_snapset.clones.size() > 0" >>>>>> and "is_present_clone(last_clone_oid)", as the code below shows. >>>>>> >>>>>> // update most recent clone_overlap and usage stats >>>>>> if (ctx->new_snapset.clones.size() > 0) { >>>>>> /* we need to check whether the most recent clone exists, if it's >>>>>> been evicted, >>>>>> * it's not included in the stats */ >>>>>> hobject_t last_clone_oid = soid; >>>>>> last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first; >>>>>> if (is_present_clone(last_clone_oid)) { >>>>>> interval_set<uint64_t> &newest_overlap = >>>>>> ctx->new_snapset.clone_overlap.rbegin()->second; >>>>>> ctx->modified_ranges.intersection_of(newest_overlap); >>>>>> // modified_ranges is still in use by the clone >>>>>> add_interval_usage(ctx->modified_ranges, ctx->delta_stats); >>>>>> newest_overlap.subtract(ctx->modified_ranges); >>>>>> } >>>>>> } >>>>>> >>>>>> We thought that the latter condition check >>>>>> "is_present_clone(last_clone_oid)" might not be reasonable to be a >>>>>> judgement base for the determination of whether to subtract the clone >>>>>> overlap with write ops' modified range, so we changed to code above to >>>>>> the following, which moved the subtraction out of the latter condition >>>>>> check, and submitted a pr for this: >>>>>> https://github.com/ceph/ceph/pull/16790: >>>>>> >>>>>> // update most recent clone_overlap and usage stats >>>>>> if (ctx->new_snapset.clones.size() > 0) { >>>>>> /* we need to check whether the most recent clone exists, if it's >>>>>> been evicted, >>>>>> * it's not included in the stats */ >>>>>> hobject_t last_clone_oid = soid; >>>>>> last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first; >>>>>> interval_set<uint64_t> &newest_overlap = >>>>>> ctx->new_snapset.clone_overlap.rbegin()->second; >>>>>> ctx->modified_ranges.intersection_of(newest_overlap); >>>>>> newest_overlap.subtract(ctx->modified_ranges); >>>>>> >>>>>> if (is_present_clone(last_clone_oid)) { >>>>>> // modified_ranges is still in use by the clone >>>>>> add_interval_usage(ctx->modified_ranges, ctx->delta_stats); >>>>>> } >>>>>> } >>>>>> >>>>>> In our test, this change solved the problem successfully, however, we >>>>>> can't confirm that this change won't cause any new problems. So, here >>>>>> we are discussing how to solve this problem:-) >>>>>> >>>>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@xxxxxxxxxx> wrote: >>>>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626? >>>>>>> >>>>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>>>>>> OK, I'll try that. Thank you:-) >>>>>>>> >>>>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@xxxxxxxxxx> wrote: >>>>>>>>> Could you just proxy the "list snaps" op from the cache tier down to >>>>>>>>> the base tier and combine the cache tier + base tier results? Reading >>>>>>>>> the associated ticket, it seems kludgy to me to attempt to work around >>>>>>>>> this within librbd by just refusing the provide intra-object diffs if >>>>>>>>> cache tiering is in-use. >>>>>>>>> >>>>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>>>>>>>> Hi, everyone. >>>>>>>>>> >>>>>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I >>>>>>>>>> just did another test: I did some writes to an object >>>>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object >>>>>>>>>> to the cache tier, after which I specifically write to its offset 0x40 >>>>>>>>>> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to >>>>>>>>>> dump its "HEAD" version in the base tier, the result is as >>>>>>>>>> follows(before I raise it to cache tier, there is three snaps the >>>>>>>>>> latest of which is 26): >>>>>>>>>> >>>>>>>>>> { >>>>>>>>>> "id": { >>>>>>>>>> "oid": "rbd_data.1ebc6238e1f29.0000000000000000", >>>>>>>>>> "key": "", >>>>>>>>>> "snapid": -2, >>>>>>>>>> "hash": 1655893237, >>>>>>>>>> "max": 0, >>>>>>>>>> "pool": 3, >>>>>>>>>> "namespace": "", >>>>>>>>>> "max": 0 >>>>>>>>>> }, >>>>>>>>>> "info": { >>>>>>>>>> "oid": { >>>>>>>>>> "oid": "rbd_data.1ebc6238e1f29.0000000000000000", >>>>>>>>>> "key": "", >>>>>>>>>> "snapid": -2, >>>>>>>>>> "hash": 1655893237, >>>>>>>>>> "max": 0, >>>>>>>>>> "pool": 3, >>>>>>>>>> "namespace": "" >>>>>>>>>> }, >>>>>>>>>> "version": "4219'16423", >>>>>>>>>> "prior_version": "3978'16310", >>>>>>>>>> "last_reqid": "osd.70.4213:2359", >>>>>>>>>> "user_version": 17205, >>>>>>>>>> "size": 4194304, >>>>>>>>>> "mtime": "2017-08-03 22:07:34.656122", >>>>>>>>>> "local_mtime": "2017-08-05 23:02:33.628734", >>>>>>>>>> "lost": 0, >>>>>>>>>> "flags": 52, >>>>>>>>>> "snaps": [], >>>>>>>>>> "truncate_seq": 0, >>>>>>>>>> "truncate_size": 0, >>>>>>>>>> "data_digest": 2822203961, >>>>>>>>>> "omap_digest": 4294967295, >>>>>>>>>> "watchers": {} >>>>>>>>>> }, >>>>>>>>>> "stat": { >>>>>>>>>> "size": 4194304, >>>>>>>>>> "blksize": 4096, >>>>>>>>>> "blocks": 8200, >>>>>>>>>> "nlink": 1 >>>>>>>>>> }, >>>>>>>>>> "SnapSet": { >>>>>>>>>> "snap_context": { >>>>>>>>>> "seq": 26, >>>>>>>>>> "snaps": [ >>>>>>>>>> 26, >>>>>>>>>> 25, >>>>>>>>>> 16 >>>>>>>>>> ] >>>>>>>>>> }, >>>>>>>>>> "head_exists": 1, >>>>>>>>>> "clones": [ >>>>>>>>>> { >>>>>>>>>> "snap": 16, >>>>>>>>>> "size": 4194304, >>>>>>>>>> "overlap": "[4~4194300]" >>>>>>>>>> }, >>>>>>>>>> { >>>>>>>>>> "snap": 25, >>>>>>>>>> "size": 4194304, >>>>>>>>>> "overlap": "[]" >>>>>>>>>> }, >>>>>>>>>> { >>>>>>>>>> "snap": 26, >>>>>>>>>> "size": 4194304, >>>>>>>>>> "overlap": "[]" >>>>>>>>>> } >>>>>>>>>> ] >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> As we can see, its clone_overlap for snap 26 is empty, which, >>>>>>>>>> combining with the previous test described in >>>>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes' "modified >>>>>>>>>> range" is neither recorded in the cache tier nor in the base tier. >>>>>>>>>> >>>>>>>>>> I think maybe we really should move the clone overlap modification out >>>>>>>>>> of the IF block which has the condition check "is_present_clone". As >>>>>>>>>> for now, I can't see any other way to fix this problem. >>>>>>>>>> >>>>>>>>>> Am I right about this? >>>>>>>>>> >>>>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>>>>>>>>> I mean I think it's the condition check "is_present_clone" that >>>>>>>>>>> prevent the clone overlap to record the client write operations >>>>>>>>>>> modified range when the target "HEAD" object exists without its most >>>>>>>>>>> recent clone object, and if I'm right, just move the clone overlap >>>>>>>>>>> modification out of the "is_present_clone" condition check block is >>>>>>>>>>> enough to solve this case, just like the PR >>>>>>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause >>>>>>>>>>> other problems. >>>>>>>>>>> >>>>>>>>>>> In our test, this fix solved the problem successfully, however, we >>>>>>>>>>> can't confirm it won't cause new problems yet. >>>>>>>>>>> >>>>>>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-) >>>>>>>>>>> >>>>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>>>>>>>>>> Hi, grep:-) >>>>>>>>>>>> >>>>>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790. >>>>>>>>>>>> >>>>>>>>>>>> I agree with you in that " clone overlap is supposed to be tracking >>>>>>>>>>>> which data is the same on disk". >>>>>>>>>>>> >>>>>>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an >>>>>>>>>>>> indicator about whether there are clone objects on disk, so, in the >>>>>>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to >>>>>>>>>>>> a "present clone" in cache tier, as long as >>>>>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one >>>>>>>>>>>> such clone object in the base tier. And, as long as >>>>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one" >>>>>>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing >>>>>>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is >>>>>>>>>>>> enough to make the judgement that the clone object exists. >>>>>>>>>>>> >>>>>>>>>>>> So, if I'm right, passing the condition check "if >>>>>>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do >>>>>>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to >>>>>>>>>>>> pass "is_present_clone". >>>>>>>>>>>> >>>>>>>>>>>> Am I right about this? Or am I missing anything? >>>>>>>>>>>> >>>>>>>>>>>> Please help us, thank you:-) >>>>>>>>>> -- >>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Jason >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Jason >>>> >>>> >>>> >>>> -- >>>> Jason >> >> >> >> -- >> Jason -- Jason -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html