Yes, it's quite possible that for simplicity sake the cache tier OSD writes back the full object to the base tier even if only a single byte has changed. In that case, while not optimal behavior, at least if you combine the base + cache tier snapsets together, RBD should be able to export enough data to reconstruct the image. On Thu, Aug 10, 2017 at 3:06 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: > Hi, Jason. > > I did a test, it turned out that, after flushing the object out of the > cache tier, the clone overlap in base tier changed to empty, as is > shown below. I think this maybe because that the flush operation just > mark the whole range of the object being flushed as modified, so if > the object's size hasn't changed, the overlap becomes empty. Is this > right? > > Thank you:-) > > { > "id": { > "oid": "test.obj", > "key": "", > "snapid": -2, > "hash": 3575411564, > "max": 0, > "pool": 10, > "namespace": "", > "max": 0 > }, > "info": { > "oid": { > "oid": "test.obj", > "key": "", > "snapid": -2, > "hash": 3575411564, > "max": 0, > "pool": 10, > "namespace": "" > }, > "version": "4876'9", > "prior_version": "4854'8", > "last_reqid": "osd.35.4869:1", > "user_version": 16, > "size": 4194303, > "mtime": "2017-08-10 14:54:56.087387", > "local_mtime": "2017-08-10 14:59:15.252755", > "lost": 0, > "flags": 52, > "snaps": [], > "truncate_seq": 0, > "truncate_size": 0, > "data_digest": 2827420887, > "omap_digest": 4294967295, > "watchers": {} > }, > "stat": { > "size": 4194303, > "blksize": 4096, > "blocks": 8200, > "nlink": 1 > }, > "SnapSet": { > "snap_context": { > "seq": 3, > "snaps": [ > 3 > ] > }, > "head_exists": 1, > "clones": [ > { > "snap": 3, > "size": 4194303, > "overlap": "[]" > } > ] > } > } > > On 9 August 2017 at 23:26, Jason Dillaman <jdillama@xxxxxxxxxx> wrote: >> If you flush the object out of the cache tier so that its changes are >> recorded in the base tier, is the overlap correctly recorded in the >> base tier? >> >> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>> By the way, according to our test, since the modified range is not >>> recorded either in the cache tier or in the base tier, I think >>> proxying the "list-snaps" down to the base tier might not work as >>> well, is that right? >>> >>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>> Sorry, I didn't "reply all"....:-) >>>> >>>> >>>> ---------- Forwarded message ---------- >>>> From: Xuehan Xu <xxhdx1985126@xxxxxxxxx> >>>> Date: 9 August 2017 at 12:14 >>>> Subject: Re: About the problem "export_diff relies on clone_overlap, >>>> which is lost when cache tier is enabled" >>>> To: dillaman@xxxxxxxxxx >>>> >>>> >>>> Um, no, I don't think they are related. >>>> >>>> My problem is this: >>>> >>>> At first , we tried to use "rbd export-diff" to do incremental backup >>>> on Jewel verion ceph cluster which is cache-tier enabled. However, >>>> when we compare the original rbd image and the backup rbd image, we >>>> find that they are different. After a series of debugging, we found >>>> that this is because WRITE ops' "modified_range" is not substracted >>>> from the clone overlap of its targeting object's HEAD version when >>>> that object's HEAD verion is in cache iter and its most recent clone >>>> object is not, which led to the miscalculation of the >>>> "calc_snap_set_diff" function. >>>> >>>> For example, we did such a test: we first made created a snap for an >>>> rbd image "test.2.img" whose size is only 4MB which means it only >>>> contains one object; then, we sent a series of AioWrites to >>>> "test.2.img" to promote its HEAD object into cache tier, while leaving >>>> its clone object in the base tier only; at that time, we used >>>> "ceph-objectstore-tool" to dump the object we wrote to, and the result >>>> was as follows: >>>> >>>> { >>>> "id": { >>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>> "key": "", >>>> "snapid": -2, >>>> "hash": 2375431681, >>>> "max": 0, >>>> "pool": 4, >>>> "namespace": "", >>>> "max": 0 >>>> }, >>>> "info": { >>>> "oid": { >>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>> "key": "", >>>> "snapid": -2, >>>> "hash": 2375431681, >>>> "max": 0, >>>> "pool": 4, >>>> "namespace": "" >>>> }, >>>> "version": "4536'2728", >>>> "prior_version": "4536'2727", >>>> "last_reqid": "client.174858.0:10", >>>> "user_version": 14706, >>>> "size": 68, >>>> "mtime": "2017-08-09 11:30:18.263983", >>>> "local_mtime": "2017-08-09 11:30:18.264310", >>>> "lost": 0, >>>> "flags": 4, >>>> "snaps": [], >>>> "truncate_seq": 0, >>>> "truncate_size": 0, >>>> "data_digest": 4294967295, >>>> "omap_digest": 4294967295, >>>> "watchers": {} >>>> }, >>>> "stat": { >>>> "size": 68, >>>> "blksize": 4096, >>>> "blocks": 16, >>>> "nlink": 1 >>>> }, >>>> "SnapSet": { >>>> "snap_context": { >>>> "seq": 28, >>>> "snaps": [ >>>> 28 >>>> ] >>>> }, >>>> "head_exists": 1, >>>> "clones": [ >>>> { >>>> "snap": 28, >>>> "size": 68, >>>> "overlap": "[0~64]" >>>> } >>>> ] >>>> } >>>> } >>>> >>>> In this result, we found that the overlap for clone "28" is [0~64]. So >>>> we specifically sent a AioWrite req to this object to write to the >>>> offset 32 with 4 bytes of ramdon data, which we thought the overlap >>>> should change to [0~32, 36~64]. However, the result is as follows: >>>> >>>> { >>>> "id": { >>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>> "key": "", >>>> "snapid": -2, >>>> "hash": 2375431681, >>>> "max": 0, >>>> "pool": 4, >>>> "namespace": "", >>>> "max": 0 >>>> }, >>>> "info": { >>>> "oid": { >>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>> "key": "", >>>> "snapid": -2, >>>> "hash": 2375431681, >>>> "max": 0, >>>> "pool": 4, >>>> "namespace": "" >>>> }, >>>> "version": "4546'2730", >>>> "prior_version": "4538'2729", >>>> "last_reqid": "client.155555.0:10", >>>> "user_version": 14708, >>>> "size": 4096, >>>> "mtime": "2017-08-09 11:36:20.717910", >>>> "local_mtime": "2017-08-09 11:36:20.719039", >>>> "lost": 0, >>>> "flags": 4, >>>> "snaps": [], >>>> "truncate_seq": 0, >>>> "truncate_size": 0, >>>> "data_digest": 4294967295, >>>> "omap_digest": 4294967295, >>>> "watchers": {} >>>> }, >>>> "stat": { >>>> "size": 4096, >>>> "blksize": 4096, >>>> "blocks": 16, >>>> "nlink": 1 >>>> }, >>>> "SnapSet": { >>>> "snap_context": { >>>> "seq": 28, >>>> "snaps": [ >>>> 28 >>>> ] >>>> }, >>>> "head_exists": 1, >>>> "clones": [ >>>> { >>>> "snap": 28, >>>> "size": 68, >>>> "overlap": "[0~64]" >>>> } >>>> ] >>>> } >>>> } >>>> >>>> It's obvious that it didn't change at all. If we do export-diff under >>>> this circumstance, the result would be wrong. Meanwhile, in the base >>>> tier, the "ceph-objectstore-tool" dump's result also showed that the >>>> overlap recorded in the base tier didn't change also: >>>> { >>>> "id": { >>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>> "key": "", >>>> "snapid": -2, >>>> "hash": 2375431681, >>>> "max": 0, >>>> "pool": 3, >>>> "namespace": "", >>>> "max": 0 >>>> }, >>>> "info": { >>>> "oid": { >>>> "oid": "rbd_data.2aae62ae8944a.0000000000000000", >>>> "key": "", >>>> "snapid": -2, >>>> "hash": 2375431681, >>>> "max": 0, >>>> "pool": 3, >>>> "namespace": "" >>>> }, >>>> "version": "4536'14459", >>>> "prior_version": "4536'14458", >>>> "last_reqid": "client.174834.0:10", >>>> "user_version": 14648, >>>> "size": 68, >>>> "mtime": "2017-08-09 11:30:01.412634", >>>> "local_mtime": "2017-08-09 11:30:01.413614", >>>> "lost": 0, >>>> "flags": 36, >>>> "snaps": [], >>>> "truncate_seq": 0, >>>> "truncate_size": 0, >>>> "data_digest": 4294967295, >>>> "omap_digest": 4294967295, >>>> "watchers": {} >>>> }, >>>> "stat": { >>>> "size": 68, >>>> "blksize": 4096, >>>> "blocks": 16, >>>> "nlink": 1 >>>> }, >>>> "SnapSet": { >>>> "snap_context": { >>>> "seq": 28, >>>> "snaps": [ >>>> 28 >>>> ] >>>> }, >>>> "head_exists": 1, >>>> "clones": [ >>>> { >>>> "snap": 28, >>>> "size": 68, >>>> "overlap": "[0~64]" >>>> } >>>> ] >>>> } >>>> } >>>> >>>> Then we turn to the source code to find the reason for this. And we >>>> found that, the reason should be that, in the >>>> ReplicatedPG::make_writeable method, when determining whether the >>>> write op's modified range should be subtracted from the clone overlap, >>>> it has pass two condition check: "ctx->new_snapset.clones.size() > 0" >>>> and "is_present_clone(last_clone_oid)", as the code below shows. >>>> >>>> // update most recent clone_overlap and usage stats >>>> if (ctx->new_snapset.clones.size() > 0) { >>>> /* we need to check whether the most recent clone exists, if it's >>>> been evicted, >>>> * it's not included in the stats */ >>>> hobject_t last_clone_oid = soid; >>>> last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first; >>>> if (is_present_clone(last_clone_oid)) { >>>> interval_set<uint64_t> &newest_overlap = >>>> ctx->new_snapset.clone_overlap.rbegin()->second; >>>> ctx->modified_ranges.intersection_of(newest_overlap); >>>> // modified_ranges is still in use by the clone >>>> add_interval_usage(ctx->modified_ranges, ctx->delta_stats); >>>> newest_overlap.subtract(ctx->modified_ranges); >>>> } >>>> } >>>> >>>> We thought that the latter condition check >>>> "is_present_clone(last_clone_oid)" might not be reasonable to be a >>>> judgement base for the determination of whether to subtract the clone >>>> overlap with write ops' modified range, so we changed to code above to >>>> the following, which moved the subtraction out of the latter condition >>>> check, and submitted a pr for this: >>>> https://github.com/ceph/ceph/pull/16790: >>>> >>>> // update most recent clone_overlap and usage stats >>>> if (ctx->new_snapset.clones.size() > 0) { >>>> /* we need to check whether the most recent clone exists, if it's >>>> been evicted, >>>> * it's not included in the stats */ >>>> hobject_t last_clone_oid = soid; >>>> last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first; >>>> interval_set<uint64_t> &newest_overlap = >>>> ctx->new_snapset.clone_overlap.rbegin()->second; >>>> ctx->modified_ranges.intersection_of(newest_overlap); >>>> newest_overlap.subtract(ctx->modified_ranges); >>>> >>>> if (is_present_clone(last_clone_oid)) { >>>> // modified_ranges is still in use by the clone >>>> add_interval_usage(ctx->modified_ranges, ctx->delta_stats); >>>> } >>>> } >>>> >>>> In our test, this change solved the problem successfully, however, we >>>> can't confirm that this change won't cause any new problems. So, here >>>> we are discussing how to solve this problem:-) >>>> >>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@xxxxxxxxxx> wrote: >>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626? >>>>> >>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>>>> OK, I'll try that. Thank you:-) >>>>>> >>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@xxxxxxxxxx> wrote: >>>>>>> Could you just proxy the "list snaps" op from the cache tier down to >>>>>>> the base tier and combine the cache tier + base tier results? Reading >>>>>>> the associated ticket, it seems kludgy to me to attempt to work around >>>>>>> this within librbd by just refusing the provide intra-object diffs if >>>>>>> cache tiering is in-use. >>>>>>> >>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>>>>>> Hi, everyone. >>>>>>>> >>>>>>>> Trying to solve the issue "http://tracker.ceph.com/issues/20896", I >>>>>>>> just did another test: I did some writes to an object >>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to raise its "HEAD" object >>>>>>>> to the cache tier, after which I specifically write to its offset 0x40 >>>>>>>> with 4 bytes of random data. Then I used "ceph-objectstore-tool" to >>>>>>>> dump its "HEAD" version in the base tier, the result is as >>>>>>>> follows(before I raise it to cache tier, there is three snaps the >>>>>>>> latest of which is 26): >>>>>>>> >>>>>>>> { >>>>>>>> "id": { >>>>>>>> "oid": "rbd_data.1ebc6238e1f29.0000000000000000", >>>>>>>> "key": "", >>>>>>>> "snapid": -2, >>>>>>>> "hash": 1655893237, >>>>>>>> "max": 0, >>>>>>>> "pool": 3, >>>>>>>> "namespace": "", >>>>>>>> "max": 0 >>>>>>>> }, >>>>>>>> "info": { >>>>>>>> "oid": { >>>>>>>> "oid": "rbd_data.1ebc6238e1f29.0000000000000000", >>>>>>>> "key": "", >>>>>>>> "snapid": -2, >>>>>>>> "hash": 1655893237, >>>>>>>> "max": 0, >>>>>>>> "pool": 3, >>>>>>>> "namespace": "" >>>>>>>> }, >>>>>>>> "version": "4219'16423", >>>>>>>> "prior_version": "3978'16310", >>>>>>>> "last_reqid": "osd.70.4213:2359", >>>>>>>> "user_version": 17205, >>>>>>>> "size": 4194304, >>>>>>>> "mtime": "2017-08-03 22:07:34.656122", >>>>>>>> "local_mtime": "2017-08-05 23:02:33.628734", >>>>>>>> "lost": 0, >>>>>>>> "flags": 52, >>>>>>>> "snaps": [], >>>>>>>> "truncate_seq": 0, >>>>>>>> "truncate_size": 0, >>>>>>>> "data_digest": 2822203961, >>>>>>>> "omap_digest": 4294967295, >>>>>>>> "watchers": {} >>>>>>>> }, >>>>>>>> "stat": { >>>>>>>> "size": 4194304, >>>>>>>> "blksize": 4096, >>>>>>>> "blocks": 8200, >>>>>>>> "nlink": 1 >>>>>>>> }, >>>>>>>> "SnapSet": { >>>>>>>> "snap_context": { >>>>>>>> "seq": 26, >>>>>>>> "snaps": [ >>>>>>>> 26, >>>>>>>> 25, >>>>>>>> 16 >>>>>>>> ] >>>>>>>> }, >>>>>>>> "head_exists": 1, >>>>>>>> "clones": [ >>>>>>>> { >>>>>>>> "snap": 16, >>>>>>>> "size": 4194304, >>>>>>>> "overlap": "[4~4194300]" >>>>>>>> }, >>>>>>>> { >>>>>>>> "snap": 25, >>>>>>>> "size": 4194304, >>>>>>>> "overlap": "[]" >>>>>>>> }, >>>>>>>> { >>>>>>>> "snap": 26, >>>>>>>> "size": 4194304, >>>>>>>> "overlap": "[]" >>>>>>>> } >>>>>>>> ] >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> As we can see, its clone_overlap for snap 26 is empty, which, >>>>>>>> combining with the previous test described in >>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes' "modified >>>>>>>> range" is neither recorded in the cache tier nor in the base tier. >>>>>>>> >>>>>>>> I think maybe we really should move the clone overlap modification out >>>>>>>> of the IF block which has the condition check "is_present_clone". As >>>>>>>> for now, I can't see any other way to fix this problem. >>>>>>>> >>>>>>>> Am I right about this? >>>>>>>> >>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>>>>>>> I mean I think it's the condition check "is_present_clone" that >>>>>>>>> prevent the clone overlap to record the client write operations >>>>>>>>> modified range when the target "HEAD" object exists without its most >>>>>>>>> recent clone object, and if I'm right, just move the clone overlap >>>>>>>>> modification out of the "is_present_clone" condition check block is >>>>>>>>> enough to solve this case, just like the PR >>>>>>>>> "https://github.com/ceph/ceph/pull/16790", and this fix wouldn't cause >>>>>>>>> other problems. >>>>>>>>> >>>>>>>>> In our test, this fix solved the problem successfully, however, we >>>>>>>>> can't confirm it won't cause new problems yet. >>>>>>>>> >>>>>>>>> So if anyone see this and knows the answer, please help us. Thank you:-) >>>>>>>>> >>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote: >>>>>>>>>> Hi, grep:-) >>>>>>>>>> >>>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790. >>>>>>>>>> >>>>>>>>>> I agree with you in that " clone overlap is supposed to be tracking >>>>>>>>>> which data is the same on disk". >>>>>>>>>> >>>>>>>>>> My thought is that, "ObjectContext::new_snapset.clones" is already an >>>>>>>>>> indicator about whether there are clone objects on disk, so, in the >>>>>>>>>> scenario of "cache tier", although a clone oid does not corresponds to >>>>>>>>>> a "present clone" in cache tier, as long as >>>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must a one >>>>>>>>>> such clone object in the base tier. And, as long as >>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict "one-to-one" >>>>>>>>>> correspondence to "ObjectContext::new_snapset.clone_overlap", passing >>>>>>>>>> the condition check "if (ctx->new_snapset.clones.size() > 0)" is >>>>>>>>>> enough to make the judgement that the clone object exists. >>>>>>>>>> >>>>>>>>>> So, if I'm right, passing the condition check "if >>>>>>>>>> (ctx->new_snapset.clones.size() > 0)" is already enough for us to do >>>>>>>>>> "newest_overlap.subtract(ctx->modified_ranges)", it doesn't have to >>>>>>>>>> pass "is_present_clone". >>>>>>>>>> >>>>>>>>>> Am I right about this? Or am I missing anything? >>>>>>>>>> >>>>>>>>>> Please help us, thank you:-) >>>>>>>> -- >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Jason >>>>> >>>>> >>>>> >>>>> -- >>>>> Jason >> >> >> >> -- >> Jason -- Jason -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html