Re: About the problem "export_diff relies on clone_overlap, which is lost when cache tier is enabled"

Is clone_overlap used only for calculating the changed subsets between
clones, or is it also used for other purposes?

On 10 August 2017 at 16:11, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
> By the way, I'm doing these tests on version 10.2.5.
>
> On 10 August 2017 at 15:06, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>> Hi, Jason.
>>
>> I did a test, and it turned out that, after flushing the object out of
>> the cache tier, the clone overlap in the base tier became empty, as shown
>> below. I think this may be because the flush operation just marks the
>> whole range of the flushed object as modified, so if the object's size
>> hasn't changed, the overlap becomes empty. Is this right?
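>>
>> To make the suspicion concrete, here is a rough standalone sketch (my own
>> simplification with plain integers, not the real interval_set or any
>> actual Ceph code): if the flush dirties the entire object extent [0,
>> size), then no part of any overlap interval inside the object can survive
>> the subtraction, so the overlap prints as "[]".
>>
>> // rough sketch only, not Ceph code
>> #include <algorithm>
>> #include <cassert>
>> #include <cstdint>
>>
>> // bytes of [o, o+l) that survive after removing the extent [0, size)
>> static uint64_t surviving_bytes(uint64_t o, uint64_t l, uint64_t size) {
>>   const uint64_t end = o + l;
>>   return end > size ? end - std::max(o, size) : 0;
>> }
>>
>> int main() {
>>   const uint64_t size = 4194303;  // object size from the dump below
>>   // every overlap interval lies inside [0, size), so nothing survives
>>   // once the whole extent is marked modified
>>   assert(surviving_bytes(0, size, size) == 0);
>>   assert(surviving_bytes(1024, 4096, size) == 0);
>> }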
>>
>> Thank you:-)
>>
>> {
>>     "id": {
>>         "oid": "test.obj",
>>         "key": "",
>>         "snapid": -2,
>>         "hash": 3575411564,
>>         "max": 0,
>>         "pool": 10,
>>         "namespace": "",
>>         "max": 0
>>     },
>>     "info": {
>>         "oid": {
>>             "oid": "test.obj",
>>             "key": "",
>>             "snapid": -2,
>>             "hash": 3575411564,
>>             "max": 0,
>>             "pool": 10,
>>             "namespace": ""
>>         },
>>         "version": "4876'9",
>>         "prior_version": "4854'8",
>>         "last_reqid": "osd.35.4869:1",
>>         "user_version": 16,
>>         "size": 4194303,
>>         "mtime": "2017-08-10 14:54:56.087387",
>>         "local_mtime": "2017-08-10 14:59:15.252755",
>>         "lost": 0,
>>         "flags": 52,
>>         "snaps": [],
>>         "truncate_seq": 0,
>>         "truncate_size": 0,
>>         "data_digest": 2827420887,
>>         "omap_digest": 4294967295,
>>         "watchers": {}
>>     },
>>     "stat": {
>>         "size": 4194303,
>>         "blksize": 4096,
>>         "blocks": 8200,
>>         "nlink": 1
>>     },
>>     "SnapSet": {
>>         "snap_context": {
>>             "seq": 3,
>>             "snaps": [
>>                 3
>>             ]
>>         },
>>         "head_exists": 1,
>>         "clones": [
>>             {
>>                 "snap": 3,
>>                 "size": 4194303,
>>                 "overlap": "[]"
>>             }
>>         ]
>>     }
>> }
>>
>> On 9 August 2017 at 23:26, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>> If you flush the object out of the cache tier so that its changes are
>>> recorded in the base tier, is the overlap correctly recorded in the
>>> base tier?
>>>
>>> On Wed, Aug 9, 2017 at 12:24 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>> By the way, according to our test, since the modified range is recorded
>>>> neither in the cache tier nor in the base tier, I think proxying the
>>>> "list-snaps" op down to the base tier might not work either. Is that
>>>> right?
>>>>
>>>> On 9 August 2017 at 12:20, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>> Sorry, I didn't "reply all"....:-)
>>>>>
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Xuehan Xu <xxhdx1985126@xxxxxxxxx>
>>>>> Date: 9 August 2017 at 12:14
>>>>> Subject: Re: About the problem "export_diff relies on clone_overlap,
>>>>> which is lost when cache tier is enabled"
>>>>> To: dillaman@xxxxxxxxxx
>>>>>
>>>>>
>>>>> Um, no, I don't think they are related.
>>>>>
>>>>> My problem is this:
>>>>>
>>>>> At first, we tried to use "rbd export-diff" to do incremental backups
>>>>> on a Jewel-version Ceph cluster with a cache tier enabled. However,
>>>>> when we compared the original rbd image and the backup rbd image, we
>>>>> found that they were different. After a series of debugging sessions,
>>>>> we found that this is because a WRITE op's "modified_range" is not
>>>>> subtracted from the clone overlap of its target object's HEAD version
>>>>> when that HEAD version is in the cache tier and its most recent clone
>>>>> object is not, which leads to a miscalculation in the
>>>>> "calc_snap_set_diff" function.
>>>>>
>>>>> For example, we did the following test: we first created a snap of an
>>>>> rbd image "test.2.img" whose size is only 4MB, which means it contains
>>>>> only one object; then we sent a series of AioWrites to "test.2.img" to
>>>>> promote its HEAD object into the cache tier, while leaving its clone
>>>>> object in the base tier only; at that point, we used
>>>>> "ceph-objectstore-tool" to dump the object we wrote to, and the result
>>>>> was as follows:
>>>>>
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 4,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 4,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4536'2728",
>>>>>         "prior_version": "4536'2727",
>>>>>         "last_reqid": "client.174858.0:10",
>>>>>         "user_version": 14706,
>>>>>         "size": 68,
>>>>>         "mtime": "2017-08-09 11:30:18.263983",
>>>>>         "local_mtime": "2017-08-09 11:30:18.264310",
>>>>>         "lost": 0,
>>>>>         "flags": 4,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 68,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> In this result, we found that the overlap for clone "28" is [0~64]. So
>>>>> we specifically sent an AioWrite request to this object, writing 4
>>>>> bytes of random data at offset 32, after which we expected the overlap
>>>>> to change to [0~32, 36~64]. However, the result was as follows:
>>>>>
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 4,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 4,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4546'2730",
>>>>>         "prior_version": "4538'2729",
>>>>>         "last_reqid": "client.155555.0:10",
>>>>>         "user_version": 14708,
>>>>>         "size": 4096,
>>>>>         "mtime": "2017-08-09 11:36:20.717910",
>>>>>         "local_mtime": "2017-08-09 11:36:20.719039",
>>>>>         "lost": 0,
>>>>>         "flags": 4,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 4096,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> It's obvious that the overlap didn't change at all. If we do an
>>>>> export-diff under this circumstance, the result will be wrong.
>>>>> Meanwhile, in the base tier, the "ceph-objectstore-tool" dump also
>>>>> showed that the overlap recorded there didn't change either:
>>>>> {
>>>>>     "id": {
>>>>>         "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>         "key": "",
>>>>>         "snapid": -2,
>>>>>         "hash": 2375431681,
>>>>>         "max": 0,
>>>>>         "pool": 3,
>>>>>         "namespace": "",
>>>>>         "max": 0
>>>>>     },
>>>>>     "info": {
>>>>>         "oid": {
>>>>>             "oid": "rbd_data.2aae62ae8944a.0000000000000000",
>>>>>             "key": "",
>>>>>             "snapid": -2,
>>>>>             "hash": 2375431681,
>>>>>             "max": 0,
>>>>>             "pool": 3,
>>>>>             "namespace": ""
>>>>>         },
>>>>>         "version": "4536'14459",
>>>>>         "prior_version": "4536'14458",
>>>>>         "last_reqid": "client.174834.0:10",
>>>>>         "user_version": 14648,
>>>>>         "size": 68,
>>>>>         "mtime": "2017-08-09 11:30:01.412634",
>>>>>         "local_mtime": "2017-08-09 11:30:01.413614",
>>>>>         "lost": 0,
>>>>>         "flags": 36,
>>>>>         "snaps": [],
>>>>>         "truncate_seq": 0,
>>>>>         "truncate_size": 0,
>>>>>         "data_digest": 4294967295,
>>>>>         "omap_digest": 4294967295,
>>>>>         "watchers": {}
>>>>>     },
>>>>>     "stat": {
>>>>>         "size": 68,
>>>>>         "blksize": 4096,
>>>>>         "blocks": 16,
>>>>>         "nlink": 1
>>>>>     },
>>>>>     "SnapSet": {
>>>>>         "snap_context": {
>>>>>             "seq": 28,
>>>>>             "snaps": [
>>>>>                 28
>>>>>             ]
>>>>>         },
>>>>>         "head_exists": 1,
>>>>>         "clones": [
>>>>>             {
>>>>>                 "snap": 28,
>>>>>                 "size": 68,
>>>>>                 "overlap": "[0~64]"
>>>>>             }
>>>>>         ]
>>>>>     }
>>>>> }
>>>>>
>>>>> Then we turned to the source code to find the reason. We found that,
>>>>> in the ReplicatedPG::make_writeable method, for a write op's modified
>>>>> range to be subtracted from the clone overlap, two condition checks
>>>>> have to pass: "ctx->new_snapset.clones.size() > 0" and
>>>>> "is_present_clone(last_clone_oid)", as the code below shows.
>>>>>
>>>>>   // update most recent clone_overlap and usage stats
>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>>      * been evicted, it's not included in the stats */
>>>>>     hobject_t last_clone_oid = soid;
>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>       interval_set<uint64_t> &newest_overlap =
>>>>>         ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>       ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>       // modified_ranges is still in use by the clone
>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>       newest_overlap.subtract(ctx->modified_ranges);
>>>>>     }
>>>>>   }
>>>>>
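>>>>> To illustrate the effect of that gate, here is a rough standalone
>>>>> sketch (my own simplification with a plain std::map standing in for
>>>>> interval_set<uint64_t>; it is not actual Ceph code): when the most
>>>>> recent clone object has been evicted from the cache tier,
>>>>> is_present_clone() returns false, the subtraction is skipped, and the
>>>>> overlap stays "[0~64]" instead of shrinking to bytes [0,32) and [36,64).
>>>>>
>>>>>   // rough sketch only, not Ceph code
>>>>>   #include <algorithm>
>>>>>   #include <cstdint>
>>>>>   #include <iostream>
>>>>>   #include <map>
>>>>>   #include <utility>
>>>>>
>>>>>   using intervals = std::map<uint64_t, uint64_t>;  // offset -> length
>>>>>
>>>>>   // remove the extent [off, off + len) from every interval in the set
>>>>>   static void subtract_extent(intervals &set, uint64_t off, uint64_t len) {
>>>>>     intervals out;
>>>>>     const uint64_t end = off + len;
>>>>>     for (auto [o, l] : set) {
>>>>>       const uint64_t e = o + l;
>>>>>       if (o < off)
>>>>>         out[o] = std::min(e, off) - o;                 // left piece
>>>>>       if (e > end)
>>>>>         out[std::max(o, end)] = e - std::max(o, end);  // right piece
>>>>>     }
>>>>>     set = std::move(out);
>>>>>   }
>>>>>
>>>>>   int main() {
>>>>>     intervals newest_overlap = {{0, 64}};  // clone 28's overlap "[0~64]"
>>>>>     // a 4-byte write at offset 32; with a cache tier the clone object
>>>>>     // lives only in the base tier, so the gate below evaluates to false
>>>>>     const bool clone_present_in_this_tier = false;
>>>>>     if (clone_present_in_this_tier)        // mirrors is_present_clone()
>>>>>       subtract_extent(newest_overlap, 32, 4);
>>>>>     for (auto [o, l] : newest_overlap)
>>>>>       std::cout << o << "~" << l << "\n";  // prints "0~64": unchanged
>>>>>     // without the gate, the subtraction would leave 0~32 and 36~28,
>>>>>     // i.e. bytes [0,32) and [36,64), which is what we expected to see
>>>>>   }
>>>>>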
>>>>> We thought that the latter check, "is_present_clone(last_clone_oid)",
>>>>> might not be a reasonable basis for deciding whether to subtract the
>>>>> write op's modified range from the clone overlap, so we changed the
>>>>> code above to the following, moving the subtraction out of that check,
>>>>> and submitted a PR for it: https://github.com/ceph/ceph/pull/16790
>>>>>
>>>>>   // update most recent clone_overlap and usage stats
>>>>>   if (ctx->new_snapset.clones.size() > 0) {
>>>>>     /* we need to check whether the most recent clone exists, if it's
>>>>>      * been evicted, it's not included in the stats */
>>>>>     hobject_t last_clone_oid = soid;
>>>>>     last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;
>>>>>     interval_set<uint64_t> &newest_overlap =
>>>>>       ctx->new_snapset.clone_overlap.rbegin()->second;
>>>>>     ctx->modified_ranges.intersection_of(newest_overlap);
>>>>>     newest_overlap.subtract(ctx->modified_ranges);
>>>>>
>>>>>     if (is_present_clone(last_clone_oid)) {
>>>>>       // modified_ranges is still in use by the clone
>>>>>       add_interval_usage(ctx->modified_ranges, ctx->delta_stats);
>>>>>     }
>>>>>   }
>>>>>
>>>>> In our tests, this change solved the problem. However, we can't yet
>>>>> confirm that it won't cause any new problems, so here we are,
>>>>> discussing how best to solve this:-)
>>>>>
>>>>> On 9 August 2017 at 09:26, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>>>>> Is this issue related to https://github.com/ceph/ceph/pull/10626?
>>>>>>
>>>>>> On Mon, Aug 7, 2017 at 8:07 PM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>>>> OK, I'll try that. Thank you:-)
>>>>>>>
>>>>>>> On 8 August 2017 at 04:48, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>>>>>>> Could you just proxy the "list snaps" op from the cache tier down to
>>>>>>>> the base tier and combine the cache tier + base tier results? Reading
>>>>>>>> the associated ticket, it seems kludgy to me to attempt to work around
>>>>>>>> this within librbd by just refusing to provide intra-object diffs if
>>>>>>>> cache tiering is in use.
>>>>>>>>
>>>>>>>> On Sat, Aug 5, 2017 at 11:56 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>>>>>> Hi, everyone.
>>>>>>>>>
>>>>>>>>> Trying to solve the issue http://tracker.ceph.com/issues/20896, I
>>>>>>>>> just did another test: I did some writes to an object
>>>>>>>>> "rbd_data.1ebc6238e1f29.0000000000000000" to promote its "HEAD"
>>>>>>>>> object into the cache tier, after which I specifically wrote 4 bytes
>>>>>>>>> of random data at offset 0x40. Then I used "ceph-objectstore-tool"
>>>>>>>>> to dump its "HEAD" version in the base tier; the result is as
>>>>>>>>> follows (before I promoted it to the cache tier, there were three
>>>>>>>>> snaps, the latest of which is 26):
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>     "id": {
>>>>>>>>>         "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>         "key": "",
>>>>>>>>>         "snapid": -2,
>>>>>>>>>         "hash": 1655893237,
>>>>>>>>>         "max": 0,
>>>>>>>>>         "pool": 3,
>>>>>>>>>         "namespace": "",
>>>>>>>>>         "max": 0
>>>>>>>>>     },
>>>>>>>>>     "info": {
>>>>>>>>>         "oid": {
>>>>>>>>>             "oid": "rbd_data.1ebc6238e1f29.0000000000000000",
>>>>>>>>>             "key": "",
>>>>>>>>>             "snapid": -2,
>>>>>>>>>             "hash": 1655893237,
>>>>>>>>>             "max": 0,
>>>>>>>>>             "pool": 3,
>>>>>>>>>             "namespace": ""
>>>>>>>>>         },
>>>>>>>>>         "version": "4219'16423",
>>>>>>>>>         "prior_version": "3978'16310",
>>>>>>>>>         "last_reqid": "osd.70.4213:2359",
>>>>>>>>>         "user_version": 17205,
>>>>>>>>>         "size": 4194304,
>>>>>>>>>         "mtime": "2017-08-03 22:07:34.656122",
>>>>>>>>>         "local_mtime": "2017-08-05 23:02:33.628734",
>>>>>>>>>         "lost": 0,
>>>>>>>>>         "flags": 52,
>>>>>>>>>         "snaps": [],
>>>>>>>>>         "truncate_seq": 0,
>>>>>>>>>         "truncate_size": 0,
>>>>>>>>>         "data_digest": 2822203961,
>>>>>>>>>         "omap_digest": 4294967295,
>>>>>>>>>         "watchers": {}
>>>>>>>>>     },
>>>>>>>>>     "stat": {
>>>>>>>>>         "size": 4194304,
>>>>>>>>>         "blksize": 4096,
>>>>>>>>>         "blocks": 8200,
>>>>>>>>>         "nlink": 1
>>>>>>>>>     },
>>>>>>>>>     "SnapSet": {
>>>>>>>>>         "snap_context": {
>>>>>>>>>             "seq": 26,
>>>>>>>>>             "snaps": [
>>>>>>>>>                 26,
>>>>>>>>>                 25,
>>>>>>>>>                 16
>>>>>>>>>             ]
>>>>>>>>>         },
>>>>>>>>>         "head_exists": 1,
>>>>>>>>>         "clones": [
>>>>>>>>>             {
>>>>>>>>>                 "snap": 16,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[4~4194300]"
>>>>>>>>>             },
>>>>>>>>>             {
>>>>>>>>>                 "snap": 25,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>             },
>>>>>>>>>             {
>>>>>>>>>                 "snap": 26,
>>>>>>>>>                 "size": 4194304,
>>>>>>>>>                 "overlap": "[]"
>>>>>>>>>             }
>>>>>>>>>         ]
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> As we can see, the clone_overlap for snap 26 is empty, which,
>>>>>>>>> combined with the previous test described in
>>>>>>>>> http://tracker.ceph.com/issues/20896, means that the writes'
>>>>>>>>> "modified range" is recorded neither in the cache tier nor in the
>>>>>>>>> base tier.
>>>>>>>>>
>>>>>>>>> I think maybe we really should move the clone overlap modification
>>>>>>>>> out of the if-block guarded by the "is_present_clone" condition
>>>>>>>>> check. For now, I can't see any other way to fix this problem.
>>>>>>>>>
>>>>>>>>> Am I right about this?
>>>>>>>>>
>>>>>>>>> On 4 August 2017 at 03:14, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>>>>>>> I mean that I think it's the condition check "is_present_clone"
>>>>>>>>>> that prevents the clone overlap from recording the modified range
>>>>>>>>>> of client write operations when the target "HEAD" object exists
>>>>>>>>>> without its most recent clone object. If I'm right, simply moving
>>>>>>>>>> the clone overlap modification out of the "is_present_clone"
>>>>>>>>>> condition check block, as in the PR
>>>>>>>>>> https://github.com/ceph/ceph/pull/16790, is enough to solve this
>>>>>>>>>> case, and the fix shouldn't cause other problems.
>>>>>>>>>>
>>>>>>>>>> In our tests, this fix solved the problem; however, we can't yet
>>>>>>>>>> confirm that it won't cause new problems.
>>>>>>>>>>
>>>>>>>>>> So if anyone sees this and knows the answer, please help us. Thank
>>>>>>>>>> you:-)
>>>>>>>>>>
>>>>>>>>>> On 4 August 2017 at 11:41, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>>>>>>>>>> Hi, grep:-)
>>>>>>>>>>>
>>>>>>>>>>> I finally got what you mean in https://github.com/ceph/ceph/pull/16790.
>>>>>>>>>>>
>>>>>>>>>>> I agree with you that "clone overlap is supposed to be tracking
>>>>>>>>>>> which data is the same on disk".
>>>>>>>>>>>
>>>>>>>>>>> My thought is that "ObjectContext::new_snapset.clones" is already
>>>>>>>>>>> an indicator of whether there are clone objects on disk. So, in
>>>>>>>>>>> the cache-tier scenario, although a clone oid may not correspond
>>>>>>>>>>> to a "present clone" in the cache tier, as long as
>>>>>>>>>>> "ObjectContext::new_snapset.clones" is not empty, there must be
>>>>>>>>>>> such a clone object in the base tier. And, as long as
>>>>>>>>>>> "ObjectContext::new_snapset.clones" has a strict one-to-one
>>>>>>>>>>> correspondence with "ObjectContext::new_snapset.clone_overlap",
>>>>>>>>>>> passing the condition check "if (ctx->new_snapset.clones.size() >
>>>>>>>>>>> 0)" is enough to conclude that the clone object exists.
>>>>>>>>>>>
>>>>>>>>>>> So, if I'm right, passing the condition check
>>>>>>>>>>> "if (ctx->new_snapset.clones.size() > 0)" is already enough for us
>>>>>>>>>>> to do "newest_overlap.subtract(ctx->modified_ranges)"; it doesn't
>>>>>>>>>>> also have to pass "is_present_clone".
>>>>>>>>>>>
>>>>>>>>>>> Am I right about this? Or am I missing anything?
>>>>>>>>>>>
>>>>>>>>>>> Please help us, thank you:-)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jason
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jason
>>>
>>>
>>>
>>> --
>>> Jason