Re: [RFC]About ImageIndex

Haomai Wang <haomaiwang@xxxxxxxxx> · Wed, 13 Aug 2014 20:37:19 +0800

On Wed, Aug 13, 2014 at 10:14 AM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
> On 08/11/2014 07:50 PM, Haomai Wang wrote:
>>
>> Hi Sage, Josh:
>>
>> ImageIndex is aimed to hold each object's location info which avoid
>> extra checking for none-existing object. It's only used when image flags
>> exists LIBRBD_CREATE_NONSHARED. Otherwise, ImageIndex will become gawp and
>> has no effect.
>
>
> I like the general structure of the code, and it's great to hear you've
> been testing it with test_librbd_fsx. Some thoughts on the design:
>
>
>> Each object has three state:
>> 1. UNKNOWN: default value, it will follow origin path
>> 2. LOCAL: imply this object is local, don't need to lookup parent image
>> 3. PARENT: imply this object is in the parent image, don't need to read
>> from local image
>
>
> What about calling these unknown, exists, and nonexistent instead?
> This information can be used for non-clone use cases too. Internally
> there's another ImageCtx for each parent, which can have its own
> index, so I don't think there's a need to keep track of any objects
> from different images in one ImageCtx.

The names seemed better. But what's mean of latter sentences?
Now a image's ImageIndex only track own objects.

>
>
>> Note: ImageIndex isn't full sync to real data all the time. Because the
>> transformation {"unknown" -> "local", "unknown" -> "parent"} are safe.
>> So We only need to handle with the exception when ImageIndex implies
>> this object is "parent" but the real data is "local". There exists three
>> methods to solve it:
>> 1. flush `state_map` every time when "parent" -> "local" happened
>> 2. mark all objects from "parent" state to "unknown“ state when loading
>> image index(not including snapshot which has frozen index).
>
>
> At the end of the CDS session I was thinking that implementing it
> strictly would be better, so that you could always trust the on-disk copy of
> the index. In particular we were worried about the complexity of
> handling a non-strict index during things like live migration, and
> being able to use it to speed up management operations more reliably.
>
> Since creating new objects (or discarding entire existing objects) is
> pretty rare, it seems worth the extra I/Os to update the index
> synchronously in these cases. To be fully safe I think any write to
> a new object or discard of a full object would need to change index
> state to unknown, do the write/discard, and finally change the index
> state to exists/nonexistent. There might be some ways to optimize that
> though.

It make sense to me.

>
>
>> Here choose to implement method 2. This method only allow 2 read ops
>> in one read request at max and without overhead. Usually, librbd will
>> open image for many days(months) for normal use case such as VM usage.
>> So the image index will be warmed up and became smart when processing
>> ops.
>>
>> Except image state changed problem, another concern is size. Image
>> index only permit single client write, but resize/flatten/rollback ops
>> are allowed to happen concurrently. For simply, now these ops don't
>> change and save states into rados. Resize op will affect "size" and
>> current write client will be notified.
>>
>> Below listing object state change scenes:
>> 1. When clone from image, it will mark all objects as "parent"
>
>
> We can do this for any new image (mark all objects non-existent
> initially), not just clones.
>
>
>> 2. When creating snapshot, image index will be freeze and save it as the
>> index of the snapshot. All "parent" state objects will be marked as
>> "unknown" for safe
>
>
> One implementation detail here - I think it'd be good to store snapshot
> indexes in separate objects so that having many snapshots does not cause
> the header object to become too large.

Good point!

>
> To recover from a partial snapshot, where the header was updated but
> the index wasn't saved, it would be useful to have a rbd cli command to
> repair (or optionally just check) the index for a snapshot.

Yes, it's a good idea.

>
>
>> 3. When write(including modified op) a object, it will mark this object
>> as "local"
>> 4. When reading object, the current image object will be always read in
>> spite of the state. And the parent image's object will be checked and
>> trust the state. If exists parent image but read local object
>> successfully, the local object will be marked "local"
>>
>> The principle is that only exists one client can operate and save the
>> ImageIndex changes into rados. Now the challenge is that how to decide
>> who is the owner of ImageIndex. Like VM usage, qemu will open image to
>> do IO ops and externally user also can do create snapshot for the
>> opened image(rbd snap create ...). So it will exists two client modify
>> the same image, one can be regarded as "owner client" another is
>> "management client". "management client" is expected to not change the
>> state of object.
>>
>> I only come up with a idea that user need to call "set_owner_image"
>> when client want to become "owner client" and this client can operate
>> ImageIndex.
>
>
> This is a great idea for all header updates. I've been thinking about
> the same thing for rbd mirroring, and Sage started working on some
> watch/notify changes that will help make this robust:

Hmm, I will dive into it.

>
> http://pad.ceph.com/p/watch-notify
>
>
>> PR(https://github.com/ceph/ceph/pull/2212)
>> I have passed test_librbd and test_librbd_fsx tests. The IO logic
>> seemed worked as expected.
>
>
> I haven't had time to look at the code in a lot of detail yet, but
> it'd be nice to use just 2 bits per object on disk. I'm on vacation
> the next couple weeks, but Sage may have more feedback.
>
> Josh

Thanks for your reply, Josh!

-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html