We discussed a great deal of this during the initial format 2 work as well, when we were thinking about having bitmaps of allocated space. (Although we also have interval sets, which might be a better fit?) I think there was more thought behind it than is in the copy-on-read blueprint; do you know if we have it written down anywhere, Josh?
-Greg

On Tue, Jun 10, 2014 at 12:38 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
> On Tue, 10 Jun 2014 14:52:54 +0800
> Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>
>> Thanks, Josh!
>>
>> Your points are really helpful. Maybe we can schedule this bp for the upcoming CDS? I hope the implementation can have a significant performance benefit for librbd.
>
> It'd be great to discuss it more at CDS. Could you add a blueprint for it on the wiki:
>
> https://wiki.ceph.com/Planning/Blueprints/Submissions
>
> Josh
>
>> On Tue, Jun 10, 2014 at 9:16 AM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
>> > On 06/05/2014 12:01 AM, Haomai Wang wrote:
>> >> Hi,
>> >> Previously I sent a mail about the difficulty of rbd snapshot size statistics. The main solution is to use an object map to store the changes. The problem is that we can't handle concurrent modification by multiple clients.
>> >>
>> >> The lack of an object map (like the pointer map in qcow2) causes many problems in librbd, such as clone depth: a deep clone chain causes significant latency, and each level of cloning roughly doubles it.
>> >>
>> >> I'd like to make a tradeoff between multi-client support and single-client support in librbd. In practice, most volumes/images are used by VMs, where only one client accesses/modifies the image. We shouldn't make shared images possible at the cost of making the most common use cases worse. So we could add a new flag called "shared" when creating an image. If "shared" is false, librbd will maintain an object map for each image. The object map is meant to be durable: each image_close call will store the map into rados. If a client crashes and fails to dump the object map, the next client to open the image will consider the object map out of date and reset it.
>> >>
>> >> The advantages of this feature are easy to see:
>> >> 1. Avoid the clone performance problem
>> >> 2. Make snapshot statistics possible
>> >> 3. Improve librbd operation performance, including read and copy-on-write operations
>> >>
>> >> What do you think? More feedback is appreciated!
>> >
>> > I think it's a great idea! We discussed this a little at the last CDS [1]. I like the idea of the shared flag on an image. Since the vastly more common case is single-client, I'd go further and suggest that we treat images as if shared is false by default if the flag is not present (perhaps with a config option to change this default behavior).
>> >
>> > That way existing images can benefit from the feature without extra configuration. There can be an rbd command to toggle the shared flag as well, so users of ocfs2 or gfs2 or other multi-client-writing systems can upgrade and set shared to true before restarting their clients.
>> >
>> > Another thing to consider is the granularity of the object map. The coarse granularity of a bitmap of object existence would be simplest, and most useful for in-memory comparison for clones. For statistics it might be desirable in the future to have a finer-grained index of data existence in the image. To make that easy to handle, the on-disk format could be a list of extents (byte ranges).
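For illustration, the open/close lifecycle proposed above could look roughly like the sketch below. The types and names here are hypothetical, not actual librbd code, assuming a coarse map with one entry per RADOS object and a rebuild path for the crashed-client case.

// Sketch only: hypothetical in-memory object map for a non-shared image.
// One entry per RADOS object backing the image; persisted on image_close.
#include <cstdint>
#include <vector>

enum class ObjectState : uint8_t {
  NONEXISTENT = 0,   // object has never been written
  EXISTS      = 1,   // object definitely exists
};

struct ObjectMap {
  std::vector<ObjectState> states;  // index = object number within the image
  bool clean = false;               // true only if the last writer closed cleanly
};

// On image open: if the stored map was not marked clean (the previous client
// crashed before flushing it), treat it as stale and reset it, then rebuild
// by listing the image's objects; otherwise load it and clear the clean flag
// so a crash while we hold the image is detectable.
ObjectMap open_object_map(ObjectMap stored, uint64_t num_objects) {
  if (!stored.clean || stored.states.size() != num_objects) {
    stored.states.assign(num_objects, ObjectState::NONEXISTENT);
    // ...re-scan the image's existing objects here to repopulate the map...
  }
  stored.clean = false;  // persisted as "in use" until a clean close
  return stored;
}

// On image_close: flush the map back to rados with the clean flag set.
void close_object_map(ObjectMap &map) {
  map.clean = true;
  // ...write the encoded map to its rados object here...
}

The important property is that a map found without the clean flag is never trusted.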
>> >
>> > Another potential use case would be a mode in which the index is treated as authoritative. This could make discard very fast, for example. I'm not sure it could be done safely with only binary 'exists/does not exist' information though - a third 'unknown' state might be needed for some cases. If this kind of index is actually useful (I'm not sure there are cases where the performance penalty would be worth it), we could add a new index format if we need it.
>> >
>> > Back to the currently proposed design, to be safe with live migration we'd need to make sure the index is consistent in the destination process. Using rados_notify() after we set the clean flag on the index can make the destination vm re-read the index before any I/O happens. This might be a good time to introduce a data payload to the notify as well, so we can re-read only the index, instead of all the header metadata. Rereading the index after cache invalidation and wiring that up through qemu's bdrv_invalidate() would be even better.
>> >
>> > There's more to consider in implementing this wrt snapshots, but this email has gone on long enough.
>> >
>> > Josh
>> >
>> > [1] http://pad.ceph.com/p/cdsgiant-rbd-copy-on-read-for-clones
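As an aside on the finer-grained index mentioned above, a minimal sketch of what an extent-based index could look like and the snapshot-size statistic it would enable; the types here are hypothetical, not an actual RBD on-disk format.

// Sketch only: a finer-grained index of written data kept as a sorted,
// non-overlapping list of byte ranges rather than a per-object bitmap.
#include <cstdint>
#include <vector>

struct Extent {
  uint64_t offset;  // byte offset within the image
  uint64_t length;  // bytes written starting at offset
};

// The space a snapshot or image actually occupies is just the sum of its
// extent lengths, which is what makes per-snapshot size statistics cheap.
uint64_t used_bytes(const std::vector<Extent> &index) {
  uint64_t total = 0;
  for (const Extent &e : index)
    total += e.length;
  return total;
}

// If the index were ever treated as authoritative (e.g. to short-circuit
// discard), binary exists/does-not-exist information is not enough; a third
// 'unknown' state would be needed for ranges whose status has not been
// confirmed, e.g. while the index is being rebuilt.
enum class RangeState : uint8_t { UNKNOWN = 0, NONEXISTENT = 1, EXISTS = 2 };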