Hi all,

I have watched the video of the discussion at Ceph CDS. By the way,
sorry for my absence; something urgent came up.

It seems that we have two ways to implement it: one lightweight, the
other more complex. I prefer the simple one, which invalidates the cache
and lets librbd reload/lazy-load the object state. The most important
part is implementing a performance-optimized index (ObjectMap).

Has there been any progress, Josh? I think we could push this further
based on the discussion. Or did I miss something?

On Wed, Jun 11, 2014 at 12:01 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> We discussed a great deal of this during the initial format 2 work as
> well, when we were thinking about having bitmaps of allocated space.
> (Although we also have interval sets which might be a better fit?) I
> think there was more thought behind it than is in the copy-on-read
> blueprint; do you know if we have it written down anywhere, Josh?
> -Greg
>
> On Tue, Jun 10, 2014 at 12:38 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
>> On Tue, 10 Jun 2014 14:52:54 +0800
>> Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>>
>>> Thanks, Josh!
>>>
>>> Your points are really helpful. Maybe we can schedule this bp for the
>>> upcoming CDS? I hope the implementation can have a great effect on
>>> librbd performance.
>>
>> It'd be great to discuss it more at CDS. Could you add a blueprint for
>> it on the wiki:
>>
>> https://wiki.ceph.com/Planning/Blueprints/Submissions
>>
>> Josh
>>
>>> On Tue, Jun 10, 2014 at 9:16 AM, Josh Durgin
>>> <josh.durgin@xxxxxxxxxxx> wrote:
>>> > On 06/05/2014 12:01 AM, Haomai Wang wrote:
>>> >> Hi,
>>> >> Previously I sent a mail about the difficulty of rbd snapshot size
>>> >> statistics. The main solution is to use an object map to store the
>>> >> changes. The problem is that we can't handle concurrent
>>> >> modification by multiple clients.
>>> >>
>>> >> The lack of an object map (like the pointer map in qcow2) causes
>>> >> many problems in librbd.
>>> >> One example is clone depth: a deep clone chain
>>> >> causes noticeable latency. Usually each clone layer will increase
>>> >> the latency by two times.
>>> >>
>>> >> I am considering a tradeoff between multi-client support and
>>> >> single-client support for librbd. In practice, most
>>> >> volumes/images are used by a VM, so only one client will
>>> >> access/modify the image. We shouldn't make shared images possible
>>> >> at the cost of making most use cases worse. So we can add a new
>>> >> flag called "shared" when creating an image. If "shared" is false,
>>> >> librbd will maintain an object map for each image. The object map
>>> >> is meant to be durable: each image_close call will store the map
>>> >> into rados. If the client crashes and fails to dump the object
>>> >> map, the next client that opens the image will treat the object
>>> >> map as out of date and reset it.
>>> >>
>>> >> The advantages of this feature are easy to see:
>>> >> 1. Avoid the clone performance problem
>>> >> 2. Make snapshot statistics possible
>>> >> 3. Improve librbd operation performance, including read and
>>> >> copy-on-write operations.
>>> >>
>>> >> What do you think of the above? More feedback is appreciated!
>>> >
>>> > I think it's a great idea! We discussed this a little at the last
>>> > cds [1]. I like the idea of the shared flag on an image. Since the
>>> > vastly more common case is single-client, I'd go further and
>>> > suggest that we treat images as if shared is false by default if
>>> > the flag is not present (perhaps with a config option to change
>>> > this default behavior).
>>> >
>>> > That way existing images can benefit from the feature without extra
>>> > configuration. There can be an rbd command to toggle the shared
>>> > flag as well, so users of ocfs2 or gfs2 or other
>>> > multi-client-writing systems can upgrade and set shared to true
>>> > before restarting their clients.
>>> >
>>> > Another thing to consider is the granularity of the object map.
>>> > The coarse granularity of a bitmap of object existence would be
>>> > simplest, and most useful for in-memory comparison for clones. For
>>> > statistics it might be desirable in the future to have a
>>> > finer-grained index of data existence in the image. To make that
>>> > easy to handle, the on-disk format could be a list of extents (byte
>>> > ranges).
>>> >
>>> > Another potential use case would be a mode in which the index is
>>> > treated as authoritative. This could make discard very fast, for
>>> > example. I'm not sure it could be done safely with only binary
>>> > 'exists/does not exist' information though - a third 'unknown' state
>>> > might be needed for some cases. If this kind of index is actually
>>> > useful (I'm not sure there are cases where the performance penalty
>>> > would be worth it), we could add a new index format if we need it.
>>> >
>>> > Back to the currently proposed design, to be safe with live
>>> > migration we'd need to make sure the index is consistent in the
>>> > destination process. Using rados_notify() after we set the clean
>>> > flag on the index can make the destination vm re-read the index
>>> > before any I/O happens. This might be a good time to introduce a
>>> > data payload to the notify as well, so we can only re-read the
>>> > index, instead of all the header metadata. Rereading the index
>>> > after cache invalidation and wiring that up through qemu's
>>> > bdrv_invalidate() would be even better.
>>> >
>>> > There's more to consider in implementing this wrt snapshots, but
>>> > this email has gone on long enough.
>>> >
>>> > Josh
>>> >
>>> > [1] http://pad.ceph.com/p/cdsgiant-rbd-copy-on-read-for-clones
>>>
>>>
>>>
>>

--
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
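
P.S. To make the "store on close, reset after a crash" idea from the
quoted thread a bit more concrete, here is a rough sketch in Python. It
is not librbd code and not a proposed on-disk format. The names
ObjectMap, ObjectState, rebuild() and the two store keys are all
hypothetical, and a plain dict stands in for rados. It only assumes what
the thread describes: a per-image map of object existence, a clean flag
written when the image is closed, and an 'unknown' state for when the
stored map cannot be trusted.

```python
from enum import IntEnum


class ObjectState(IntEnum):
    NONEXISTENT = 0
    EXISTS = 1
    UNKNOWN = 2  # map cannot be trusted; the caller must ask the OSD


class ObjectMap:
    """Hypothetical single-client object map; `store` is a dict standing in for rados."""

    def __init__(self, store, image, num_objects):
        self.store = store
        self.map_key = image + ".object_map"          # hypothetical key name
        self.clean_key = image + ".object_map.clean"  # hypothetical key name
        saved = store.get(self.map_key)
        if store.get(self.clean_key) and saved is not None and len(saved) == num_objects:
            # The previous client closed the image properly: trust the stored map.
            self.states = [ObjectState(b) for b in saved]
        else:
            # No map, or the previous client crashed before storing it: reset.
            self.states = [ObjectState.UNKNOWN] * num_objects
        # While the image is open, the stored copy may lag behind memory.
        store[self.clean_key] = False

    def rebuild(self, existing_objects):
        """Replace UNKNOWN entries with the result of a (simulated) full scan."""
        for objno in range(len(self.states)):
            exists = objno in existing_objects
            self.states[objno] = ObjectState.EXISTS if exists else ObjectState.NONEXISTENT

    def mark_written(self, objno):
        self.states[objno] = ObjectState.EXISTS

    def may_exist(self, objno):
        # Only a definite NONEXISTENT lets a clone read skip straight to the parent.
        return self.states[objno] != ObjectState.NONEXISTENT

    def close(self):
        # Store the map first, then set the clean flag.
        self.store[self.map_key] = bytes(self.states)
        self.store[self.clean_key] = True


store = {}
first = ObjectMap(store, "vm1", 4)   # nothing stored yet, so everything is UNKNOWN
first.rebuild({1})                   # pretend a scan found only object 1
first.mark_written(2)
first.close()

second = ObjectMap(store, "vm1", 4)  # clean close: the stored map is trusted
third = ObjectMap(store, "vm1", 4)   # `second` never closed: treated as a crash
```

Here `second` can skip objects 0 and 3 without touching the OSD, while
`third` falls back to UNKNOWN for every object because the clean flag
was never set again. A real implementation would also need the notify
step discussed in the quoted thread so that a live-migration destination
re-reads the map.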