We discussed a great deal of this during the initial format 2 work as well, when we were thinking about having bitmaps of allocated space. (Although we also have interval sets, which might be a better fit?) I think there was more thought behind it than is in the copy-on-read blueprint; do you know if we have it written down anywhere, Josh?
-Greg

On Tue, Jun 10, 2014 at 12:38 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
> On Tue, 10 Jun 2014 14:52:54 +0800
> Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>
>> Thanks, Josh!
>>
>> Your points are really helpful. Maybe we can schedule this bp for the upcoming CDS? I hope the implementation can have a significant performance benefit for librbd.
>
> It'd be great to discuss it more at CDS. Could you add a blueprint for it on the wiki:
>
> https://wiki.ceph.com/Planning/Blueprints/Submissions
>
> Josh
>
>> On Tue, Jun 10, 2014 at 9:16 AM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
>> > On 06/05/2014 12:01 AM, Haomai Wang wrote:
>> >> Hi,
>> >> Previously I sent a mail about the difficulty of rbd snapshot size statistics. The main solution is to use an object map to store the changes. The problem is that we can't handle concurrent modification by multiple clients.
>> >>
>> >> The lack of an object map (like the pointer map in qcow2) causes many problems in librbd, such as clone depth: a deep clone chain causes significant latency, and each level of cloning roughly doubles it.
>> >>
>> >> I'd like to make a tradeoff between multi-client support and single-client support in librbd. In practice, most volumes/images are used by VMs, where only one client accesses/modifies the image. We shouldn't make shared images possible at the cost of making the most common use cases worse. So we could add a new flag called "shared" when creating an image. If "shared" is false, librbd will maintain an object map for each image. The object map is meant to be durable: each image_close call will store the map into rados. If a client crashes and fails to dump the object map, the next client to open the image will consider the object map out of date and reset it.
>> >>
>> >> The advantages of this feature are easy to see:
>> >> 1. Avoid the clone performance problem
>> >> 2. Make snapshot statistics possible
>> >> 3. Improve librbd operation performance, including read and copy-on-write operations
>> >>
>> >> What do you think? More feedback is appreciated!
>> >
>> > I think it's a great idea! We discussed this a little at the last CDS [1]. I like the idea of the shared flag on an image. Since the vastly more common case is single-client, I'd go further and suggest that we treat images as if shared is false by default if the flag is not present (perhaps with a config option to change this default behavior).
>> >
>> > That way existing images can benefit from the feature without extra configuration. There can be an rbd command to toggle the shared flag as well, so users of ocfs2 or gfs2 or other multi-client-writing systems can upgrade and set shared to true before restarting their clients.
>> >
>> > Another thing to consider is the granularity of the object map. The coarse granularity of a bitmap of object existence would be simplest, and most useful for in-memory comparison for clones. For statistics it might be desirable in the future to have a finer-grained index of data existence in the image. To make that easy to handle, the on-disk format could be a list of extents (byte ranges).
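For illustration, the open/close lifecycle proposed above could look roughly like the sketch below. The types and names here are hypothetical, not actual librbd code, assuming a coarse map with one entry per RADOS object and a rebuild path for the crashed-client case.

// Sketch only: hypothetical in-memory object map for a non-shared image.
// One entry per RADOS object backing the image; persisted on image_close.
#include <cstdint>
#include <vector>

enum class ObjectState : uint8_t {
  NONEXISTENT = 0,   // object has never been written
  EXISTS      = 1,   // object definitely exists
};

struct ObjectMap {
  std::vector<ObjectState> states;  // index = object number within the image
  bool clean = false;               // true only if the last writer closed cleanly
};

// On image open: if the stored map was not marked clean (the previous client
// crashed before flushing it), treat it as stale and reset it, then rebuild
// by listing the image's objects; otherwise load it and clear the clean flag
// so a crash while we hold the image is detectable.
ObjectMap open_object_map(ObjectMap stored, uint64_t num_objects) {
  if (!stored.clean || stored.states.size() != num_objects) {
    stored.states.assign(num_objects, ObjectState::NONEXISTENT);
    // ...re-scan the image's existing objects here to repopulate the map...
  }
  stored.clean = false;  // persisted as "in use" until a clean close
  return stored;
}

// On image_close: flush the map back to rados with the clean flag set.
void close_object_map(ObjectMap &map) {
  map.clean = true;
  // ...write the encoded map to its rados object here...
}

The important property is that a map found without the clean flag is never trusted.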
>> >
>> > Another potential use case would be a mode in which the index is treated as authoritative. This could make discard very fast, for example. I'm not sure it could be done safely with only binary 'exists/does not exist' information though - a third 'unknown' state might be needed for some cases. If this kind of index is actually useful (I'm not sure there are cases where the performance penalty would be worth it), we could add a new index format if we need it.
>> >
>> > Back to the currently proposed design, to be safe with live migration we'd need to make sure the index is consistent in the destination process. Using rados_notify() after we set the clean flag on the index can make the destination vm re-read the index before any I/O happens. This might be a good time to introduce a data payload to the notify as well, so we can re-read only the index, instead of all the header metadata. Rereading the index after cache invalidation and wiring that up through qemu's bdrv_invalidate() would be even better.
>> >
>> > There's more to consider in implementing this wrt snapshots, but this email has gone on long enough.
>> >
>> > Josh
>> >
>> > [1] http://pad.ceph.com/p/cdsgiant-rbd-copy-on-read-for-clones
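As an aside on the finer-grained index mentioned above, a minimal sketch of what an extent-based index could look like and the snapshot-size statistic it would enable; the types here are hypothetical, not an actual RBD on-disk format.

// Sketch only: a finer-grained index of written data kept as a sorted,
// non-overlapping list of byte ranges rather than a per-object bitmap.
#include <cstdint>
#include <vector>

struct Extent {
  uint64_t offset;  // byte offset within the image
  uint64_t length;  // bytes written starting at offset
};

// The space a snapshot or image actually occupies is just the sum of its
// extent lengths, which is what makes per-snapshot size statistics cheap.
uint64_t used_bytes(const std::vector<Extent> &index) {
  uint64_t total = 0;
  for (const Extent &e : index)
    total += e.length;
  return total;
}

// If the index were ever treated as authoritative (e.g. to short-circuit
// discard), binary exists/does-not-exist information is not enough; a third
// 'unknown' state would be needed for ranges whose status has not been
// confirmed, e.g. while the index is being rebuilt.
enum class RangeState : uint8_t { UNKNOWN = 0, NONEXISTENT = 1, EXISTS = 2 };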