One idea we've talked a fair bit about is layering RBD images.  The idea
would be to create a new image in O(1) time that mirrors an old image and
gets copy-on-write semantics, like a writeable snapshot.

We've come up with a few different approaches for doing this, each with
somewhat different performance characteristics.  The main consideration is
that RBD images do not (currently) have an "allocation table."  Image data
is simply striped over objects (that may or may not exist).  You read the
object for a given block to see if it exists; if it doesn't (a "hole"),
the content is defined to be zero-filled.

(I'll use the terms "block" and "object" interchangeably to mean the object
that stores each RBD block.  They're 4MB by default, but can be set to
any size you want at image creation time.)

1- copy-up on first write

 - reads
   - read child image object.  if it doesn't exist, read parent block.
   -> reads to unchanged data are slower
 - writes
   - write to child image block.  if it doesn't exist, OSD will return
     ENOENT.  the client would do a copy-up (copy parent block to child
     block), and then redo the write.
   -> first writes are slow, especially if the block existed in the parent.
 - trim/discard
   - truncate the child object to zero, but do not delete it (a deleted
     object would look like a hole and fall thru to the parent again).

2- sparse objects

 - make the OSDs maintain allocation metadata for each object so that we
   know which parts of the object are defined and which are holes (a
   relatively easy thing to do).
 - writes
   - write to modified region of child object.
 - reads
   - read child image object AND allocation map.  read parent object for
     any holes (or when child object doesn't exist)
   -> more efficient data transfer when objects are sparse.
   -> reads to unchanged data are slower (as above)
 - trim/discard
   - need to somehow distinguish between a hole that falls thru to the
     parent and a hole that is defined to be zero by the child image.
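
To make the two approaches concrete, here is a rough sketch in Python.  It
is a toy in-memory model, not librbd or OSD code: the Image class, the
8-byte BLOCK_SIZE, the _osd_write helper and the per-byte mask in
sparse_read are all made up for illustration.  For simplicity, an image
with no parent just creates the object on first write.

```python
BLOCK_SIZE = 8  # tiny on purpose; real RBD objects default to 4MB


class Image:
    """Toy image: block index -> object bytes, plus an optional parent."""

    def __init__(self, parent=None):
        self.objects = {}
        self.parent = parent

    def read_block(self, idx):
        obj = self.objects.get(idx)
        if obj is not None:
            # Object exists (maybe truncated by a discard): pad with zeros.
            return obj.ljust(BLOCK_SIZE, b"\0")
        if self.parent is not None:
            # Hole in the child: fall thru to the parent (the slower path).
            return self.parent.read_block(idx)
        return b"\0" * BLOCK_SIZE

    def _osd_write(self, idx, offset, data):
        # Hypothetical stand-in for an OSD write that fails with ENOENT
        # when the object does not exist yet.
        if idx not in self.objects:
            raise FileNotFoundError(idx)
        assert offset + len(data) <= BLOCK_SIZE
        buf = bytearray(self.objects[idx].ljust(BLOCK_SIZE, b"\0"))
        buf[offset:offset + len(data)] = data
        self.objects[idx] = bytes(buf)

    def write(self, idx, offset, data):
        try:
            self._osd_write(idx, offset, data)
        except FileNotFoundError:
            # Copy-up: copy the parent block into the child, redo the write.
            self.objects[idx] = self.parent.read_block(idx) if self.parent else b""
            self._osd_write(idx, offset, data)

    def discard(self, idx):
        # Truncate to zero but keep the object, so reads see zeros
        # instead of falling thru to the parent.
        self.objects[idx] = b""


def sparse_read(child_data, child_mask, parent_block):
    """Approach 2 read: defined bytes come from the child, holes from the parent.

    child_mask is an assumed per-byte allocation map (True = defined).
    """
    return bytes(c if m else p
                 for c, m, p in zip(child_data, child_mask, parent_block))


parent = Image()
parent.write(0, 0, b"AAAAAAAA")
child = Image(parent=parent)
child.write(0, 2, b"xy")   # first write triggers the copy-up
```

After the write, the child's block 0 holds the parent's data with "xy"
patched in, and the parent is untouched.  Note that sparse_read says
nothing about discard; that question is still open for approach 2.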

In both cases, we could add a(n optional) allocation bitmap to the parent
image to avoid the fall-thru for parts of the images that aren't defined
by the child image.  That could be an explicit step taken by an
administrator (e.g. after marking the parent read-only) to improve
performance for overlaid images.  (Maintaining a consistent bitmap for
all images is non-trivial, and would slow things down considerably.)

A few use cases for all of this:

 - "golden" VM images
 - writeable snapshots
 - image migration between pools
   - pause io
   - mark parent read-only
   - create "child" image
   - unpause io, redirect to the new child
     (these steps are all fast and O(1)!)
   - asynchronously copy-up parent blocks to the child (this is O(n))
   - once this is done, remove the child's parent reference and discard
     the parent

sage
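
The migration steps above can be sketched roughly like this.  Again a toy
in-memory model with made-up names (Image, start_migration,
finish_migration), not real RBD calls; pausing and redirecting io is left
as comments since that lives in the client, and the parent is assumed to
have no parent of its own.

```python
class Image:
    """Toy image: block index -> bytes, optional parent, read-only flag."""

    def __init__(self, parent=None):
        self.objects = {}
        self.parent = parent
        self.read_only = False

    def read_block(self, idx):
        if idx in self.objects:
            return self.objects[idx]
        if self.parent is not None:
            return self.parent.read_block(idx)
        return b""  # hole


def start_migration(old):
    # (pause io here)
    old.read_only = True          # mark parent read-only
    new = Image(parent=old)       # create the "child": O(1), no data copied
    # (unpause io, redirect it to `new`)
    return new


def finish_migration(new):
    old = new.parent
    # O(n) copy-up; in practice this would run asynchronously.
    for idx, data in old.objects.items():
        # Keep blocks the child has already written or discarded.
        new.objects.setdefault(idx, data)
    new.parent = None             # remove the child's parent reference
    old.objects.clear()           # discard the parent


old = Image()
old.objects = {0: b"A", 2: b"C"}
new = start_migration(old)
new.objects[2] = b"Z"             # a write that lands during the migration
before = new.read_block(0)        # still served by the parent at this point
finish_migration(new)
```

The setdefault is the important bit: a block the child already owns must
not be overwritten by the stale parent copy during the background copy-up.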