On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> One idea we've talked a fair bit about is layering RBD images. The idea
> would be to create a new image in O(1) time that mirrors an old image and
> get copy-on-write type semantics, like a writeable snapshot.
>
> We've come up with a few different approaches for doing this, each with
> somewhat different performance characteristics. The main consideration is
> that RBD images do not (currently) have an "allocation table." Image data
> is simply striped over objects (that may or may not exist). You read the
> object for a given block to see if it exists; if it doesn't (a "hole"),
> the content is defined to be zero-filled.

Have we thought about the hash-table-based approach yet, where every block
gets hashed and we only store one copy of each? I guess this is basically
how git works, except that instead of fixed-size blocks, git tracks
variable-sized blobs. It's also how ZFS dedup works.

The nice thing about the hash-table-based approach is that you don't have
to track parent-child relationships explicitly. If two users happen to
both install CentOS 5.5 with the same settings on same-sized images,
they'll be deduped automatically.

The disadvantage, of course, is that you have to hash the blocks. There's
also some tiny probability of a hash collision, which you could mitigate
with a longer hash key or with hash chaining.

The big disadvantage of the allocation-table-based approaches, at least in
my mind, is that they don't feel very block-device-y. Allocation maps are
things that normally live in a file system rather than in a block device.

If we do go with an allocation-table-based approach, what would the API
look like from the administrator's point of view? I imagine some kind of
API where I create a child RBD block device from a parent RBD device.
Then whenever I wrote to the child image, it would "re-dupe" the two
block devices. (It seems like the amount of sharing would start at 100%
and just go down from there... unless my analysis is missing something?)

Another possibility is that we could simply run qcow2 over rbd. qcow2
already implements copy-on-write at a higher level of the stack. I took a
quick look at the qcow2 image format at:
http://people.gnome.org/~markmc/qcow-image-format.html
It looks suspiciously like something I've seen before :)
http://en.wikipedia.org/wiki/Inode_pointer_structure

sincerely,
Colin
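P.S. To make the hash-table idea a bit more concrete, here's a rough
sketch of a content-addressed block store. All of the names here
(DedupBlockStore, DedupImage) are invented for illustration -- this isn't
any existing Ceph or RBD interface, just the shape of the idea:

import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # RBD's default 4MB object size

class DedupBlockStore:
    """Blocks are stored once, keyed by the hash of their contents,
    with a refcount so unused blocks can be garbage collected. A real
    system would also have to decide what to do about the (tiny)
    chance of a hash collision, as mentioned above."""

    def __init__(self):
        self.blocks = {}     # content hash -> block data
        self.refcounts = {}  # content hash -> reference count

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:
            self.blocks[key] = data
            self.refcounts[key] = 0
        self.refcounts[key] += 1
        return key

    def get(self, key):
        return self.blocks[key]

    def release(self, key):
        self.refcounts[key] -= 1
        if self.refcounts[key] == 0:
            del self.blocks[key]
            del self.refcounts[key]

class DedupImage:
    """An image is just a table of block hashes (None marks a hole).
    Two images that write identical blocks -- say, two fresh CentOS
    installs -- automatically share storage, with no explicit
    parent-child relationship."""

    def __init__(self, store, num_blocks):
        self.store = store
        self.table = [None] * num_blocks

    def write_block(self, index, data):
        old = self.table[index]
        self.table[index] = self.store.put(data)
        if old is not None:
            self.store.release(old)

    def read_block(self, index):
        if self.table[index] is None:
            return b"\0" * BLOCK_SIZE  # holes read back as zeros
        return self.store.get(self.table[index])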
> (I'll use the terms "block" and "object" interchangeably to mean the
> object that stores each RBD block. They're 4MB by default, but can be set
> to any size you want at image creation time.)
>
> 1- copy-up on first write
>  - reads
>    - read child image object. if it doesn't exist, read parent block.
>      -> reads to unchanged data are slower
>  - writes
>    - write to child image block. if it doesn't exist, OSD will return
>      ENOENT. the client would do a copy-up (copy parent block to child
>      block), and then redo the write.
>      -> first writes are slow, especially if the block existed in the
>         parent.
>  - trim/discard
>    - truncate the child object to zero, but do not delete it.
>
> 2- sparse objects
>  - make the OSDs maintain allocation metadata for each object so that we
>    know which parts of the object are defined and which are holes (a
>    relatively easy thing to do).
>  - writes
>    - write to the modified region of the child object.
>  - reads
>    - read child image object AND allocation map. read parent object for
>      any holes (or when the child object doesn't exist)
>      -> more efficient data transfer when objects are sparse.
>      -> reads to unchanged data are slower (as above)
>  - trim/discard
>    - need to somehow distinguish between a hole that falls through to
>      the parent and a hole that is defined to be zero by the child image.
>
> In both cases, we could add a(n optional) allocation bitmap to the parent
> image to avoid the fall-through for parts of the image that aren't
> defined by the child image. That could be an explicit step taken by an
> administrator (e.g. after marking the parent read-only) to improve
> performance for overlaid images. (Maintaining a consistent bitmap for all
> images is non-trivial, and would slow things down considerably.)
>
> A few use cases for all of this:
>  - "golden" VM images
>  - writeable snapshots
>  - image migration between pools
>    - pause io
>    - mark parent read-only
>    - create "child" image
>    - unpause io, redirect to the new child
>      (these steps are all fast and O(1)!)
>    - asynchronously copy up parent blocks to the child (this is O(n))
>    - once this is done, remove the child's parent reference and discard
>      the parent
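To illustrate approach 1 (and the O(n) copy-up step in the migration use
case above), here's an equally rough sketch of the client-side logic.
Again, the names here (ObjectStore, LayeredImage) are invented for
illustration, not the real librbd or OSD interfaces, and I'm checking for
the object's existence up front where the real proposal would have the
OSD return ENOENT on the write and let the client retry:

BLOCK_SIZE = 4 * 1024 * 1024

class ObjectStore:
    """Stand-in for the OSDs: a flat namespace of named objects."""

    def __init__(self):
        self.objects = {}

    def get(self, name):
        return self.objects.get(name)  # None plays the role of ENOENT

    def put(self, name, data):
        self.objects[name] = bytearray(data)

    def write(self, name, offset, data):
        obj = self.objects.setdefault(name, bytearray())
        if len(obj) < offset + len(data):
            obj.extend(b"\0" * (offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data

class LayeredImage:
    def __init__(self, store, child, parent):
        self.store = store
        self.child = child    # object-name prefix for the child image
        self.parent = parent  # prefix for the parent image, or None

    def _name(self, prefix, block):
        return "%s.%016x" % (prefix, block)

    def read_block(self, block):
        data = self.store.get(self._name(self.child, block))
        if data is None and self.parent is not None:
            # Child object doesn't exist: fall through to the parent.
            data = self.store.get(self._name(self.parent, block))
        return data if data is not None else b"\0" * BLOCK_SIZE

    def write(self, block, offset, data):
        name = self._name(self.child, block)
        if self.store.get(name) is None and self.parent is not None:
            # First write to this block: copy the parent's block up into
            # the child, then redo the write against the copy.
            self.store.put(name, self.read_block(block))
        self.store.write(name, offset, data)

    def copy_up_all(self, num_blocks):
        # The asynchronous O(n) migration step: copy every parent block
        # the child hasn't overwritten, then drop the parent reference.
        for b in range(num_blocks):
            if self.store.get(self._name(self.child, b)) is not None:
                continue  # child already has its own copy
            data = self.store.get(self._name(self.parent, b))
            if data is not None:  # skip holes in the parent
                self.store.put(self._name(self.child, b), data)
        self.parent = None  # the child is now self-contained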