On Tue, Feb 1, 2011 at 11:13 PM, Colin McCabe <cmccabe@xxxxxxxxxxxxxx> wrote:
> On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> One idea we've talked a fair bit about is layering RBD images. The idea
>> would be to create a new image in O(1) time that mirrors an old image and
>> get copy-on-write type semantics, like a writeable snapshot.
>>
>> We've come up with a few different approaches for doing this, each with
>> somewhat different performance characteristics. The main consideration is
>> that RBD images do not (currently) have an "allocation table." Image data
>> is simply striped over objects (that may or may not exist). You read the
>> object for a given block to see if it exists; if it doesn't (a "hole"),
>> the content is defined to be zero-filled.
>
> Have we thought about the hash table based approach yet? Where every
> block gets hashed and we only store one copy for each? I guess this is
> basically how git works, except instead of fixed-size blocks, it
> tracks variable-sized blobs. This is also how ZFS dedupe works.
>
> The nice thing about the hash table based approach is that you don't
> have to track parent-child relationships explicitly. If two users
> happen to both install CentOS 5.5 with the same settings on the
> same-sized image, they'll both be deduped automatically.

How would you place the blocks in a CAS-based block device like this?

An allocation table might feel ugly, but when you're doing cluster-wide
block sharing you're going to need the extra metadata somewhere. Better
to store an allocation table than to try to maintain the coherency
required for dynamic de-dup like that.

I guess I should say that de-dup would be a nice feature to support,
but I don't think it's appropriate to implement as part of RBD.
Anything that powerful needs to be a core RADOS feature.
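
To make the metadata point concrete, here is a toy, single-process sketch of a
hash-based block store. It is only an illustration under stated assumptions,
not RBD or RADOS code: the class name DedupStore, the 4096-byte BLOCK_SIZE, and
the use of SHA-256 are all hypothetical choices. Note that once blocks are
keyed by content hash, each image still needs a per-image table mapping block
index to hash (an allocation table in all but name), plus reference counts
that every writer has to keep coherent.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this sketch


class DedupStore:
    """Toy content-addressed block store shared by several images."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes (one copy per unique block)
        self.refs = {}    # digest -> number of table entries pointing at it
        self.tables = {}  # image name -> {block index -> digest}

    def write(self, image, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        digest = hashlib.sha256(data).hexdigest()
        table = self.tables.setdefault(image, {})
        old = table.get(index)
        if old == digest:
            return  # same content already mapped at this index
        # Take the new reference before dropping the old one.
        self.blocks.setdefault(digest, data)
        self.refs[digest] = self.refs.get(digest, 0) + 1
        table[index] = digest
        if old is not None:
            self._unref(old)

    def _unref(self, digest):
        self.refs[digest] -= 1
        if self.refs[digest] == 0:
            # Last user is gone, so the stored block can be reclaimed.
            del self.refs[digest]
            del self.blocks[digest]

    def read(self, image, index):
        digest = self.tables.get(image, {}).get(index)
        if digest is None:
            return bytes(BLOCK_SIZE)  # a "hole" reads back as zeros
        return self.blocks[digest]
```

In a single process the table and refcount updates are trivially consistent.
Across a cluster, each of those dictionary updates becomes shared state that
must stay coherent between clients, which is the cost being weighed above.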