On Tue, Feb 1, 2011 at 11:24 PM, Gregory Farnum <gregf@xxxxxxxxxxxxxxx> wrote:
> On Tue, Feb 1, 2011 at 11:13 PM, Colin McCabe <cmccabe@xxxxxxxxxxxxxx> wrote:
>> On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> One idea we've talked a fair bit about is layering RBD images. The idea
>>> would be to create a new image in O(1) time that mirrors an old image and
>>> get copy-on-write type semantics, like a writeable snapshot.
>>>
>>> We've come up with a few different approaches for doing this, each with
>>> somewhat different performance characteristics. The main consideration is
>>> that RBD images do not (currently) have an "allocation table." Image data
>>> is simply striped over objects (that may or may not exist). You read the
>>> object for a given block to see if it exists; if it doesn't (a "hole"),
>>> the content is defined to be zero-filled.
>>
>> Have we thought about the hash table based approach yet? Where every
>> block gets hashed and we only store one copy for each? I guess this is
>> basically how git works, except instead of fixed-size blocks, it
>> tracks variable-sized blobs. This is also how ZFS dedupe works.
>>
>> The nice thing about the hash table based approach is that you don't
>> have to track parent-child relationships explicitly. If two users
>> happen to both install CentOS 5.5 with the same settings on the
>> same-sized image, they'll both be deduped automatically.
>
> How would you place the blocks in a CAS-based block device like this?
> An allocation table might feel ugly, but when you're doing
> cluster-wide block sharing you're going to need the extra metadata
> somewhere. Better to store an allocation table than try and maintain
> the coherency required for dynamic de-dup like that.

You could chunk the hash table over several OSDs. Then you only need
to worry about doing atomic operations on a given hash table entry,
which will of course be protected by a single PG lock.

Yehuda is probably right though...
it's not 100% clear that the benefits outweigh the disadvantages,
given that it would need an extra lookup for every operation. In the
end it's something that probably will take some experimentation to get
right.

Colin
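
As a rough illustration of the chunked hash table idea discussed above, here is a minimal Python sketch. It is not Ceph code: ShardedBlockStore, Image, BLOCK_SIZE, the shard count, the use of SHA-256, and the per-shard threading.Lock standing in for a PG lock are all assumptions made up for the example. It only shows the shape of the scheme: entries are placed by hash, each entry is updated under one shard's lock, and every read pays an extra lookup through the image's block map.

```python
import hashlib
import threading

BLOCK_SIZE = 4096  # assumed block size, purely illustrative


class ShardedBlockStore:
    """Toy content-addressed store; each shard stands in for one OSD/PG."""

    def __init__(self, num_shards=4):
        # One dict and one lock per shard; the lock is a stand-in for a PG lock.
        self.shards = [{} for _ in range(num_shards)]
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def _shard_for(self, digest):
        # Place an entry by its hash, so placement needs no central table.
        return int(digest[:8], 16) % len(self.shards)

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        idx = self._shard_for(digest)
        with self.locks[idx]:  # atomic update of a single hash table entry
            entry = self.shards[idx].get(digest)
            if entry is None:
                self.shards[idx][digest] = [data, 1]  # [content, refcount]
            else:
                entry[1] += 1  # duplicate block: only bump the refcount
        return digest

    def get(self, digest):
        idx = self._shard_for(digest)
        with self.locks[idx]:
            return self.shards[idx][digest][0]

    def release(self, digest):
        idx = self._shard_for(digest)
        with self.locks[idx]:
            entry = self.shards[idx][digest]
            entry[1] -= 1
            if entry[1] == 0:
                del self.shards[idx][digest]  # last reference is gone

    def stored_blocks(self):
        return sum(len(shard) for shard in self.shards)


class Image:
    """Toy image: maps block numbers to content hashes."""

    def __init__(self, store):
        self.store = store
        self.block_map = {}  # block number -> digest; missing means a hole

    def write_block(self, blockno, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        new_digest = self.store.put(data)
        old_digest = self.block_map.get(blockno)
        self.block_map[blockno] = new_digest
        if old_digest is not None:
            # Release after the put so rewriting identical data never
            # drops the refcount to zero in between.
            self.store.release(old_digest)

    def read_block(self, blockno):
        digest = self.block_map.get(blockno)
        if digest is None:
            return bytes(BLOCK_SIZE)  # holes read back as zeros
        return self.store.get(digest)  # the extra lookup on every read
```

Two Image objects that write the same block contents to a shared ShardedBlockStore end up referencing a single stored copy, with no parent-child relationship tracked between them. The cost shows up in read_block, which has to consult the block map before it can fetch any data.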