Re: rbd layering

Gregory Farnum <gregf@xxxxxxxxxxxxxxx> · Tue, 1 Feb 2011 23:24:51 -0800



On Tue, Feb 1, 2011 at 11:13 PM, Colin McCabe <cmccabe@xxxxxxxxxxxxxx> wrote:
> On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> One idea we've talked a fair bit about is layering RBD images.  The idea
>> would be to create a new image in O(1) time that mirrors an old image and
>> get copy-on-write type semantics, like a writeable snapshot.
>>
>> We've come up with a few different approaches for doing this, each with
>> somewhat different performance characteristics.  The main consideration is
>> that RBD images do not (currently) have an "allocation table."  Image data
>> is simply striped over objects (that may or may not exist).  You read the
>> object for a given block to see if it exists; if it doesn't (a "hole"),
>> the content is defined to be zero-filled.
>
> Have we thought about the hash table based approach yet? Where every
> block gets hashed and we only store one copy for each? I guess this is
> basically how git works, except instead of fixed-size blocks, it
> tracks variable-sized blobs. This is also how ZFS dedupe works.
>
> The nice thing about the hash table based approach is that you don't
> have to track parent-child relationships explicitly. If two users
> happen to both install CentOS 5.5 with the same settings on images of the
> same size, they'll both be deduped automatically.
How would you place the blocks in a CAS-based block device like this?
An allocation table might feel ugly, but when you're doing
cluster-wide block sharing you're going to need the extra metadata
somewhere. Better to store an allocation table than to try to maintain
the coherency required for dynamic de-dup like that.
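
To make the metadata point concrete, here is a rough single-process sketch
of a content-addressed block store. Everything in it is hypothetical and
made up for illustration (the CasBlockStore class, the 4 KiB BLOCK_SIZE, the
choice of SHA-256); none of it is RBD or RADOS code. It only shows that once
blocks are stored by hash, each image still needs a table from block index
to hash, plus reference counts that every writer has to keep coherent.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this sketch


class CasBlockStore:
    """Toy content-addressed store: one stored copy per distinct block."""

    def __init__(self):
        self.blocks = {}    # digest -> block bytes, shared by all images
        self.refcount = {}  # digest -> number of image slots using it
        self.tables = {}    # image name -> {block index: digest}

    def write(self, image, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        digest = hashlib.sha256(data).hexdigest()
        table = self.tables.setdefault(image, {})
        old = table.get(index)
        if old == digest:
            return  # slot already points at identical content
        # Take a reference on the new content before dropping the old one.
        self.blocks.setdefault(digest, data)
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        table[index] = digest
        if old is not None:
            self._unref(old)

    def _unref(self, digest):
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            # Last user is gone, so the shared block can be freed.
            del self.refcount[digest]
            del self.blocks[digest]

    def read(self, image, index):
        digest = self.tables.get(image, {}).get(index)
        if digest is None:
            return bytes(BLOCK_SIZE)  # a hole reads back as zeros
        return self.blocks[digest]
```

In this toy the per-image table is exactly the allocation table, and the
refcount updates in write() are the part that would need cluster-wide
coherency if two clients wrote identical blocks at the same time.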

I guess I should say that de-dup would be a nice feature to support,
but I don't think it's appropriate to implement as part of RBD.
Anything that powerful needs to be a core RADOS feature.