On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> One idea we've talked a fair bit about is layering RBD images. The idea
> would be to create a new image in O(1) time that mirrors an old image and
> get copy-on-write type semantics, like a writeable snapshot.
>
> We've come up with a few different approaches for doing this, each with
> somewhat different performance characteristics. The main consideration is
> that RBD images do not (currently) have an "allocation table." Image data
> is simply striped over objects (that may or may not exist). You read the
> object for a given block to see if it exists; if it doesn't (a "hole"),
> the content is defined to be zero-filled.

Have we thought about the hash-table-based approach yet, where every block
gets hashed and we only store one copy of each? I guess this is basically
how git works, except that instead of fixed-size blocks, git tracks
variable-sized blobs. It's also how ZFS dedup works.

The nice thing about the hash-table-based approach is that you don't have
to track parent-child relationships explicitly. If two users happen to
both install CentOS 5.5 with the same settings on same-sized images,
they'll be deduped automatically.

The disadvantage, of course, is that you have to hash the blocks. There's
also some tiny probability of a hash collision, which you could mitigate
with a longer hash key or with hash chaining.

The big disadvantage of the allocation-table-based approaches, at least in
my mind, is that they don't feel very block-device-y. Allocation maps are
things that normally live in a file system rather than in a block device.

If we do go with an allocation-table-based approach, what would the API
look like from the administrator's point of view? I imagine some kind of
API where I create a child RBD block device from a parent RBD device.
Then whenever I wrote to the child image, it would "re-dupe" the two
block devices. (It seems like the amount of sharing would start at 100%
and just go down from there... unless my analysis is missing something?)

Another possibility is that we could simply run qcow2 over rbd. qcow2
already implements copy-on-write at a higher level of the stack. I took a
quick look at the qcow2 image format at:
http://people.gnome.org/~markmc/qcow-image-format.html
It looks suspiciously like something I've seen before :)
http://en.wikipedia.org/wiki/Inode_pointer_structure

sincerely,
Colin
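P.S. To make the hash-table idea a bit more concrete, here's a rough
sketch of a content-addressed block store. All of the names here
(DedupBlockStore, DedupImage) are invented for illustration -- this isn't
any existing Ceph or RBD interface, just the shape of the idea:

import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # RBD's default 4MB object size

class DedupBlockStore:
    """Blocks are stored once, keyed by the hash of their contents,
    with a refcount so unused blocks can be garbage collected. A real
    system would also have to decide what to do about the (tiny)
    chance of a hash collision, as mentioned above."""

    def __init__(self):
        self.blocks = {}     # content hash -> block data
        self.refcounts = {}  # content hash -> reference count

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:
            self.blocks[key] = data
            self.refcounts[key] = 0
        self.refcounts[key] += 1
        return key

    def get(self, key):
        return self.blocks[key]

    def release(self, key):
        self.refcounts[key] -= 1
        if self.refcounts[key] == 0:
            del self.blocks[key]
            del self.refcounts[key]

class DedupImage:
    """An image is just a table of block hashes (None marks a hole).
    Two images that write identical blocks -- say, two fresh CentOS
    installs -- automatically share storage, with no explicit
    parent-child relationship."""

    def __init__(self, store, num_blocks):
        self.store = store
        self.table = [None] * num_blocks

    def write_block(self, index, data):
        old = self.table[index]
        self.table[index] = self.store.put(data)
        if old is not None:
            self.store.release(old)

    def read_block(self, index):
        if self.table[index] is None:
            return b"\0" * BLOCK_SIZE  # holes read back as zeros
        return self.store.get(self.table[index])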
> (I'll use the terms "block" and "object" interchangeably to mean the
> object that stores each RBD block. They're 4MB by default, but can be set
> to any size you want at image creation time.)
>
> 1- copy-up on first write
>  - reads
>    - read child image object. if it doesn't exist, read parent block.
>      -> reads to unchanged data are slower
>  - writes
>    - write to child image block. if it doesn't exist, OSD will return
>      ENOENT. the client would do a copy-up (copy parent block to child
>      block), and then redo the write.
>      -> first writes are slow, especially if the block existed in the
>         parent.
>  - trim/discard
>    - truncate the child object to zero, but do not delete it.
>
> 2- sparse objects
>  - make the OSDs maintain allocation metadata for each object so that we
>    know which parts of the object are defined and which are holes (a
>    relatively easy thing to do).
>  - writes
>    - write to the modified region of the child object.
>  - reads
>    - read child image object AND allocation map. read parent object for
>      any holes (or when the child object doesn't exist)
>      -> more efficient data transfer when objects are sparse.
>      -> reads to unchanged data are slower (as above)
>  - trim/discard
>    - need to somehow distinguish between a hole that falls through to
>      the parent and a hole that is defined to be zero by the child image.
>
> In both cases, we could add a(n optional) allocation bitmap to the parent
> image to avoid the fall-through for parts of the image that aren't
> defined by the child image. That could be an explicit step taken by an
> administrator (e.g. after marking the parent read-only) to improve
> performance for overlaid images. (Maintaining a consistent bitmap for all
> images is non-trivial, and would slow things down considerably.)
>
> A few use cases for all of this:
>  - "golden" VM images
>  - writeable snapshots
>  - image migration between pools
>    - pause io
>    - mark parent read-only
>    - create "child" image
>    - unpause io, redirect to the new child
>      (these steps are all fast and O(1)!)
>    - asynchronously copy up parent blocks to the child (this is O(n))
>    - once this is done, remove the child's parent reference and discard
>      the parent
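To illustrate approach 1 (and the O(n) copy-up step in the migration use
case above), here's an equally rough sketch of the client-side logic.
Again, the names here (ObjectStore, LayeredImage) are invented for
illustration, not the real librbd or OSD interfaces, and I'm checking for
the object's existence up front where the real proposal would have the
OSD return ENOENT on the write and let the client retry:

BLOCK_SIZE = 4 * 1024 * 1024

class ObjectStore:
    """Stand-in for the OSDs: a flat namespace of named objects."""

    def __init__(self):
        self.objects = {}

    def get(self, name):
        return self.objects.get(name)  # None plays the role of ENOENT

    def put(self, name, data):
        self.objects[name] = bytearray(data)

    def write(self, name, offset, data):
        obj = self.objects.setdefault(name, bytearray())
        if len(obj) < offset + len(data):
            obj.extend(b"\0" * (offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data

class LayeredImage:
    def __init__(self, store, child, parent):
        self.store = store
        self.child = child    # object-name prefix for the child image
        self.parent = parent  # prefix for the parent image, or None

    def _name(self, prefix, block):
        return "%s.%016x" % (prefix, block)

    def read_block(self, block):
        data = self.store.get(self._name(self.child, block))
        if data is None and self.parent is not None:
            # Child object doesn't exist: fall through to the parent.
            data = self.store.get(self._name(self.parent, block))
        return data if data is not None else b"\0" * BLOCK_SIZE

    def write(self, block, offset, data):
        name = self._name(self.child, block)
        if self.store.get(name) is None and self.parent is not None:
            # First write to this block: copy the parent's block up into
            # the child, then redo the write against the copy.
            self.store.put(name, self.read_block(block))
        self.store.write(name, offset, data)

    def copy_up_all(self, num_blocks):
        # The asynchronous O(n) migration step: copy every parent block
        # the child hasn't overwritten, then drop the parent reference.
        for b in range(num_blocks):
            if self.store.get(self._name(self.child, b)) is not None:
                continue  # child already has its own copy
            data = self.store.get(self._name(self.parent, b))
            if data is not None:  # skip holes in the parent
                self.store.put(self._name(self.child, b), data)
        self.parent = None  # the child is now self-contained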