RBD layering design draft

Josh Durgin <josh.durgin@xxxxxxxxxxx> · Fri, 15 Jun 2012 13:48:59 -0700

Here's a draft of a patch to the docs outlining the rbd layering
design. Is anything unclear? Any suggestions for improvement?

Josh

============
RBD Layering
============

RBD layering refers to the creation of copy-on-write clones of block
devices. This allows for fast image creation, for example to clone a
golden master image of a virtual machine into a new instance. To
simplify the semantics, you can only create a clone of a snapshot.
Snapshots are always read-only, so the rest of the image is
unaffected, and there's no possibility of writing to them
accidentally.

Note: the terms `child` and `parent` below mean, respectively, an rbd
image created by cloning and the rbd image snapshot that child was
cloned from.

Command line interface
----------------------

Before cloning a snapshot, you must mark it as preserved, to prevent
it from being deleted while child images refer to it:
::

    $ rbd preserve pool/image@snap

Then you can perform the clone:
::

    $ rbd clone --parent pool/image@snap pool2/child1

You can create a clone with a different object size from the parent
(an order of 25 means objects of 2^25 bytes, i.e. 32 MB):
::

    $ rbd clone --parent pool/image@snap --order 25 pool2/child2

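To delete the parent snapshot, you must first mark it unpreserved,
which checks that there are no children left. In the example below,
the check fails until one child has been copied up and the other
removed: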
::

    $ rbd unpreserve pool/image@snap
    Error unpreserving: child images rely on this image
    $ rbd list_children pool/image@snap
    pool2/child1
    pool2/child2
    $ rbd copyup pool2/child1
    $ rbd rm pool2/child2
    $ rbd unpreserve pool/image@snap

Then the snapshot can be deleted like normal:
::

    $ rbd snap rm pool/image@snap

Implementation
--------------

Data Flow
^^^^^^^^^

In the initial implementation, called 'trivial layering', there will
be no tracking of which objects exist in a clone. A read that hits a
non-existent object will attempt to read from the parent object, and
this will continue recursively until an object exists or an image with
no parent is found.
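
A minimal Python sketch of this read path. The `Image` class is a
hypothetical in-memory stand-in (not the librbd API), all images are
assumed to share one object size, and returning zeros for an object
that exists nowhere in the chain is an assumption of the sketch. It
walks the parent chain with a loop rather than literal recursion:

```python
from typing import Dict, Optional


class Image:
    """Hypothetical in-memory stand-in for an rbd image (not librbd)."""

    def __init__(self, parent: Optional["Image"] = None) -> None:
        self.objects: Dict[int, bytes] = {}  # object number -> data
        self.parent = parent                 # parent snapshot, or None


def read_object(image: Image, objno: int, size: int) -> bytes:
    """Read one object, falling back to ancestors when it is missing."""
    current: Optional[Image] = image
    while current is not None:
        data = current.objects.get(objno)
        if data is not None:
            return data
        current = current.parent  # object does not exist here: go up
    # Assumption: an object no image in the chain has reads as zeros.
    return b"\x00" * size


base = Image()
base.objects[0] = b"base"
middle = Image(parent=base)
child = Image(parent=middle)
child.objects[1] = b"kid!"
```

Here a read of object 0 from `child` passes through `middle` and is
served by `base`, while object 1 is served by `child` itself.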

Before a write is performed, the object is checked for existence. If
it doesn't exist, a copy-up operation is performed, which means
reading the relevant range of data from the parent image and writing
it (plus the original write) to the child image. To prevent races with
multiple writes trying to copy-up the same object, this copy-up
operation will include an atomic create. If the atomic create fails,
the original write is done instead. This copy-up operation is
implemented as a class method so that extra metadata can be stored by
it in the future.
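
The write path above could be sketched in Python as follows. This is
an illustration under assumptions, not the cls_rbd implementation: the
`Image` class, the 8-byte `OBJECT_SIZE`, and the dictionary membership
test that stands in for the atomic create are all hypothetical:

```python
from typing import Dict, Optional

OBJECT_SIZE = 8  # tiny object size, chosen only for the example


class Image:
    """Hypothetical in-memory image; objects map number -> contents."""

    def __init__(self, parent: Optional["Image"] = None) -> None:
        self.objects: Dict[int, bytearray] = {}
        self.parent = parent

    def read_object(self, objno: int) -> bytes:
        if objno in self.objects:
            return bytes(self.objects[objno])
        if self.parent is not None:
            return self.parent.read_object(objno)
        return b"\x00" * OBJECT_SIZE


def copy_up(objects: Dict[int, bytearray], objno: int,
            parent_data: bytes, child_data: bytes, offset: int) -> None:
    """Stand-in for the class method: exclusive create, then write."""
    if objno not in objects:  # models the atomic (exclusive) create
        objects[objno] = bytearray(parent_data)
    # Whether or not the create succeeded, apply the original write.
    objects[objno][offset:offset + len(child_data)] = child_data


def write(image: Image, objno: int, offset: int, data: bytes) -> None:
    if objno in image.objects:  # object exists: plain write
        image.objects[objno][offset:offset + len(data)] = data
        return
    if image.parent is not None:
        parent_data = image.parent.read_object(objno)
    else:
        parent_data = b"\x00" * OBJECT_SIZE
    copy_up(image.objects, objno, parent_data, data, offset)


parent = Image()
parent.objects[0] = bytearray(b"AAAAAAAA")
child = Image(parent=parent)
write(child, 0, 2, b"zz")                       # triggers a copy-up
# A second writer that lost the race still calls copy_up; the create
# fails, so only its own data is written and "zz" is not overwritten.
copy_up(child.objects, 0, b"AAAAAAAA", b"yy", 6)
```

The parent object is never modified; only the child gains an object.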

A future optimization could be storing a bitmap of which objects
actually exist in a child. This would obviate the check for existence
before each write, and let reads go directly to the parent if needed.
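
One way such a bitmap might look, as a rough Python sketch. The
`ObjectBitmap` class and both helper functions are hypothetical names
introduced for illustration; the draft does not specify a format:

```python
class ObjectBitmap:
    """Hypothetical bitmap: bit n is set when object n exists in the child."""

    def __init__(self, num_objects: int) -> None:
        self.num_objects = num_objects
        self.bits = 0

    def set(self, objno: int) -> None:
        if not 0 <= objno < self.num_objects:
            raise IndexError(objno)
        self.bits |= 1 << objno

    def test(self, objno: int) -> bool:
        return (self.bits >> objno) & 1 == 1


def read_source(bitmap: ObjectBitmap, objno: int) -> str:
    """Decide where a read goes without probing the child object."""
    return "child" if bitmap.test(objno) else "parent"


def write_needs_copy_up(bitmap: ObjectBitmap, objno: int) -> bool:
    """A write needs a copy-up only if the child lacks the object."""
    return not bitmap.test(objno)


bitmap = ObjectBitmap(8)
bitmap.set(3)  # as if object 3 had already been copied up
```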

Parent/Child relationships
^^^^^^^^^^^^^^^^^^^^^^^^^^

Children store a reference to their parent in their header, as a tuple
of (pool id, image id, snapshot id). This is enough information to
open the parent and read from it.

In addition to knowing which parent a given image has, we want to be
able to tell if a preserved image still has children. This is
accomplished with a new per-pool object, `rbd_children`, which maps
(parent pool id, parent image id, parent snapshot id) to a list of
child image ids. The `rbd_children` object is stored in the same pool
as the child image because the client creating a clone already has
read/write access to everything in that pool. This lets a client with
read-only access to one pool clone a snapshot from that pool into a
pool they have full access to. The cost is that unpreserving an image
becomes more expensive, since it needs to check for children in every
pool, but this is a rare operation. It would likely only be done
before removing old images, which is already much more expensive
because it involves deleting every data object in the image.
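
The mapping and the cross-pool check can be sketched in Python. The
`Pool` class and the `find_children` helper are hypothetical and exist
only for illustration; `add_child` and `remove_child` loosely mirror
the cls_rbd methods proposed later in this draft:

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (parent pool id, parent image id, parent snapshot id)
ParentSpec = Tuple[int, str, int]


class Pool:
    """Hypothetical pool holding one rbd_children mapping."""

    def __init__(self, pool_id: int) -> None:
        self.pool_id = pool_id
        self.rbd_children: Dict[ParentSpec, Set[str]] = defaultdict(set)


def add_child(child_pool: Pool, parent: ParentSpec, image_id: str) -> None:
    # Recorded in the child's pool, where the cloning client can write.
    child_pool.rbd_children[parent].add(image_id)


def remove_child(child_pool: Pool, parent: ParentSpec, image_id: str) -> None:
    children = child_pool.rbd_children.get(parent)
    if children is not None:
        children.discard(image_id)
        if not children:
            del child_pool.rbd_children[parent]


def find_children(pools: List[Pool], parent: ParentSpec) -> List[Tuple[int, str]]:
    """Check every pool, as unpreserving a snapshot would need to."""
    found = []
    for pool in pools:
        for image_id in sorted(pool.rbd_children.get(parent, ())):
            found.append((pool.pool_id, image_id))
    return found


pool1, pool2 = Pool(1), Pool(2)
parent = (1, "img1", 4)
add_child(pool2, parent, "child1")
add_child(pool2, parent, "child2")
```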

Preservation
^^^^^^^^^^^^

Internally, preservation_state is a field in the header object that
can be in one of three states: "preserved", "unpreserved", and
"unpreserving". The first two are set as the result of "rbd
preserve/unpreserve". The "unpreserving" state is set while the "rbd
unpreserve" command checks for any child images. Only snapshots in the
"preserved" state may be cloned, so the "unpreserving" state prevents
a race like:

1. A: walk through all pools, look for clones, find none
2. B: create a clone
3. A: unpreserve parent
4. A: rbd snap rm pool/parent@snap

With the "unpreserving" state set before step 1, the clone attempt in
step 2 is rejected because the snapshot is no longer in the
"preserved" state.
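
The three states can be illustrated with a small Python sketch. The
`Snapshot` class and the split into `begin_unpreserve` and
`finish_unpreserve` are hypothetical, and returning to "preserved"
when children are found is an assumption the draft does not spell
out:

```python
PRESERVED = "preserved"
UNPRESERVED = "unpreserved"
UNPRESERVING = "unpreserving"


class Snapshot:
    """Hypothetical stand-in for a parent snapshot and its children."""

    def __init__(self) -> None:
        self.state = UNPRESERVED
        self.children = set()


def clone(snap: Snapshot, child_id: str) -> None:
    # Only snapshots in the "preserved" state may be cloned.
    if snap.state != PRESERVED:
        raise RuntimeError("cannot clone: snapshot is " + snap.state)
    snap.children.add(child_id)


def begin_unpreserve(snap: Snapshot) -> None:
    if snap.state != PRESERVED:
        raise RuntimeError("snapshot is not preserved")
    snap.state = UNPRESERVING  # set before looking for children


def finish_unpreserve(snap: Snapshot) -> bool:
    if snap.children:
        snap.state = PRESERVED  # assumption: fall back to "preserved"
        return False
    snap.state = UNPRESERVED
    return True


snap = Snapshot()
snap.state = PRESERVED       # as if "rbd preserve" had been run
begin_unpreserve(snap)       # A starts checking for children
try:
    clone(snap, "child1")    # B tries to clone during the check
    clone_refused = False
except RuntimeError:
    clone_refused = True
unpreserved = finish_unpreserve(snap)
```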

Resizing
^^^^^^^^

To support resizing of layered images, we need to keep track of the
minimum size the image ever was, so that if a child image is shrunk
and then expanded, the re-expanded space is treated as unused instead
of being read from the parent image. Since the minimum size can change
over time, it must also be stored for each snapshot.
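
A byte-granular Python sketch of how min_size limits reads from the
parent. The `Child` class is hypothetical, and real images work on
objects rather than single bytes; the point is only that a range
shrunk away and later re-added reads as zeros:

```python
from typing import Dict


class Child:
    """Hypothetical child image that tracks size and min_size."""

    def __init__(self, parent_data: bytes) -> None:
        self.parent_data = parent_data
        self.size = len(parent_data)
        self.min_size = self.size       # smallest size since cloning
        self.own: Dict[int, int] = {}   # offset -> byte written in child

    def resize(self, new_size: int) -> None:
        self.size = new_size
        self.min_size = min(self.min_size, new_size)
        # Data past the new end is gone for good.
        self.own = {o: v for o, v in self.own.items() if o < new_size}

    def read_byte(self, offset: int) -> int:
        if not 0 <= offset < self.size:
            raise IndexError(offset)
        if offset in self.own:
            return self.own[offset]
        # Only the range below min_size may still come from the parent.
        if offset < self.min_size and offset < len(self.parent_data):
            return self.parent_data[offset]
        return 0


image = Child(b"abcdef")
before = image.read_byte(4)  # served by the parent
image.resize(3)              # shrink ...
image.resize(6)              # ... then expand again
after = image.read_byte(4)   # re-expanded space is unused
```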

Renaming
^^^^^^^^

Currently the rbd header object (that stores all the metadata about an
image) is named after the name of the image. This makes renaming
disrupt clients who have the image open (such as children reading from
a parent image). To avoid this, we can name the header object by the
id of the image, which does not change. That is, the name of the
header object could be `rbd_header.$id`, where $id is a unique id for
the image in the pool.

When a client opens an image, all it knows is the name. There is
already a per-pool `rbd_directory` object that maps image names to
ids, but if we relied on it to get the id, we could not open any
images in that pool if that single object was unavailable. To avoid
this dependency, we can store the id of an image in an object called
`rbd_id.$image_name`, where $image_name is the name of the image. The
per-pool `rbd_directory` object is still useful for listing all images
in a pool, however.
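
The naming scheme can be sketched in Python with a dictionary standing
in for a pool's objects. The helper functions and the example id
"abc123" are hypothetical; only the object names `rbd_id.$image_name`,
`rbd_header.$id`, and `rbd_directory` come from the text above:

```python
from typing import Any, Dict

Pool = Dict[str, Any]  # object name -> contents


def create_image(pool: Pool, name: str, image_id: str) -> None:
    id_obj = "rbd_id." + name
    if id_obj in pool:
        raise FileExistsError(name)
    pool[id_obj] = image_id
    pool["rbd_header." + image_id] = {}  # header named by id, not name
    pool.setdefault("rbd_directory", {})[name] = image_id


def header_object(pool: Pool, name: str) -> str:
    """Resolve a name to its header object without rbd_directory."""
    image_id = pool["rbd_id." + name]
    return "rbd_header." + image_id


def rename(pool: Pool, old: str, new: str) -> None:
    if "rbd_id." + new in pool:
        raise FileExistsError(new)
    image_id = pool.pop("rbd_id." + old)
    pool["rbd_id." + new] = image_id
    directory = pool.get("rbd_directory", {})
    directory.pop(old, None)
    directory[new] = image_id
    # The header object is untouched, so open clients are not disrupted.


pool: Pool = {}
create_image(pool, "vm1", "abc123")
header_before = header_object(pool, "vm1")
rename(pool, "vm1", "vm2")
header_after = header_object(pool, "vm2")
```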

Header changes
--------------

The header needs a few new fields:

* uint64_t parent_pool_id
* string parent_image_id
* uint64_t parent_snap_id
* uint64_t min_size (smallest size the image ever was in bytes)
* bool has_parent

Note that all the image ids are strings instead of uint64_t to let us
easily switch to uuids in the future.

cls_rbd
^^^^^^^

Some new methods are needed:
::

    /***************** methods on the rbd header *********************/
    /**
     * Sets the parent, min_size, and has_parent keys.
     * Fails if any of these keys exist, since the image already
     * had a parent.
     */
    set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)

    /**
     * Returns the parent pool id, image id, and snap id, or -ENOENT
     * if has_parent is false
     */
    get_parent(uint64_t snapid)

    /**
     * Set has_parent to false.
     */
    remove_parent() // after all parent data is copied to the child

    /*************** methods on the rbd_children object *****************/

    add_child(uint64_t parent_pool_id, string parent_image_id,
              uint64_t parent_snap_id, string image_id);
    remove_child(uint64_t parent_pool_id, string parent_image_id,
                 uint64_t parent_snap_id, string image_id);
    /**
     * List image ids of a given parent
     */
    get_children(uint64_t parent_pool_id, string parent_image_id,
                 uint64_t parent_snap_id, uint64_t max_return,
                 string start);
    /**
     * List parent images
     */
    get_parents(uint64_t max_return, uint64_t start_pool_id,
                string start_image_id, uint64_t start_snap_id);

    /************ methods on the rbd_id.$image_name object **************/
    /**
     * Create the object and set the id. Fail and return -EEXIST if
     * the object exists.
     */
    create_id(string id)
    get_id()

    /***************** methods on the rbd_data objects ******************/
    /**
     * Create an object with parent_data as its contents,
     * then write child_data to it. If the exclusive create fails,
     * just write the child_data.
     */
    copy_up(char *parent_data, uint64_t parent_data_len,
            char *child_data, uint64_t child_data_offset,
            uint64_t child_data_length)

One existing method will change if the image supports
layering:
::

    snapshot_add - stores current min_size and has_parent with
                   other snapshot metadata (images that don't have
                   layering enabled aren't affected)

librbd
^^^^^^

Opening a child image opens its parent (and this will continue
recursively as needed). This means that an ImageCtx will contain a
pointer to the parent image context. Differing object sizes won't
matter, since reading from the parent will go through the parent
image context.

Discard will need to change for layered images so that it only
truncates objects, and does not remove them. If we removed objects, we
could not tell if we needed to read them from the parent.
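
A small Python sketch of why truncation matters. The two functions are
hypothetical and model objects as dictionary entries; objects the
child never wrote are left alone here, since the draft does not say
how discard handles them:

```python
from typing import Dict, Optional


def discard_object(objects: Dict[int, bytes], objno: int,
                   has_parent: bool) -> None:
    if objno not in objects:
        return  # not covered by this sketch
    if has_parent:
        objects[objno] = b""  # truncate: the object still exists
    else:
        del objects[objno]    # no parent: removing is safe


def read_object(objects: Dict[int, bytes],
                parent_objects: Optional[Dict[int, bytes]],
                objno: int, size: int) -> bytes:
    if objno in objects:
        # A truncated object reads as zeros and hides the parent.
        return objects[objno].ljust(size, b"\x00")
    if parent_objects is not None and objno in parent_objects:
        return parent_objects[objno].ljust(size, b"\x00")
    return b"\x00" * size


parent_objects = {0: b"PPPP"}
child_objects = {0: b"CCCC"}
discard_object(child_objects, 0, has_parent=True)
after_discard = read_object(child_objects, parent_objects, 0, 4)
```

Had the object been removed instead, the same read would have fallen
through to the parent and returned its old data.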

A new clone method will be added, which takes the same arguments as
create except size (size of the parent image is used).