Re: RBD format changes and layering

Yehuda Sadeh <yehuda@xxxxxxxxxxx> · Thu, 24 May 2012 16:39:24 -0700

On Thu, May 24, 2012 at 4:05 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
> RBD object format changes
> =========================
>
> To enable us to add more features to rbd, including copy-on-write
> cloning via layering, we need to change to rbd header object
> format. Since this won't be backwards compatible, the old format will
> still be used by default. Once layering is implemented, the old format
> will be deprecated, but still usable with an extra option (something
> like rbd create --legacy ...). Clients will still be able to read the
> old format, and images can be converted by exporting and importing them.
>
> While we're making these changes, we can clean up the way librbd and
> the rbd kernel module access the header, so that they don't have to
> change each time we change the header format. Instead of reading the
> header directly, they can use the OSD class mechanism to interact with
> it. librbd already does this for snapshots, but kernel rbd reads the
> entire header directly. Making them both use a well-defined api will
> make later format additions much simpler. I'll describe the changes
> needed in general, and then those that are needed for rbd layering.
>
> New format, pre-layering
> ========================
>
> Right now the header object is name $image_name.rbd, and the data
> objects are named rb.$image_id_lowbits.$image_id_highbits.$object_number.
> Since we're making other incompatible changes, we have a chance to
> rename these to be less likely to collide with other objects. Prefixing
> them with a more specific string will help, and will work well with
> a new security feature for layering discussed later. The new
> names are:
>
> rbd_header.$image_name
> rbd_data.$id.$object_number
>
> The new header will have the existing (used) fields of the old format as
> key/value pairs in an omap (this is the rados interface that stores
> key/value pairs in leveldb). Specifically, the existing fields are:
>
>  * object_prefix // previously known as block_name
>  * order         // bit shift to determine size of the data objects
>  * size          // total size of the image in bytes
>  * snap_seq      // latest snapshot id used with the image
>  * snapshots     // list of (snap_name, snap_id, image_size) tuples
>
> To make adding new things easier, there will be an additional
> 'features' field, which is a mask of the features used by the image.
> Clients will know whether they can use an image by checking if they
> support all the features the image uses that the osd reports as being
> incompatible (see get_info() below).
>
> RBD class interface
> ===================
>
> Here's a proposed basic interface - new features will
> add more functions and data to existing ones.
>
> /**
>  * Initialize the header with basic metadata.
>  * Extra features may initialize more fields in the future.
>  * Everything is stored as key/value pairs as omaps in the header object.
>  *
>  * If features the OSD does not understand are requested, -ENOSYS is
>  * returned.
>  */
> create(__le64 size, __le32 order, __le64 features)
>
> /**
>  * Get the metadata about the image required to do I/O
>  * to it. In the future this may include extra information for
>  * features that require it, like encryption/compression type.
>  * This extra data will be added at the end of the response, so
>  * clients that don't support it don't interpret it.
>  *
>  * Features that would require clients to be updated to access
>  * the image correctly (such as image bitmaps) are set in
>  * the incompat_features field. A client that doesn't understand
>  * those features will return an error when they try to open
>  * the image.
>  *
>  * The size and any extra information is read from the appropriate
>  * snapshot metadata, if snapid is not CEPH_NOSNAP.
>  *
>  * Returns __le64 size, __le64 order, __le64 features,
>  *         __le64 incompat_features, __le64 snapseq and
>  *         list of __le64 snapids
>  */
> get_info(__le64 snapid)
>
> /**
>  * Used when resizing the image. Sets the size in bytes.
>  */
> set_size(__le64 size)
>
> /**
>  * The same as the existing snap_add/snap_remove methods, but using the
>  * new format.
>  */
> snapshot_add(string snap_name, __le64 snap_id)
> snapshot_remove(string snap_name)
>
> /**
>  * list snapshots - like the existing snap_list, but
>  * can return a subset of them.
>  *
>  * Returns __le64 snap_seq, __le64 snap_count, and a list of tuples
>  * (snap_id, snap_size) just like the current snap_list
>  */
> snapshot_list(__le64 max_len)
>
> /**
>  * The same as the existing method. Should only be called
>  * on the rbd_info object.
>  * Returns an id number to use for a new image.
>  */
> assign_bid()
>
>
> RBD layering
> ============
>
> The first step is to implement trivial layering, i.e.
> layering without bitmaps, as described at:
>
> http://marc.info/?l=ceph-devel&m=129867273303846&w=2
>
> There are a couple of things that complicate the implementation:
>
> 1) making sure parent images are not deleted when children still
>   refer to them
>
> A simple way to solve this is to add a reference count to the parent
> image. This can cause issues with partially deleted images, if the
> reference count is decremented more than once because the child
> image's header was only deleted the second time 'rbd rm' was run.
>
> To prevent this, a full list of children can be used. When an image is
> cloned, the new image is added to the list of children. When a child is
> deleted, it is removed from the list. Keeping this all in the parent
> image's header leads to the second issue:

I don't like the idea of writing anything to the parent. We can have
an orthogonal directory in a different location.

>
> 2) cloning an image into a different pool without giving the cloner
>   write access to the parent image's pool
>
> The current capabilities implemented with cephx only allow you to
> restrict users to reading, writing or executing class methods on a
> per-pool basis.
>
> For the child image in rbd, we need to be able to read the data
> objects of the parent image, but only interact with the parent image
> header through certain class methods, namely add_child and
> remove_child during cloning and deletion.
>
> One way to do this is adding a whitelist of class methods to the
> capabilities system, but this would be hard to manage as more class
> methods are added. A more manageable way is to give classes some
> string they can interpret as permissions however they wish. Combined
> with allowing clients to access objects matching certain prefixes,
> this can restrict access to the image header to going through the rbd
> class, but still allow allow read-only access to the data objects.

That sounds useful feature, however, it also sounds like a much bigger
hammer than you need for that specific problem.
As I said, I don't think we should modify the parent. You can keep
that list on a different object that relates to the parent (e.g.,
rbd_ref.$image_name), or in a central place. You can put that object
in a different pool, where the client has write permissions.

>
> If we change the names of the rbd header and data objects to start
> with rbd_header and rbd_data, respectively, we have something like:
>
> allow prefix rbd_header class rbd image-child pool=templates
> allow prefix rbd_data r pool=templates
>
> where 'image-child' is interpreted by the rbd class to mean 'only
> allow adding or removing a child'.
>
> The problem with this is that the restricted client can still remove
> any child, not just images it has access to. To get around this, we
> can give each image a randomly generated uuid, and store that in the
> child header and the parent's list of children. Then when someone
> calls remove_child, they must pass the uuid in addition to their pool,
> name, and snapshot, and it will only be processed if it matches the
> uuid in the parent header.
>
> One thing that's not addressed in the earlier design is how to make
> images read-only. The simplest way would be to only support layering
> on top of snapshots, which are read-only by definition.
>
> Another way would be to allow images to be set read-only or
> read-write, and disallow setting images with children read-write. Are
> there many use cases that would justify this second, more complicated
> way?

I think that explicitly setting a read-only flag on the parent (if not
set) is enough. No need to do other magic (see my above comments).

>
> Copy-up
> =======
>
> Another feature we want to include with layering is the ability to
> copy all remaining data from the parent image to the child image, to
> break the dependency of the latter on the former. This does not change
> snapshots that were taken earlier though - they still rely on the
> parent image. Thus, the children of a parent image will need to
> include snapshots as well, and the reference to the parent image will
> be needed to interact with snapshots. Thus, we can't just remove the
> information pointing the parent. Instead, we can add a boolean
> has_parent field that is stored in the header and with each snapshot,
> since some snapshots may be taken when the parent was still used, and
> some after all the data has been copied to the child.

To generalize: a snapshot context would contain the source rbd image.

>
> Renaming
> ========
>
> In order to support renaming layered images, we can use the id
> assigned to each image in place of the name. We just need to store a
> mapping from ids to names in each pool. Eventually this can replace
> rbd_directory, when we stop supporting the old format. This can't
> happen right now because clients assume rbd_directory is a tmap.
>
> Thus, the parent and child image lists would contain (pool name, image
> id, snapshot name) tuples. Pools and snapshots can't be renamed, so
> they don't have this problem. Image ids are unique within a pool, so
> (pool name, image id) uniquely identifies an image.
>
> Resizing
> ========
>
> To support resizing of layered images, we need to keep track of the
> minimum size the image ever was, so that if a child image is shrunk
> and then expanded, the re-expanded space is treated as unused instead
> of being read from the parent image. Since this can change over time,
> we need to store this for each snapshot as well.
>
> In summary, the format changes specific to adding layering are:
>
> New object
> ==========
>
> rbd_images_names // stores a mapping from image ids to image names
>
> New header fields
> =================
>
> * parent_pool, parent_image_id, parent_snapshot
> * uuid
> * children - tuples of (pool, image_id, snapshot)
> * min_size
> * has_parent
> * new fields in snapshots:
>  - min_size
>  - has_parent
>
> New rbd class methods
> =====================
>
> /**
>  * Sets the parent, min_size, and has_parent keys.
>  * Fails if any of these keys exist, since the image already
>  * had a parent.
>  */
> set_parent(string pool_name, __le64 image_id, string snap_name)
>
> /**
>  * Sets has_parent to false.
>  */
> remove_parent() // after all parent data is copied to the child
>
> /**
>  * uuid is required here to prevent malicious users from
>  * removing children they don't have access to.
>  */
> add_child(string pool, __le64 image_id, string snapname, string uuid)
> remove_child(string pool_name, __le64 image_id, string snapname, string
> uuid)
>
> /**
>  * to be run on the rbd_image_names object.
>  */
> get_name(image_id)
> set_name(image_id)
>
> Changes to existing class methods
> =================================
>
> The new snapshot fields will be added to the return value of snapshot_list.
> snapshot_add will need to fill them in.
>
> create will generate a uuid for the image.
>
> Does anyone have any thoughts on the design? Any ways to make it simpler?
>
> Josh
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html