On Thu, May 24, 2012 at 4:05 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote: > RBD object format changes > ========================= > > To enable us to add more features to rbd, including copy-on-write > cloning via layering, we need to change to rbd header object > format. Since this won't be backwards compatible, the old format will > still be used by default. Once layering is implemented, the old format > will be deprecated, but still usable with an extra option (something > like rbd create --legacy ...). Clients will still be able to read the > old format, and images can be converted by exporting and importing them. > > While we're making these changes, we can clean up the way librbd and > the rbd kernel module access the header, so that they don't have to > change each time we change the header format. Instead of reading the > header directly, they can use the OSD class mechanism to interact with > it. librbd already does this for snapshots, but kernel rbd reads the > entire header directly. Making them both use a well-defined api will > make later format additions much simpler. I'll describe the changes > needed in general, and then those that are needed for rbd layering. > > New format, pre-layering > ======================== > > Right now the header object is name $image_name.rbd, and the data > objects are named rb.$image_id_lowbits.$image_id_highbits.$object_number. > Since we're making other incompatible changes, we have a chance to > rename these to be less likely to collide with other objects. Prefixing > them with a more specific string will help, and will work well with > a new security feature for layering discussed later. The new > names are: > > rbd_header.$image_name > rbd_data.$id.$object_number > > The new header will have the existing (used) fields of the old format as > key/value pairs in an omap (this is the rados interface that stores > key/value pairs in leveldb). Specifically, the existing fields are: > > * object_prefix // previously known as block_name > * order // bit shift to determine size of the data objects > * size // total size of the image in bytes > * snap_seq // latest snapshot id used with the image > * snapshots // list of (snap_name, snap_id, image_size) tuples > > To make adding new things easier, there will be an additional > 'features' field, which is a mask of the features used by the image. > Clients will know whether they can use an image by checking if they > support all the features the image uses that the osd reports as being > incompatible (see get_info() below). > > RBD class interface > =================== > > Here's a proposed basic interface - new features will > add more functions and data to existing ones. > > /** > * Initialize the header with basic metadata. > * Extra features may initialize more fields in the future. > * Everything is stored as key/value pairs as omaps in the header object. > * > * If features the OSD does not understand are requested, -ENOSYS is > * returned. > */ > create(__le64 size, __le32 order, __le64 features) > > /** > * Get the metadata about the image required to do I/O > * to it. In the future this may include extra information for > * features that require it, like encryption/compression type. > * This extra data will be added at the end of the response, so > * clients that don't support it don't interpret it. > * > * Features that would require clients to be updated to access > * the image correctly (such as image bitmaps) are set in > * the incompat_features field. A client that doesn't understand > * those features will return an error when they try to open > * the image. > * > * The size and any extra information is read from the appropriate > * snapshot metadata, if snapid is not CEPH_NOSNAP. > * > * Returns __le64 size, __le64 order, __le64 features, > * __le64 incompat_features, __le64 snapseq and > * list of __le64 snapids > */ > get_info(__le64 snapid) > > /** > * Used when resizing the image. Sets the size in bytes. > */ > set_size(__le64 size) > > /** > * The same as the existing snap_add/snap_remove methods, but using the > * new format. > */ > snapshot_add(string snap_name, __le64 snap_id) > snapshot_remove(string snap_name) > > /** > * list snapshots - like the existing snap_list, but > * can return a subset of them. > * > * Returns __le64 snap_seq, __le64 snap_count, and a list of tuples > * (snap_id, snap_size) just like the current snap_list > */ > snapshot_list(__le64 max_len) > > /** > * The same as the existing method. Should only be called > * on the rbd_info object. > * Returns an id number to use for a new image. > */ > assign_bid() > > > RBD layering > ============ > > The first step is to implement trivial layering, i.e. > layering without bitmaps, as described at: > > http://marc.info/?l=ceph-devel&m=129867273303846&w=2 > > There are a couple of things that complicate the implementation: > > 1) making sure parent images are not deleted when children still > refer to them > > A simple way to solve this is to add a reference count to the parent > image. This can cause issues with partially deleted images, if the > reference count is decremented more than once because the child > image's header was only deleted the second time 'rbd rm' was run. > > To prevent this, a full list of children can be used. When an image is > cloned, the new image is added to the list of children. When a child is > deleted, it is removed from the list. Keeping this all in the parent > image's header leads to the second issue: I don't like the idea of writing anything to the parent. We can have an orthogonal directory in a different location. > > 2) cloning an image into a different pool without giving the cloner > write access to the parent image's pool > > The current capabilities implemented with cephx only allow you to > restrict users to reading, writing or executing class methods on a > per-pool basis. > > For the child image in rbd, we need to be able to read the data > objects of the parent image, but only interact with the parent image > header through certain class methods, namely add_child and > remove_child during cloning and deletion. > > One way to do this is adding a whitelist of class methods to the > capabilities system, but this would be hard to manage as more class > methods are added. A more manageable way is to give classes some > string they can interpret as permissions however they wish. Combined > with allowing clients to access objects matching certain prefixes, > this can restrict access to the image header to going through the rbd > class, but still allow allow read-only access to the data objects. That sounds useful feature, however, it also sounds like a much bigger hammer than you need for that specific problem. As I said, I don't think we should modify the parent. You can keep that list on a different object that relates to the parent (e.g., rbd_ref.$image_name), or in a central place. You can put that object in a different pool, where the client has write permissions. > > If we change the names of the rbd header and data objects to start > with rbd_header and rbd_data, respectively, we have something like: > > allow prefix rbd_header class rbd image-child pool=templates > allow prefix rbd_data r pool=templates > > where 'image-child' is interpreted by the rbd class to mean 'only > allow adding or removing a child'. > > The problem with this is that the restricted client can still remove > any child, not just images it has access to. To get around this, we > can give each image a randomly generated uuid, and store that in the > child header and the parent's list of children. Then when someone > calls remove_child, they must pass the uuid in addition to their pool, > name, and snapshot, and it will only be processed if it matches the > uuid in the parent header. > > One thing that's not addressed in the earlier design is how to make > images read-only. The simplest way would be to only support layering > on top of snapshots, which are read-only by definition. > > Another way would be to allow images to be set read-only or > read-write, and disallow setting images with children read-write. Are > there many use cases that would justify this second, more complicated > way? I think that explicitly setting a read-only flag on the parent (if not set) is enough. No need to do other magic (see my above comments). > > Copy-up > ======= > > Another feature we want to include with layering is the ability to > copy all remaining data from the parent image to the child image, to > break the dependency of the latter on the former. This does not change > snapshots that were taken earlier though - they still rely on the > parent image. Thus, the children of a parent image will need to > include snapshots as well, and the reference to the parent image will > be needed to interact with snapshots. Thus, we can't just remove the > information pointing the parent. Instead, we can add a boolean > has_parent field that is stored in the header and with each snapshot, > since some snapshots may be taken when the parent was still used, and > some after all the data has been copied to the child. To generalize: a snapshot context would contain the source rbd image. > > Renaming > ======== > > In order to support renaming layered images, we can use the id > assigned to each image in place of the name. We just need to store a > mapping from ids to names in each pool. Eventually this can replace > rbd_directory, when we stop supporting the old format. This can't > happen right now because clients assume rbd_directory is a tmap. > > Thus, the parent and child image lists would contain (pool name, image > id, snapshot name) tuples. Pools and snapshots can't be renamed, so > they don't have this problem. Image ids are unique within a pool, so > (pool name, image id) uniquely identifies an image. > > Resizing > ======== > > To support resizing of layered images, we need to keep track of the > minimum size the image ever was, so that if a child image is shrunk > and then expanded, the re-expanded space is treated as unused instead > of being read from the parent image. Since this can change over time, > we need to store this for each snapshot as well. > > In summary, the format changes specific to adding layering are: > > New object > ========== > > rbd_images_names // stores a mapping from image ids to image names > > New header fields > ================= > > * parent_pool, parent_image_id, parent_snapshot > * uuid > * children - tuples of (pool, image_id, snapshot) > * min_size > * has_parent > * new fields in snapshots: > - min_size > - has_parent > > New rbd class methods > ===================== > > /** > * Sets the parent, min_size, and has_parent keys. > * Fails if any of these keys exist, since the image already > * had a parent. > */ > set_parent(string pool_name, __le64 image_id, string snap_name) > > /** > * Sets has_parent to false. > */ > remove_parent() // after all parent data is copied to the child > > /** > * uuid is required here to prevent malicious users from > * removing children they don't have access to. > */ > add_child(string pool, __le64 image_id, string snapname, string uuid) > remove_child(string pool_name, __le64 image_id, string snapname, string > uuid) > > /** > * to be run on the rbd_image_names object. > */ > get_name(image_id) > set_name(image_id) > > Changes to existing class methods > ================================= > > The new snapshot fields will be added to the return value of snapshot_list. > snapshot_add will need to fill them in. > > create will generate a uuid for the image. > > Does anyone have any thoughts on the design? Any ways to make it simpler? > > Josh > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html