Here are my current thoughts about tiering, and also specifically about cloud tiering.

1. Storage classes

Previously a placement target mapped into a set of rados pools (index, data, extra); now placement targets will also carry storage classes (the same notion S3 uses). Object placement will be defined by the combination of placement target and storage class.

- If a storage class is not specified, the standard class is used, or the default class for the bucket.
- The set of supported storage classes will need to be defined as part of the zonegroup configuration.
- Each zone will have a mapping between the existing storage classes and the set of rados pools that backs each of them.
- A bucket has a default placement target. Object heads are always written to the default placement target, even if the object is being put on a different placement target; the object's manifest then points at the tail wherever it was placed.
- A bucket will have a default storage class. The X-Amz-Storage-Class header can be set when creating the bucket, and this will set the default placement target for the bucket. Note that this cannot be changed later (as this is where objects' heads reside).
- We should probably make it so that when head and tail are placed on different placement targets, the head will not contain any data other than the object's metadata. (Both the class-to-pool mapping and this head/tail rule are sketched below, after section 2.)

The code that implements the above can be found here:
https://github.com/yehudasa/ceph/tree/wip-rgw-tiering-3
(the multipart upload stuff was incomplete, but is addressed now)

2. Cloud targets

There are many options, and we're not going to implement everything. Here are a few points to consider:

- How is data written to the backend cloud?
  The question here is whether the generated objects can be read directly by a client application, or whether we are going to mangle the data in some way, for example stripe it, encrypt it, etc.

- Indexed by us?
  The important question here is actually: do we keep a head object for each object that is created on the remote tier? Do we keep a bucket index, or do we rely on the backing cloud for this info? If we index it, how do we make sure we stay synchronized? Do we need to?

- Proxied?
  When reading (and possibly writing) data, are we going to serve as a proxy, or do we just send redirects? Redirects might be the easiest way to implement tiering; however, they cripple access control, since we don't have complete control over the remote cloud (we probably only have credentials that represent a single user).

- Bucket/object name mappings
  When dealing with cloud services over which we don't have complete control, we'd need to map bucket and object names to the names that will be used on the cloud service. This means that multiple rgw buckets could be written to the same destination bucket. The cloud sync code already does this.

- ACL mappings
  Object ACLs need to be converted to ACLs on the remote system. The cloud sync code already does this.
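To make the name and ACL mapping concrete, here is a rough sketch of what such a translation could look like. Everything in it is illustrative (RemoteObjectRef, map_to_remote and the ACL table are made-up names and values); it is not the actual cloud sync code, just the shape of the idea:

    // Rough, hypothetical sketch of mapping rgw bucket/object names and canned
    // ACLs onto a single destination bucket on a remote cloud. The names here
    // (RemoteObjectRef, map_to_remote, ...) are made up for illustration.
    #include <iostream>
    #include <map>
    #include <string>

    struct RemoteObjectRef {
      std::string bucket;  // destination bucket on the cloud service
      std::string key;     // object key within that bucket
      std::string acl;     // canned ACL understood by the remote service
    };

    // Many rgw buckets can share one destination bucket, so the source bucket
    // name becomes a key prefix rather than a separate remote bucket.
    RemoteObjectRef map_to_remote(const std::string& target_bucket,
                                  const std::string& rgw_bucket,
                                  const std::string& rgw_object,
                                  const std::string& rgw_acl) {
      // Local grants collapse into whatever our single set of remote
      // credentials allows; translate a couple of canned ACLs as an example.
      static const std::map<std::string, std::string> acl_map = {
        {"private", "private"},
        {"public-read", "public-read"},
      };
      auto it = acl_map.find(rgw_acl);
      std::string remote_acl = (it != acl_map.end()) ? it->second : "private";
      return RemoteObjectRef{target_bucket, rgw_bucket + "/" + rgw_object,
                             remote_acl};
    }

    int main() {
      RemoteObjectRef ref =
          map_to_remote("tier-target", "photos", "2018/cat.jpg", "public-read");
      // prints: tier-target photos/2018/cat.jpg public-read
      std::cout << ref.bucket << " " << ref.key << " " << ref.acl << "\n";
      return 0;
    }

The point is mainly that the source bucket collapses into a key prefix, and that local grants can only be approximated by whatever canned ACLs the remote service (and our single set of credentials) allows.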
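Going back to section 1, here is a similarly rough sketch of the per-zone storage-class-to-pool mapping and the head/tail placement rule. Again, the types and names are made up for illustration and don't reflect the actual zone/zonegroup structures; for simplicity the zone's default class stands in for the bucket's default placement target:

    // Rough, hypothetical sketch of a per-zone mapping from storage class to
    // rados pools, plus the head/tail placement rule from section 1. Types and
    // field names are made up; they are not the actual zone/zonegroup structs.
    #include <iostream>
    #include <map>
    #include <string>

    struct PoolSet {
      std::string index_pool;
      std::string data_pool;
      std::string extra_pool;
    };

    struct ZonePlacement {
      // The zonegroup defines which storage classes exist; the zone decides
      // which pools back each class within a placement target.
      std::string default_storage_class = "STANDARD";
      std::map<std::string, PoolSet> pools_by_class;
    };

    struct ObjectPlacement {
      PoolSet head;          // head stays on the default placement
      PoolSet tail;          // tail goes wherever the requested class points
      bool head_holds_data;  // only when head and tail end up in the same place
    };

    ObjectPlacement place_object(const ZonePlacement& zp,
                                 std::string storage_class) {
      if (storage_class.empty()) {
        storage_class = zp.default_storage_class;
      }
      // at() throws if the class isn't configured for this zone; good enough
      // for a sketch.
      const PoolSet& head = zp.pools_by_class.at(zp.default_storage_class);
      const PoolSet& tail = zp.pools_by_class.at(storage_class);
      bool same = (storage_class == zp.default_storage_class);
      return ObjectPlacement{head, tail, same};
    }

    int main() {
      ZonePlacement zp;
      zp.pools_by_class["STANDARD"] = {"zone.rgw.buckets.index",
                                       "zone.rgw.buckets.data",
                                       "zone.rgw.buckets.non-ec"};
      zp.pools_by_class["COLD"] = {"zone.rgw.buckets.index",
                                   "zone.rgw.buckets.cold",
                                   "zone.rgw.buckets.non-ec"};
      ObjectPlacement p = place_object(zp, "COLD");
      std::cout << "head data pool: " << p.head.data_pool
                << ", tail data pool: " << p.tail.data_pool
                << ", head holds data: " << std::boolalpha << p.head_holds_data
                << "\n";
      return 0;
    }

The only decision being modeled is that the head stays on the default placement and carries data only when head and tail end up in the same place.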
3. Cloud tier implementation

A lot of it depends on what we decide to do in (2). I think that as a start we can focus on the following:

* Objects on the cloud tier should be readable externally.
  This entails a few things. It means that objects aren't striped or encoded in some way, but are kept as whole objects on the backend.

* Indexed by rgw, proxied writes; some reads (user object data reads) can be redirected; we should be able to read remote objects internally.
  The reasoning behind this is that it keeps the current rgw behavior of having a head object that holds the object's metadata. Without it most of the rgw object functionality will not work, and I think that as a first step we want to keep the functional behavior close to what it is today. This also means that we index the objects, although bucket listing could probably be redirected. User object reads don't need to be proxied, as long as presigned redirects work.

Implementing this will require:

- Creating a new type of object put processor that will be able to store the data remotely. The head object should still be stored on the bucket's default tier. Note that for this to work we will need to make sure that even if the bucket's default tier is a cloud tier, we still treat it as a local tier for storing the objects' heads.
- Object read iteration should be able to read a remote object.
- Object copy could trigger a remote copy (if source and destination are on the same remote tier).
- In general, object copies from and to a remote tier should be done via a background worker, as they might take too long.
- The manifest should also reflect the required info. In any case, it no longer stores any info that is rados specific, so it might not require many (or even any) changes.
- We should refactor the whole data object access api, so that things are done cleanly.
- Stuff like multipart objects will also need to be addressed. Part creation will need to be proxied, and the complete will create the needed local head.
- Remote cloud objects could be versioned, in which case we could have a more reliable head-to-tail mapping.

Thoughts?

Yehuda