Idea for integration of Jeegn Chen's idea and storage classes (thus lifecycle):

Concept: STORAGE CLASSES are backed by one or more RADOS POOLS. This already
roughly exists in placement policies.

Rough implementation:
- For writes, the RGW zone data describes which pool maps to each storage
  class. (corner case: multipart uploads might need each part in a consistent
  pool)
- The bucket-index data already describes the RADOS POOL to _read_ from.
  (corner case: old buckets/objects don't have this set)
- radosgw-admin already contains bucket/object rewrite functionality that
  would effectively copy from an old pool into the new pool. (note: I don't
  think this is well documented at all)
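As a rough sketch of the write-side mapping above (type, field and pool names
are hypothetical, not actual RGW code; the fallback to the default data pool
is an assumption), the selection could behave like this:

  // Hedged sketch only: illustrates a storage-class -> data-pool map in the
  // zone placement data that the write path consults.
  #include <iostream>
  #include <map>
  #include <string>

  struct placement_sketch {
    std::string data_pool;                               // existing default data pool
    std::map<std::string, std::string> class_data_pools; // hypothetical: STORAGE CLASS -> RADOS pool

    // Resolve the pool a new object (or every part of one multipart upload,
    // per the corner case above) should be written to; unknown classes fall
    // back to the default data pool (assumption).
    const std::string& pool_for(const std::string& storage_class) const {
      auto it = class_data_pools.find(storage_class);
      return it != class_data_pools.end() ? it->second : data_pool;
    }
  };

  int main() {
    placement_sketch p;
    p.data_pool = "default.rgw.buckets.data";                 // illustrative names
    p.class_data_pools["COLD"] = "default.rgw.buckets.cold";
    std::cout << p.pool_for("COLD") << "\n";      // -> default.rgw.buckets.cold
    std::cout << p.pool_for("STANDARD") << "\n";  // -> default.rgw.buckets.data
    return 0;
  }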
On Fri, Dec 29, 2017 at 11:08:45AM +0800, Jeegn Chen wrote:
> The failure domain of the pools is outside RGW's control. Admins can
> create the pools in whatever way they prefer. This proposal just gives
> more flexibility and possibility.
>
> I think multiple-pool support may just be an experimental start. If it
> works stably in production, we may even extend it to support the S3
> STORAGE CLASS concept (the same bucket has objects in different pools
> according to their STORAGE CLASS), and then leveraging lifecycle to
> move objects between different STORAGE CLASSes may also become
> possible (of course, more careful design would be needed to make the
> new complexity elegant).
>
> Thanks,
> Jeegn
>
> 2017-12-27 15:51 GMT+08:00 yuxiang fang <abcdeffyx@xxxxxxxxx>:
> > Hi Jeegn
> >
> > Seems a bit rigorous.
> >
> > thanks
> > ivan from eisoo
> >
> >
> > On Wed, Dec 27, 2017 at 2:50 PM, Jeegn Chen <jeegnchen@xxxxxxxxx> wrote:
> >> Hi Ivan,
> >>
> >> In this use case, we expect Pool A and Pool B to have different sets
> >> of OSDs; different sets of hosts or even racks are recommended.
> >>
> >> Thanks,
> >> Jeegn
> >>
> >> 2017-12-27 14:01 GMT+08:00 yuxiang fang <abcdeffyx@xxxxxxxxx>:
> >>> Hi Jeegn
> >>>
> >>> It seems that new nodes have to be added to the same failure domain
> >>> as Pool B, otherwise we can't expand the capacity.
> >>> Then Pool B will be affected by the recovery of Pool A: they are
> >>> logically different pools, but distributed over the same failure
> >>> domain.
> >>>
> >>> thanks
> >>> ivan from eisoo
> >>>
> >>>
> >>> On Tue, Dec 26, 2017 at 9:48 AM, Jeegn Chen <jeegnchen@xxxxxxxxx> wrote:
> >>>> Hi all,
> >>>>
> >>>> In the daily use of Ceph RGW clusters, we have found some pain
> >>>> points with the current one-bucket-one-data-pool implementation.
> >>>> I guess one-bucket-multiple-data-pools may help (see the appended
> >>>> detailed proposal).
> >>>> What do you think?
> >>>>
> >>>> https://etherpad.net/p/multiple-data-pool-support-for-a-bucket
> >>>>
> >>>> # **Multiple Data Pool Support for a Bucket**
> >>>>
> >>>> ## Motivation
> >>>>
> >>>> Currently, a bucket in RGW has only a single data pool (the extra
> >>>> data pool is just temporary storage for in-progress multipart
> >>>> metadata, which is not considered here). The major pain points are:
> >>>>
> >>>> - When the data pool runs out of storage and we have to expand it,
> >>>> we either have to tolerate the performance penalty of high recovery
> >>>> IO or wait a long time for the rebalance to complete (this is
> >>>> especially true when the original cluster is relatively small and
> >>>> the expansion usually means doubling its size).
> >>>>
> >>>> - Although the new nodes increase the storage capacity, they also
> >>>> reduce the average number of PGs per OSD, which may make the data
> >>>> distribution uneven. To address this, we either have to reweight or
> >>>> increase the PG count, which means yet another data movement.
> >>>>
> >>>> If a bucket can have multiple data pools and switch between them,
> >>>> maintenance becomes easier:
> >>>>
> >>>> - The cluster admin can simply add new nodes, create another data
> >>>> pool and make buckets write to the new pool. No rebalance is
> >>>> needed, so the expansion is quick and has almost no observable
> >>>> impact on bucket users.
> >>>>
> >>>> - Say a bucket has two data pools, Pool A and Pool B, and some
> >>>> maintenance is needed on the nodes of both pools. The admin can
> >>>> direct writes to Pool B (reads cannot be switched, since the data
> >>>> is not moved), operate on the nodes of Pool A, then switch write IO
> >>>> back to Pool A and go on to operate on the nodes of Pool B. The
> >>>> maintenance can thus be carried out without heavy write IO
> >>>> interference, which reduces the risk and difficulty.
> >>>>
> >>>> ## Design
> >>>>
> >>>> The "multiple data pool" idea is borrowed from CephFS. Any
> >>>> directory in CephFS can have its own data pool, and that data pool
> >>>> can be replaced at any time. After a data pool switch, new files
> >>>> written to the directory are persisted in the new pool, while old
> >>>> files in the previous pool remain accessible, since the metadata in
> >>>> the metadata pool keeps the correct references.
> >>>>
> >>>> The major idea for supporting multiple data pools for a bucket is:
> >>>>
> >>>> - Reuse the existing data_pool to store the head, which always has
> >>>> size 0 but keeps the manifest referring to the data parts in
> >>>> another pool.
> >>>>
> >>>> - Add a new concept, tail_pool, which stores all the data except
> >>>> the heads.
> >>>>
> >>>> - The data_pool of a bucket (which now holds only the heads and is
> >>>> in fact a metadata or head pool) should not be changed, but the
> >>>> bucket can switch between different tail_pools.
> >>>>
> >>>> ### Change in RGWZonePlacementInfo
> >>>>
> >>>> - Add a new field data_layout_type to RGWZonePlacementInfo. The
> >>>> default value of data_layout_type is 0 (name it UNIFIED), which
> >>>> means the current implementation. Use value 1 (name it SPLITTED)
> >>>> for multiple data pool support.
> >>>>
> >>>> - Add a new field tail_pools, which is a list of pool names.
> >>>>
> >>>> - Add a new field current_tail_pool, which is one of the pool names
> >>>> in tail_pools.
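(Interjecting inline with a rough sketch of the proposed RGWZonePlacementInfo
additions. The field names data_layout_type, tail_pools, current_tail_pool and
the UNIFIED/SPLITTED values come from the proposal above; the struct, helpers
and pool names are illustrative only, not actual RGW code.)

  // Hedged sketch only: the proposed placement fields plus the tail-pool
  // switch an admin would perform, based on the text above.
  #include <cassert>
  #include <string>
  #include <vector>

  enum class data_layout_t { UNIFIED = 0, SPLITTED = 1 };

  struct zone_placement_sketch {
    std::string data_pool;               // existing: effectively the head/metadata pool
    data_layout_t data_layout_type = data_layout_t::UNIFIED;
    std::vector<std::string> tail_pools; // proposed: all tail pools used by this placement
    std::string current_tail_pool;       // proposed: where new tails are written

    // Admin-side switch: keep every tail pool ever used in tail_pools so that
    // reads and GC can still resolve older objects (assumption).
    void switch_tail_pool(const std::string& pool) {
      bool known = false;
      for (const auto& p : tail_pools) known = known || (p == pool);
      if (!known) tail_pools.push_back(pool);
      current_tail_pool = pool;
    }

    // Pool that a brand-new write should place its tail into.
    const std::string& tail_pool_for_write() const {
      return data_layout_type == data_layout_t::SPLITTED ? current_tail_pool : data_pool;
    }
  };

  int main() {
    zone_placement_sketch z;
    z.data_pool = "default.rgw.buckets.data";  // illustrative pool names
    z.data_layout_type = data_layout_t::SPLITTED;
    z.switch_tail_pool("default.rgw.buckets.tail-a");
    assert(z.tail_pool_for_write() == "default.rgw.buckets.tail-a");
    z.switch_tail_pool("default.rgw.buckets.tail-b"); // e.g. after adding new nodes
    assert(z.tail_pools.size() == 2);
    return 0;
  }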
> >>>> ### Change in Object Head
> >>>>
> >>>> Add a new xattr, user.rgw.tail_pool, which refers to the pool where
> >>>> the tail parts of the object reside.
> >>>>
> >>>> ### Change in Multipart Meta Object
> >>>>
> >>>> Add a new xattr, user.rgw.tail_pool, which refers to the pool where
> >>>> the parts of the multipart object reside. This value should be
> >>>> decided in InitMultipart and honored by the other operations
> >>>> against the same upload ID of the same object.
> >>>>
> >>>> ### Change in Write Operations
> >>>>
> >>>> If a bucket's data_layout_type is SPLITTED (and it has no
> >>>> explicit_placement), write only the 0-size head (including
> >>>> user.rgw.tail_pool and all other xattrs) to data_pool and persist
> >>>> the tail in current_tail_pool.
> >>>>
> >>>> For efficiency, a replicated pool on SSDs is recommended as the
> >>>> data_pool.
> >>>>
> >>>> ### Change in Read Operations
> >>>>
> >>>> If a bucket's data_layout_type is SPLITTED (and it has no
> >>>> explicit_placement), read the tail parts from the pool recorded in
> >>>> the head's user.rgw.tail_pool xattr.
> >>>>
> >>>> ### Change in GC
> >>>>
> >>>> If a bucket's data_layout_type is SPLITTED (and it has no
> >>>> explicit_placement), the correct user.rgw.tail_pool should be
> >>>> recorded in the GC list as well, so that the GC thread can remove
> >>>> the tail parts correctly.
> >>>>
> >>>> ### Change in radosgw-admin
> >>>>
> >>>> New commands are needed to modify and show the changed
> >>>> RGWZonePlacementInfo (add a new pool to tail_pools, change
> >>>> current_tail_pool, and so on).
> >>>>
> >>>> Thanks,
> >>>> Jeegn
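To tie the read/GC side together, here is a rough sketch of resolving the tail
pool from the head's xattr as described above. Only the user.rgw.tail_pool
attribute name comes from the proposal; the helper, type and pool names, and
the fallback to the bucket's data_pool for objects written before the SPLITTED
layout, are illustrative assumptions, not actual RGW code.

  // Hedged sketch only: how the read path (and GC) could pick the pool that
  // holds an object's tail parts.
  #include <iostream>
  #include <map>
  #include <string>

  struct head_sketch {
    std::map<std::string, std::string> xattrs; // e.g. "user.rgw.tail_pool" -> pool name
  };

  // Prefer the per-object xattr written at PUT/InitMultipart time; fall back
  // to the bucket's data_pool for old objects without the xattr (assumption).
  std::string tail_pool_for_read(const head_sketch& head,
                                 const std::string& bucket_data_pool) {
    auto it = head.xattrs.find("user.rgw.tail_pool");
    return it != head.xattrs.end() ? it->second : bucket_data_pool;
  }

  int main() {
    head_sketch new_obj{{{"user.rgw.tail_pool", "default.rgw.buckets.tail-b"}}};
    head_sketch old_obj{};  // object written before the layout change
    std::cout << tail_pool_for_read(new_obj, "default.rgw.buckets.data") << "\n"; // tail-b
    std::cout << tail_pool_for_read(old_obj, "default.rgw.buckets.data") << "\n"; // data pool
    return 0;
  }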
-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136