Hi Robin,

> Rough implementation:
> - For writes, the RGW zone data describes which pool maps to each
> storage class.
> (corner case: multipart uploads might need each part in a consistent pool)
[Jeegn]: Do you mean that different placement policies correspond to
STORAGE CLASSes?

> - The bucket-index data already describes the RADOS POOL to _read_ from
> (corner case: old buckets/objects don't have this set)
[Jeegn]: Are you referring to the tail_bucket in the RGWObjManifest,
which is somewhat a copy of the placement rule? Currently it seems to
support copies across buckets with the same data pool, and I have not
found the logic to deal with different pools. But yes, maybe we can
reuse it instead of adding an additional xattr to track the tail pools.

> - radosgw-admin already contains bucket/object rewrite functionality,
> that would effectively copy from an old pool into the new pool.
> (note: I don't think it is well-documented at all)
[Jeegn]: Are you talking about "radosgw-admin bucket rewrite"? From
checking RGWRados::rewrite_obj() and check_min_obj_stripe_size() in the
master branch, that functionality is used to migrate from the
explicit-obj implementation to the manifest implementation. Or do you
mean some other command, or an implementation that is still work in
progress?
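To make the xattr idea a bit more concrete, here is a rough sketch of
how the read path could resolve the tail pool, with a fallback for old
objects that were written before the change and so have no such xattr
(the types and names below are made up for illustration; they are not
the actual RGW code):

    #include <iostream>
    #include <map>
    #include <string>

    // Illustrative stand-ins; not the real RGW types.
    struct HeadObject {
      std::map<std::string, std::string> xattrs;  // xattrs on the 0-size head
    };

    struct BucketInfo {
      std::string data_pool;  // the bucket's data pool (holds the heads)
    };

    // Hypothetical xattr key for tracking the tail pool.
    static const std::string kTailPoolAttr = "user.rgw.tail_pool";

    // Prefer the per-object xattr; old objects without it fall back to the
    // bucket's data pool, so existing data stays readable after a switch.
    std::string resolve_tail_pool(const HeadObject& head, const BucketInfo& bucket) {
      auto it = head.xattrs.find(kTailPoolAttr);
      if (it != head.xattrs.end() && !it->second.empty()) {
        return it->second;
      }
      return bucket.data_pool;
    }

    int main() {
      BucketInfo bucket{"default.rgw.buckets.data"};

      HeadObject old_obj;  // written before the change: no tail-pool xattr
      HeadObject new_obj;  // written after the change
      new_obj.xattrs[kTailPoolAttr] = "rgw.buckets.data.tail-b";

      std::cout << resolve_tail_pool(old_obj, bucket) << "\n";  // data_pool fallback
      std::cout << resolve_tail_pool(new_obj, bucket) << "\n";  // recorded tail pool
    }

The same lookup would work if the pool name were taken from the
manifest's tail placement instead of a separate xattr.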
Thanks,
Jeegn

2017-12-29 11:50 GMT+08:00 Robin H. Johnson <robbat2@xxxxxxxxxx>:
> Idea for integration of Jeegn Chen's idea and storage classes (thus
> lifecycle):
>
> Concept: STORAGE CLASSES are backed by one or more RADOS POOLS.
>
> This already roughly exists in placement policies.
>
> Rough implementation:
> - For writes, the RGW zone data describes which pool maps to each
> storage class.
> (corner case: multipart uploads might need each part in a consistent pool)
> - The bucket-index data already describes the RADOS POOL to _read_ from
> (corner case: old buckets/objects don't have this set)
> - radosgw-admin already contains bucket/object rewrite functionality,
> that would effectively copy from an old pool into the new pool.
> (note: I don't think it is well-documented at all)
>
> On Fri, Dec 29, 2017 at 11:08:45AM +0800, Jeegn Chen wrote:
>> The failure domain of the pools is out of RGW's control. Admins can
>> create the pools the way they prefer. This proposal just gives more
>> flexibility and possibility.
>>
>> I think multiple-pool support may just be an experimental start. If it
>> works stably in production, we may even extend it to support STORAGE
>> CLASS in S3 (the same bucket having objects in different pools
>> according to the STORAGE CLASS), and then leveraging lifecycle to move
>> objects between different STORAGE CLASSes may also be possible (of
>> course, more careful design may be needed to make the new complexity
>> elegant).
>>
>> Thanks,
>> Jeegn
>>
>> 2017-12-27 15:51 GMT+08:00 yuxiang fang <abcdeffyx@xxxxxxxxx>:
>> > Hi Jeegn
>> >
>> > Seems a bit rigorous.
>> > thanks
>> > ivan from eisoo
>> >
>> >
>> > On Wed, Dec 27, 2017 at 2:50 PM, Jeegn Chen <jeegnchen@xxxxxxxxx> wrote:
>> >> Hi Ivan,
>> >>
>> >> In this use case, we expect Pool A and Pool B to have different sets
>> >> of OSDs; different sets of hosts or racks are even recommended.
>> >>
>> >> Thanks,
>> >> Jeegn
>> >>
>> >> 2017-12-27 14:01 GMT+08:00 yuxiang fang <abcdeffyx@xxxxxxxxx>:
>> >>> Hi Jeegn
>> >>>
>> >>> It seems that new nodes have to be added to the same failure domain
>> >>> as Pool B, otherwise we can't expand the capacity.
>> >>> Then Pool B will be affected by the recovery of Pool A; they are
>> >>> different pools logically but distributed in the same failure domain.
>> >>>
>> >>> thanks
>> >>> ivan from eisoo
>> >>>
>> >>>
>> >>> On Tue, Dec 26, 2017 at 9:48 AM, Jeegn Chen <jeegnchen@xxxxxxxxx> wrote:
>> >>>> Hi all,
>> >>>>
>> >>>> In the daily use of Ceph RGW clusters, we have found some pain points
>> >>>> with the current one-bucket-one-data-pool implementation.
>> >>>> I guess one-bucket-multiple-data-pools may help (see the appended
>> >>>> detailed proposal).
>> >>>> What do you think?
>> >>>>
>> >>>> https://etherpad.net/p/multiple-data-pool-support-for-a-bucket
>> >>>>
>> >>>> # **Multiple Data Pool Support for a Bucket**
>> >>>>
>> >>>> ## Motivation
>> >>>>
>> >>>> Currently, a bucket in RGW has only a single data pool (the extra data
>> >>>> pool is just temporary storage for in-progress multipart metadata,
>> >>>> which is not in our consideration). The major pain points here are:
>> >>>>
>> >>>> - When the data pool is out of storage and we have to expand it, we
>> >>>> either have to tolerate the performance penalty due to high recovery
>> >>>> IO or have to wait a long time for the rebalance to complete (this
>> >>>> situation is especially true when the original cluster is relatively
>> >>>> small and the expansion usually means doubling the size).
>> >>>>
>> >>>> - Although the new nodes increase the storage capacity, they also
>> >>>> reduce the average PG number per OSD, which may make the data
>> >>>> distribution uneven. To address this, we either have to reweight or
>> >>>> have to increase the PG number, which means another data movement.
>> >>>>
>> >>>> If a bucket can have multiple data pools and switch between them,
>> >>>> maintenance may become easier:
>> >>>>
>> >>>> - The cluster admin can simply add new nodes, create another data pool
>> >>>> and then make buckets write to the new pool. No rebalance is needed,
>> >>>> so the expansion is quick and has almost no observable impact on the
>> >>>> bucket user.
>> >>>>
>> >>>> - Say a bucket has 2 data pools, Pool A and Pool B, and some
>> >>>> maintenance is needed on the nodes of both pools. The admin can make
>> >>>> write operations go to Pool B (reads cannot be switched since data is
>> >>>> not moved), operate on the nodes of Pool A, then switch the write IO
>> >>>> back to Pool A and go on to operate on the nodes of Pool B. The
>> >>>> maintenance can thus be carried out without heavy write IO
>> >>>> interference, and in turn the risk and difficulty are reduced.
>> >>>>
>> >>>> ## Design
>> >>>>
>> >>>> The "multiple data pool" idea is borrowed from CephFS. Any directory
>> >>>> in CephFS can have its own data pool, and the data pool can be
>> >>>> replaced at any time. After a data pool switch, new files written to
>> >>>> the directory are persisted into the new pool, while old files in the
>> >>>> previous pool are still accessible since the metadata in the metadata
>> >>>> pool holds the correct reference.
>> >>>>
>> >>>> The major idea to support multiple data pools for a bucket is:
>> >>>>
>> >>>> - Reuse the existing data_pool to store the head, which always has
>> >>>> 0 size but keeps the manifest referring to the data parts in another
>> >>>> pool.
>> >>>>
>> >>>> - Add a new concept, the tail_pool, which is used to store all data
>> >>>> except the heads.
>> >>>>
>> >>>> - The data_pool of a bucket (which now holds only the heads and is in
>> >>>> fact a metadata pool or head pool) should not be changed, but the
>> >>>> bucket can switch between different tail_pools.
>> >>>>
>> >>>> ### Change in RGWZonePlacementInfo
>> >>>>
>> >>>> - Add a new field data_layout_type to RGWZonePlacementInfo. The
>> >>>> default value for data_layout_type is 0 (name it UNIFIED), which
>> >>>> means the current implementation. Let's use value 1 (name it
>> >>>> SPLITTED) for multiple data pool support.
>> >>>>
>> >>>> - Add a new field tail_pools, which is a list of pool names.
>> >>>>
>> >>>> - Add a new field current_tail_pool, which is one of the pool names in
>> >>>> tail_pools.
>> >>>>
>> >>>> ### Change in Object Head
>> >>>>
>> >>>> Add a new xattr user.rgw.tail_pool, which refers to the pool where the
>> >>>> tail parts of the object reside.
>> >>>>
>> >>>> ### Change in Multipart Meta Object
>> >>>>
>> >>>> Add a new xattr user.rgw.tail_pool, which refers to the pool where the
>> >>>> parts of the object reside. This value should be decided in
>> >>>> InitMultipart and be followed by other operations against the same
>> >>>> upload ID of the same object.
>> >>>>
>> >>>> ### Change in Write Operations
>> >>>>
>> >>>> If a bucket's data_layout_type is SPLITTED (and it has no
>> >>>> explicit_placement), only write the 0-size head (including
>> >>>> user.rgw.tail_pool and all other xattrs) to data_pool and persist the
>> >>>> tail in current_tail_pool.
>> >>>>
>> >>>> For efficiency, it is recommended to use a replicated pool on SSDs as
>> >>>> the data_pool.
>> >>>>
>> >>>> ### Change in Read Operations
>> >>>>
>> >>>> If a bucket's data_layout_type is SPLITTED (and it has no
>> >>>> explicit_placement), read the tail parts according to the
>> >>>> user.rgw.tail_pool xattr in the head.
>> >>>>
>> >>>> ### Change in GC
>> >>>>
>> >>>> If a bucket's data_layout_type is SPLITTED (and it has no
>> >>>> explicit_placement), the correct user.rgw.tail_pool should be recorded
>> >>>> in the GC list as well, so that the GC thread can remove the tail
>> >>>> parts correctly.
>> >>>>
>> >>>> ### Change in radosgw-admin
>> >>>>
>> >>>> New commands are needed to modify and show the extended
>> >>>> RGWZonePlacementInfo (add a new pool to tail_pools, change
>> >>>> current_tail_pool, and so on).
>> >>>>
>> >>>> Thanks,
>> >>>> Jeegn
>
> --
> Robin Hugh Johnson
> Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> E-Mail : robbat2@xxxxxxxxxx
> GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
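To make the proposed placement changes a bit more concrete, here is a
rough sketch of the new fields and of the write-side pool selection the
proposal implies (illustrative only: the field names follow the proposal
text quoted above, while the type and function names are made up; this
is not the existing RGWZonePlacementInfo code):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Layout types as proposed: 0 keeps today's behaviour, 1 splits head/tail.
    enum class DataLayoutType : uint8_t { UNIFIED = 0, SPLITTED = 1 };

    // Illustrative stand-in for the proposed additions to the placement info.
    struct PlacementInfoSketch {
      std::string data_pool;                // unchanged: holds the heads
      DataLayoutType data_layout_type = DataLayoutType::UNIFIED;
      std::vector<std::string> tail_pools;  // proposed: pools that may hold tails
      std::string current_tail_pool;        // proposed: where new tails are written

      // New tails go to current_tail_pool when SPLITTED; otherwise everything
      // stays in data_pool exactly as today.
      const std::string& pool_for_new_tail() const {
        if (data_layout_type == DataLayoutType::UNIFIED) {
          return data_pool;
        }
        if (std::find(tail_pools.begin(), tail_pools.end(),
                      current_tail_pool) == tail_pools.end()) {
          throw std::runtime_error("current_tail_pool is not listed in tail_pools");
        }
        return current_tail_pool;
      }
    };

    int main() {
      PlacementInfoSketch p;
      p.data_pool = "default.rgw.buckets.data";
      p.data_layout_type = DataLayoutType::SPLITTED;
      p.tail_pools = {"rgw.buckets.data.tail-a", "rgw.buckets.data.tail-b"};
      p.current_tail_pool = "rgw.buckets.data.tail-b";  // the admin's "switch"

      std::cout << "new tails go to: " << p.pool_for_new_tail() << "\n";
    }

Switching the write target would then just be an update of
current_tail_pool in the zone placement info (after adding the new pool
to tail_pools), which is what the new radosgw-admin commands would have
to expose.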