Hi Jeegn

Seems a bit rigid.

thanks
ivan from eisoo

On Wed, Dec 27, 2017 at 2:50 PM, Jeegn Chen <jeegnchen@xxxxxxxxx> wrote:
> Hi Ivan,
>
> In the use case, we expect Pool A and Pool B to have different sets
> of OSDs; different sets of hosts or racks are even recommended.
>
> Thanks,
> Jeegn
>
> 2017-12-27 14:01 GMT+08:00 yuxiang fang <abcdeffyx@xxxxxxxxx>:
>> Hi Jeegn
>>
>> It seems that new nodes have to be added to the same failure domain as
>> Pool B, otherwise we can't expand the capacity.
>> Pool B will then be affected by recovery in Pool A: they are different
>> pools logically but distributed over the same failure domain.
>>
>> thanks
>> ivan from eisoo
>>
>>
>> On Tue, Dec 26, 2017 at 9:48 AM, Jeegn Chen <jeegnchen@xxxxxxxxx> wrote:
>>> Hi all,
>>>
>>> In the daily use of Ceph RGW clusters, we have found some pain points
>>> with the current one-bucket-one-data-pool implementation.
>>> I think one-bucket-multiple-data-pools may help (see the appended
>>> detailed proposal).
>>> What do you think?
>>>
>>> https://etherpad.net/p/multiple-data-pool-support-for-a-bucket
>>>
>>> # **Multiple Data Pool Support for a Bucket**
>>>
>>> ## Motivation
>>>
>>> Currently, a bucket in RGW has only a single data pool (the extra data
>>> pool is only temporary storage for the metadata of in-progress
>>> multipart uploads, which is not in our consideration). The major pain
>>> points are:
>>>
>>> - When the data pool runs out of space and we have to expand it, we
>>> either have to tolerate the performance penalty of heavy recovery IO
>>> or wait a long time for the rebalance to complete (this is especially
>>> true when the original cluster is relatively small and the expansion
>>> roughly doubles its size).
>>>
>>> - Although the new nodes increase the storage capacity, they also
>>> reduce the average number of PGs per OSD, which may make the data
>>> distribution uneven. To address this, we either have to reweight or
>>> increase the PG number, which means yet another round of data
>>> movement.
>>>
>>> If a bucket can have multiple data pools and switch between them,
>>> maintenance becomes easier:
>>>
>>> - The cluster admin can simply add new nodes, create another data pool
>>> and make buckets write to the new pool. No rebalance is needed, so the
>>> expansion is quick and has almost no observable impact on the bucket
>>> user.
>>>
>>> - Say a bucket has two data pools, Pool A and Pool B, and some
>>> maintenance is needed on the nodes of both pools. The admin can direct
>>> write operations to Pool B (reads cannot be switched since data is not
>>> moved), work on the nodes of Pool A, then switch the write IO back to
>>> Pool A and go on to work on the nodes of Pool B. The maintenance can
>>> thus be carried out without heavy write IO interference, which reduces
>>> both the risk and the difficulty.
>>>
>>> ## Design
>>>
>>> "Multiple Data Pool" is borrowed from CephFS. Any directory in CephFS
>>> can have its own data pool, and the data pool can be replaced at any
>>> time. After a data pool switch, new files written to the directory are
>>> persisted in the new pool, while old files in the previous pool remain
>>> accessible since the metadata in the metadata pool keeps the correct
>>> references.
>>>
>>> The major ideas behind supporting multiple data pools for a bucket are:
>>>
>>> - Reuse the existing data_pool to store the head, which always has
>>> 0 size but keeps the manifest referring to the data parts in another
>>> pool.
>>>
>>> - Add a new concept, tail_pool, which is used to store all data except
>>> the heads.
>>>
>>> - The data_pool of a bucket (which now holds only the heads and is in
>>> fact a metadata pool or head pool) should not be changed, but the
>>> bucket can switch between different tail_pools.
>>>
>>> ### Change in RGWZonePlacementInfo
>>>
>>> - Add a new field data_layout_type to RGWZonePlacementInfo. The
>>> default value for data_layout_type is 0 (call it UNIFIED), which means
>>> the current implementation. Use value 1 (call it SPLITTED) for
>>> Multiple Data Pool Support.
>>>
>>> - Add a new field tail_pools, which is a list of pool names.
>>>
>>> - Add a new field current_tail_pool, which is one of the pool names in
>>> tail_pools (see the sketch after this list).
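>>>
>>> A rough standalone C++ sketch of the intended additions follows. The
>>> names data_pool, data_layout_type, UNIFIED, SPLITTED, tail_pools and
>>> current_tail_pool are the ones proposed above; the sketch type, the
>>> helper pool_for_tail() and the second pool name in main() are only
>>> illustrative, and the real RGWZonePlacementInfo carries more fields
>>> plus the usual encode/decode plumbing, all omitted here:
>>>
>>>     #include <cstdint>
>>>     #include <string>
>>>     #include <vector>
>>>
>>>     // Sketch only, not the actual Ceph definition.
>>>     enum class RGWDataLayoutType : uint8_t {
>>>       UNIFIED  = 0,  // current behavior: heads and tails share data_pool
>>>       SPLITTED = 1,  // heads stay in data_pool, tails go to current_tail_pool
>>>     };
>>>
>>>     struct RGWZonePlacementInfoSketch {
>>>       std::string data_pool;                // existing field: only heads when SPLITTED
>>>       RGWDataLayoutType data_layout_type = RGWDataLayoutType::UNIFIED;
>>>       std::vector<std::string> tail_pools;  // every pool that may hold tails
>>>       std::string current_tail_pool;        // must be one of tail_pools
>>>
>>>       // Pool a write operation should use for an object's tail parts.
>>>       const std::string& pool_for_tail() const {
>>>         return data_layout_type == RGWDataLayoutType::SPLITTED
>>>                    ? current_tail_pool
>>>                    : data_pool;
>>>       }
>>>     };
>>>
>>>     int main() {
>>>       RGWZonePlacementInfoSketch p;
>>>       p.data_pool = "default.rgw.buckets.data";
>>>       p.tail_pools = {"default.rgw.buckets.data", "default.rgw.buckets.data2"};
>>>       p.current_tail_pool = p.tail_pools.back();
>>>       p.data_layout_type = RGWDataLayoutType::SPLITTED;
>>>       // Tails of new writes now land in the new pool; heads stay put.
>>>       return p.pool_for_tail() == "default.rgw.buckets.data2" ? 0 : 1;
>>>     }
>>>
>>> Switching the write target during expansion or maintenance would then
>>> amount to updating current_tail_pool, while reads keep resolving the
>>> tail location from the per-object user.rgw.tail_pool xattr described
>>> below.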
>>>
>>> ### Change in Object Head
>>>
>>> Add a new xattr user.rgw.tail_pool, which refers to the pool where the
>>> tail parts of the object reside.
>>>
>>> ### Change in Multipart Meta Object
>>>
>>> Add a new xattr user.rgw.tail_pool, which refers to the pool where the
>>> multiparts of the object reside. This value should be decided in
>>> InitMultipart and be followed by all other operations against the same
>>> upload ID of the same object.
>>>
>>> ### Change in Write Operations
>>>
>>> If a bucket's data_layout_type is SPLITTED (and it has no
>>> explicit_placement), only write the 0-size head (including
>>> user.rgw.tail_pool and all other xattrs) to the data_pool and persist
>>> the tail in current_tail_pool.
>>>
>>> For efficiency, it is recommended to use a replicated pool on SSDs as
>>> the data_pool.
>>>
>>> ### Change in Read Operations
>>>
>>> If a bucket's data_layout_type is SPLITTED (and it has no
>>> explicit_placement), read the tail parts according to the
>>> user.rgw.tail_pool xattr in the head.
>>>
>>> ### Change in GC
>>>
>>> If a bucket's data_layout_type is SPLITTED (and it has no
>>> explicit_placement), the correct user.rgw.tail_pool should be recorded
>>> in the GC list as well, so that the GC thread can remove the tail
>>> parts correctly.
>>>
>>> ### Change in radosgw-admin
>>>
>>> New commands are needed to modify and show the extended
>>> RGWZonePlacementInfo (add a new pool to tail_pools, change
>>> current_tail_pool, and so on).
>>>
>>> Thanks,
>>> Jeegn
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html