Hi all,

In the daily operation of our Ceph RGW clusters, we have run into some pain points with the current one-bucket-one-data-pool implementation. I think one-bucket-multiple-data-pools may help (see the detailed proposal appended below). What do you think?

https://etherpad.net/p/multiple-data-pool-support-for-a-bucket

# **Multiple Data Pool Support for a Bucket**

## Motivation

Currently, a bucket in RGW has only a single data pool (the extra data pool is just temporary storage for in-progress multipart metadata, so it is not considered here). The major pain points are:

- When the data pool runs out of capacity and we have to expand it, we either tolerate the performance penalty caused by heavy recovery IO or wait a long time for the rebalance to complete (this is especially true when the original cluster is relatively small and the expansion roughly doubles its size).
- Although the new nodes increase the storage capacity, they also reduce the average number of PGs per OSD, which may make the data distribution uneven. To address this, we either have to reweight OSDs or increase the PG count, which means yet another round of data movement.

If a bucket could have multiple data pools and switch between them, maintenance would be easier:

- The cluster admin can simply add new nodes, create another data pool on them, and make buckets write to the new pool. No rebalance is needed, so the expansion is quick and has almost no observable impact on bucket users.
- Say a bucket has two data pools, Pool A and Pool B, and the nodes backing both pools need maintenance. The admin can direct writes to Pool B (reads cannot be switched since data is not moved), work on the nodes of Pool A, then switch writes back to Pool A and work on the nodes of Pool B. The maintenance can thus be carried out without heavy write IO interference, which reduces both the risk and the difficulty.

## Design

The "multiple data pool" idea is borrowed from CephFS. Any directory in CephFS can have its own data pool, and that pool can be replaced at any time. After the switch, new files written to the directory are persisted in the new pool, while old files in the previous pool remain accessible because the metadata in the metadata pool keeps the correct reference.

The main ideas for supporting multiple data pools for a bucket are:

- Reuse the existing data_pool to store the head object, which always has 0 size but keeps the manifest referring to the data parts in another pool.
- Add a new concept, the tail_pool, which stores all data except the heads.
- The data_pool of a bucket (now holding only the heads, effectively a metadata or head pool) should not be changed, but the bucket can switch between different tail_pools.

### Change in RGWZonePlacementInfo

- Add a new field data_layout_type to RGWZonePlacementInfo. The default value is 0 (call it UNIFIED), which means the current implementation. Value 1 (call it SPLITTED) enables multiple data pool support.
- Add a new field tail_pools, which is a list of pool names.
- Add a new field current_tail_pool, which is one of the pool names in tail_pools.
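For concreteness, a minimal sketch of what the extended placement info could look like is shown below. This is illustrative only and not actual RGW code: the new field names are taken from this proposal, pool references are shown as plain strings rather than the real pool type, and the existing fields are abbreviated.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch only, not actual RGW code: the new fields this proposal
// would add next to the existing pools in RGWZonePlacementInfo. Pool
// references are shown as plain strings to keep the example self-contained.

enum class DataLayoutType : uint8_t {
  UNIFIED  = 0,  // current behaviour: heads and tails share data_pool
  SPLITTED = 1,  // heads stay in data_pool, tails go to current_tail_pool
};

struct ZonePlacementInfoSketch {
  // Existing placement pools (abbreviated).
  std::string index_pool;
  std::string data_pool;        // with SPLITTED, holds only the 0-size heads
  std::string data_extra_pool;  // unchanged: in-progress multipart metadata

  // Proposed additions.
  DataLayoutType data_layout_type = DataLayoutType::UNIFIED;
  std::vector<std::string> tail_pools;  // every pool that may hold tail parts
  std::string current_tail_pool;        // pool that receives newly written tails
};
```

With something like this in place, switching where new data lands is just an update of current_tail_pool; objects written earlier keep their own user.rgw.tail_pool xattr (see below), so reads are unaffected by later switches.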
### Change in Object Head

Add a new xattr user.rgw.tail_pool, which refers to the pool where the tail parts of the object reside.

### Change in Multipart Meta Object

Add a new xattr user.rgw.tail_pool, which refers to the pool where the parts of the multipart upload reside. This value should be decided in InitMultipart and followed by all other operations against the same upload ID of the same object.

### Change in Write Operations

If a bucket's data_layout_type is SPLITTED (and it has no explicit_placement), write only the 0-size head (including user.rgw.tail_pool and all other xattrs) to data_pool and persist the tail in current_tail_pool. For efficiency, it is recommended to use a replicated pool on SSDs as the data_pool.

### Change in Read Operations

If a bucket's data_layout_type is SPLITTED (and it has no explicit_placement), read the tail parts from the pool recorded in user.rgw.tail_pool in the head.

### Change in GC

If a bucket's data_layout_type is SPLITTED (and it has no explicit_placement), the correct user.rgw.tail_pool should be recorded in the GC list as well, so that the GC thread can remove the tail parts correctly.

### Change in radosgw-admin

New commands are needed to modify and show the extended RGWZonePlacementInfo (add a new pool to tail_pools, change current_tail_pool, and so on).

Thanks,
Jeegn
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html