Hi all,

In the daily operation of our Ceph RGW clusters, we have run into some pain points with the current one-bucket-one-data-pool implementation. I think one-bucket-multiple-data-pools may help (see the detailed proposal appended below). What do you think?

https://etherpad.net/p/multiple-data-pool-support-for-a-bucket

# **Multiple Data Pool Support for a Bucket**

## Motivation

Currently, a bucket in RGW has only a single data pool (the extra data pool is just temporary storage for in-progress multipart metadata, so it is not considered here). The major pain points are:

- When the data pool runs out of capacity and we have to expand it, we either tolerate the performance penalty caused by heavy recovery IO or wait a long time for the rebalance to complete (this is especially true when the original cluster is relatively small and the expansion roughly doubles its size).
- Although the new nodes increase the storage capacity, they also reduce the average number of PGs per OSD, which may make the data distribution uneven. To address this, we either have to reweight OSDs or increase the PG count, which means yet another round of data movement.

If a bucket could have multiple data pools and switch between them, maintenance would be easier:

- The cluster admin can simply add new nodes, create another data pool on them, and make buckets write to the new pool. No rebalance is needed, so the expansion is quick and has almost no observable impact on bucket users.
- Say a bucket has two data pools, Pool A and Pool B, and the nodes backing both pools need maintenance. The admin can direct writes to Pool B (reads cannot be switched since data is not moved), work on the nodes of Pool A, then switch writes back to Pool A and work on the nodes of Pool B. The maintenance can thus be carried out without heavy write IO interference, which reduces both the risk and the difficulty.

## Design

The "multiple data pool" idea is borrowed from CephFS. Any directory in CephFS can have its own data pool, and that pool can be replaced at any time. After the switch, new files written to the directory are persisted in the new pool, while old files in the previous pool remain accessible because the metadata in the metadata pool keeps the correct reference.

The main ideas for supporting multiple data pools for a bucket are:

- Reuse the existing data_pool to store the head object, which always has 0 size but keeps the manifest referring to the data parts in another pool.
- Add a new concept, the tail_pool, which stores all data except the heads.
- The data_pool of a bucket (now holding only the heads, effectively a metadata or head pool) should not be changed, but the bucket can switch between different tail_pools.

### Change in RGWZonePlacementInfo

- Add a new field data_layout_type to RGWZonePlacementInfo. The default value is 0 (call it UNIFIED), which means the current implementation. Value 1 (call it SPLITTED) enables multiple data pool support.
- Add a new field tail_pools, which is a list of pool names.
- Add a new field current_tail_pool, which is one of the pool names in tail_pools.
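For concreteness, a minimal sketch of what the extended placement info could look like is shown below. This is illustrative only and not actual RGW code: the new field names are taken from this proposal, pool references are shown as plain strings rather than the real pool type, and the existing fields are abbreviated.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch only, not actual RGW code: the new fields this proposal
// would add next to the existing pools in RGWZonePlacementInfo. Pool
// references are shown as plain strings to keep the example self-contained.

enum class DataLayoutType : uint8_t {
  UNIFIED  = 0,  // current behaviour: heads and tails share data_pool
  SPLITTED = 1,  // heads stay in data_pool, tails go to current_tail_pool
};

struct ZonePlacementInfoSketch {
  // Existing placement pools (abbreviated).
  std::string index_pool;
  std::string data_pool;        // with SPLITTED, holds only the 0-size heads
  std::string data_extra_pool;  // unchanged: in-progress multipart metadata

  // Proposed additions.
  DataLayoutType data_layout_type = DataLayoutType::UNIFIED;
  std::vector<std::string> tail_pools;  // every pool that may hold tail parts
  std::string current_tail_pool;        // pool that receives newly written tails
};
```

With something like this in place, switching where new data lands is just an update of current_tail_pool; objects written earlier keep their own user.rgw.tail_pool xattr (see below), so reads are unaffected by later switches.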
### Change in Object Head

Add a new xattr user.rgw.tail_pool, which refers to the pool where the tail parts of the object reside.

### Change in Multipart Meta Object

Add a new xattr user.rgw.tail_pool, which refers to the pool where the parts of the multipart upload reside. This value should be decided in InitMultipart and followed by all other operations against the same upload ID of the same object.

### Change in Write Operations

If a bucket's data_layout_type is SPLITTED (and it has no explicit_placement), write only the 0-size head (including user.rgw.tail_pool and all other xattrs) to data_pool and persist the tail in current_tail_pool. For efficiency, it is recommended to use a replicated pool on SSDs as the data_pool.

### Change in Read Operations

If a bucket's data_layout_type is SPLITTED (and it has no explicit_placement), read the tail parts from the pool recorded in user.rgw.tail_pool in the head.

### Change in GC

If a bucket's data_layout_type is SPLITTED (and it has no explicit_placement), the correct user.rgw.tail_pool should be recorded in the GC list as well, so that the GC thread can remove the tail parts correctly.

### Change in radosgw-admin

New commands are needed to modify and show the extended RGWZonePlacementInfo (add a new pool to tail_pools, change current_tail_pool, and so on).

Thanks,
Jeegn
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html