Re: [RFC] CephFS dmClock QoS Scheduler

Hello Yongseok,

That is a very exciting project! We are also interested in
improving QoS. For Pacific, we've been focusing on OSD-side
mclock and balancing background work vs clients.

Sridhar has been doing extensive testing and has fixed a few
issues with the op scheduler integration and the dmclock
library, e.g.:

https://github.com/ceph/ceph/pull/37031
https://github.com/ceph/ceph/pull/37431

It is looking quite promising; we think osd_op_queue =
mclock_scheduler will be in good shape for Pacific.
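
For anyone who wants to try it, the scheduler can be selected
with something like the following (a sketch using the ceph
config CLI; the change likely requires an OSD restart to take
effect):

ceph config set osd osd_op_queue mclock_scheduler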

There are a couple of outstanding settings changes needed to
make this work:

1) bluestore throttling

If bluestore is not throttled enough, the OSD op queue stays
empty and mclock has no chance to prioritize requests.

On our NVMe test hardware, setting bluestore_throttle_bytes and
bluestore_throttle_deferred_bytes to 128K allows us to get
near-full throughput with the fewest ops in flight at the
bluestore level.
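
For example (assuming 128K here means 131072 bytes; these
values will likely need tuning for other device classes):

ceph config set osd bluestore_throttle_bytes 131072
ceph config set osd bluestore_throttle_deferred_bytes 131072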

2) osd op queue sharding

The mclock algorithm works well with only a single
queue (osd_op_num_shards = 1). Further testing determined that
keeping the same total amount of parallelism
(osd_op_num_threads_per_shard = 16) allows us to get the same
performance as the default of 8 shards and 2 threads per shard.
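
Concretely, that corresponds to something like the following
(a sketch; these options are read at OSD startup, so a restart
is needed for them to take effect):

ceph config set osd osd_op_num_shards 1
ceph config set osd osd_op_num_threads_per_shard 16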

Extending this to the client side, as in
https://github.com/ceph/ceph/pull/20235, would be great. There
may be some challenges left there in terms of accounting for
in-flight I/Os. Eric, do you recall whether you had found a
solution for that?

CephFS is challenging with that model: if multiple clients are
accessing the same subvolume and are meant to be limited in
aggregate, their independent dmClock trackers will not know
about in-flight I/O from the other clients. Have you considered
how to resolve this?

Patrick, Greg, any other comments from a CephFS point of view?

Josh

On 9/25/20 3:42 AM, Yongseok Oh wrote:
Hi Ceph maintainers and developers,

The objective of this RFC is to discuss our work on dmClock-based client QoS management for CephFS.

Our group at LINE maintains Ceph storage clusters for RGW, RBD, and CephFS to internally support OpenStack- and K8s-based private cloud environments for various applications and platforms, including LINE messenger. We have seen that the RGW and RBD services can provide consistent performance to multiple active users, since RGW employs the dmClock QoS scheduler for S3 clients and hypervisors internally utilize an I/O throttler for VM block storage clients. Unfortunately, unlike RGW and RBD, CephFS clients can directly issue metadata requests to MDSs and file data requests to OSDs as they want. This situation occasionally (or frequently) happens, and other clients' performance may be degraded by the noisy neighbor. In the end, consistent performance cannot be guaranteed in our environment. From this observation and motivation, we are now considering a client QoS scheduler using the dmClock library for CephFS.

A few things about how to realize the QoS scheduler.

- Per-subvolume QoS management. IOPS resources are shared only among the clients that mount the same root directory. QoS parameters can be easily configured through extended attributes (similar to quotas). Each dmClock scheduler can manage clients' requests using client session information.
- MDS QoS management. Client metadata requests such as create, lookup, etc. are managed by a dmClock scheduler placed between the dispatcher and the main request handler (e.g., Server::handle_client_request()). We have observed that two active MDSs provide approximately 20K IOPS. As this capacity is sometimes scarce when there are many clients, QoS management is needed for the MDS.
- OSD QoS management. We would like to reopen and improve the previous work available at https://github.com/ceph/ceph/pull/20235.
- Client QoS management. Each client maintains a dmClock tracker to keep track of both rho and delta, which are packed into client request messages.

In the case of the CLI, QoS parameters are configured using extended attributes on each subvolume directory. Specifically, separate QoS configurations are considered for MDSs and OSDs.

setfattr -n ceph.dmclock.mds_reservation -v 200 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.mds_weight -v 500 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.mds_limit -v 1000 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.osd_reservation -v 500 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.osd_weight -v 1000 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
setfattr -n ceph.dmclock.osd_limit -v 2000 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
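
For verification, the values set above could be read back with the standard getfattr tool, for example:

getfattr -n ceph.dmclock.osd_limit /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55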

Our QoS work kicked off last month. Our first step is to go over the prior work and the dmClock algorithm/library. We are now actively checking the feasibility of our idea with some modifications to the MDS and ceph-fuse. Our development is planned as follows.

- The dmClock scheduler will be integrated into the MDS and ceph-fuse by December 2020.
- The dmClock scheduler will be incorporated into the OSD in the first half of next year.

Does the community have any plans to develop per-client QoS management? Are there any other issues related to our QoS work? We look forward to hearing your valuable comments and feedback at this early stage.

Thanks

Yongseok Oh