On Mon, Sep 28, 2020 at 9:26 PM Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>
> Hello Yongseok,
>
> That is a very exciting project! We are also interested in
> improving QoS. For Pacific, we've been focusing on OSD-side
> mclock and balancing background work vs clients.
>
> Sridhar is doing extensive testing and fixed a few issues with
> the op scheduler integration and dmclock library, e.g.:
>
> https://github.com/ceph/ceph/pull/37031
> https://github.com/ceph/ceph/pull/37431
>
> It is looking quite promising, we think osd_op_queue =
> mclock_scheduler will be in good shape for Pacific.
>
> There are a couple outstanding settings changes to make this
> work:
>
> 1) bluestore throttling
>
> If bluestore is not throttled enough, the osd op queue is empty
> and mclock has no chance to prioritize requests.
>
> On our nvme test hardware bluestore_throttle_bytes and
> bluestore_throttle_deferred_bytes set to 128K allows us to get
> near full throughput with the fewest ops in flight at the
> bluestore level.
>
> 2) osd op queue sharding
>
> The mclock algorithm works well with only a single
> queue (osd_op_num_shards = 1). Further testing determined that
> keeping the same total amount of parallelism
> (osd_op_num_threads_per_shard = 16) allows us to get the same
> performance as the default of 8 shards and 2 threads per shard.
>
> Extending this to the client side like
> https://github.com/ceph/ceph/pull/20235 would be great. There may
> be some challenges left there, in terms of accounting for
> in-flight I/Os. Eric, do you recall if you had found a solution
> for that?
>
> CephFS is challenging with that model - if multiple clients are
> accessing the same subvolume, and they are meant to be limited in
> aggregate, their independent dmClock trackers will not know about
> in-flight I/O from other clients. Have you considered how to
> resolve this?
>
> Patrick, Greg, any other comments from a CephFS point of view?

That's basically what I've got. CephFS is tricky because we naively
expect the clients to be scaling out as well; most single-mounter
scenarios just end up on rbd because it's simpler.

We are starting to talk about maybe being able to do QoS in the
NFS-Ganesha frontend for NFS users (apparently it recently added some
tooling for this), but even then any real deployment will have
multiple active servers and I'm not sure if we can "typically" route
to a single one to track the global state.

I could envision a system where the clients periodically share their
activity with the MDS, and the MDS partitions their total QoS limits
between any clients working on the same data based on models of their
changing "heat", but it would still be really complicated and be
prone to a lot of over-limiting or burstiness. :/
-Greg
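
For reference, the OSD-side setup Josh describes above maps to
something like the following (a rough sketch using `ceph config set`;
the 128K throttle values are the ones from the NVMe testing quoted
above, not general recommendations, and the queue and sharding
options only take effect after an OSD restart):

  # use the mclock op scheduler discussed above
  ceph config set osd osd_op_queue mclock_scheduler

  # throttle bluestore so the op queue stays populated and mclock
  # actually has requests to prioritize
  ceph config set osd bluestore_throttle_bytes 131072            # 128K
  ceph config set osd bluestore_throttle_deferred_bytes 131072   # 128K

  # single op queue shard, same total parallelism as the default 8 x 2
  ceph config set osd osd_op_num_shards 1
  ceph config set osd osd_op_num_threads_per_shard 16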
>
> Josh
>
> On 9/25/20 3:42 AM, Yongseok Oh wrote:
> > Hi Ceph maintainers and developers,
> >
> > The objective of this is to discuss our work on dmClock-based client QoS management for CephFS.
> >
> > Our group at LINE maintains Ceph storage clusters such as RGW, RBD, and CephFS to internally support an OpenStack- and K8S-based private cloud environment for various applications and platforms, including LINE messenger. We have seen that the RGW and RBD services can provide consistent performance to multiple active users, since RGW employs the dmClock QoS scheduler for S3 clients and hypervisors internally use an I/O throttler for VM block storage clients. Unfortunately, unlike RGW and RBD, CephFS clients can directly issue metadata requests to MDSs and file data requests to OSDs as they want. This situation occasionally (or frequently) happens, and other clients' performance may be degraded by the noisy neighbor. In the end, consistent performance cannot be guaranteed in our environment. From this observation and motivation, we are now considering a client QoS scheduler using the dmClock library for CephFS.
> >
> > A few things about how we plan to realize the QoS scheduler:
> >
> > - Per-subvolume QoS management. IOPS resources are only shared among the clients that mount the same root directory. QoS parameters can be easily configured through extended attributes (similar to quota). Each dmClock scheduler can manage clients' requests using client session information.
> > - MDS QoS management. Client metadata requests such as create, lookup, etc. are managed by a dmClock scheduler placed between the dispatcher and the main request handler (e.g., Server::handle_client_request()). We have observed that two active MDSs provide approximately 20 KIOPS. As performance capacity is sometimes scarce for lots of clients, QoS management is needed for the MDS.
> > - OSD QoS management. We would like to reopen and improve the previous work available at https://github.com/ceph/ceph/pull/20235.
> > - Client QoS management. Each client manages a dmClock tracker to keep track of both rho and delta, which are packed into client request messages.
> >
> > In the case of the CLI, QoS parameters are configured using extended attributes on each subvolume directory. Specifically, separate QoS configurations are considered for MDSs and OSDs.
> >
> > setfattr -n ceph.dmclock.mds_reservation -v 200 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
> > setfattr -n ceph.dmclock.mds_weight -v 500 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
> > setfattr -n ceph.dmclock.mds_limit -v 1000 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
> >
> > setfattr -n ceph.dmclock.osd_reservation -v 500 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
> > setfattr -n ceph.dmclock.osd_weight -v 1000 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
> > setfattr -n ceph.dmclock.osd_limit -v 2000 /volumes/_nogroup/fdffc126-7961-4bbc-add2-2675b9e35a55
> >
> > Our QoS work kicked off last month. Our first step is to go over the prior work and the dmClock algorithm/library. We are now actively checking the feasibility of our idea with some modifications to the MDS and ceph-fuse. Our development is planned as follows:
> >
> > - The dmClock scheduler will be integrated into the MDS and ceph-fuse by December 2020.
> > - The dmClock scheduler will be incorporated into the OSD in the first half of next year.
> >
> > Does the community have any plan to develop per-client QoS management? Are there any other issues related to our QoS work? We are looking forward to hearing your valuable comments and feedback at this early stage.
> >
> > Thanks
> >
> > Yongseok Oh
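
If the interface lands as proposed above, driving it from a client
that has the filesystem mounted might look roughly like this (a
sketch: "cephfs", "qosvol", and $MNT are placeholder names, and the
ceph.dmclock.* attributes are the proposal in this thread, not an
existing interface):

  # resolve the subvolume's path under the volume root
  SUBVOL=$(ceph fs subvolume getpath cephfs qosvol)   # e.g. /volumes/_nogroup/fdffc126-...

  # apply the proposed MDS-side QoS parameters, quota-style
  setfattr -n ceph.dmclock.mds_reservation -v 200 "$MNT$SUBVOL"
  setfattr -n ceph.dmclock.mds_limit -v 1000 "$MNT$SUBVOL"

  # read a value back to confirm it took effect
  getfattr -n ceph.dmclock.mds_limit "$MNT$SUBVOL"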