Re: [RFC] CephFS dmClock QoS Scheduler

Hi Yongseok, apologies for the delayed response.

On 10/1/20 7:49 AM, Yongseok Oh wrote:
Hi Josh and Greg,

Let me try to recall the major issues related to the dmClock QoS scheduler from the previous PRs and your helpful comments.

[MDS]
- NFS Ganesha. Exploiting NFS Ganesha is one possible solution for client QoS. However, I think introducing a new layer could limit overall performance or scalability, and it also cannot cover the CephFS native clients.

- Multiple different clients on the same subvolume. In this case, the subvolume is shared by multiple clients, each of which maintains its own QoS tracker, resulting in inappropriate scheduling. In other words, another layer (or process) must monitor the global QoS state across clients. For now we can devise a simple client group approach, where each client group consists of a number of clients and the allocated IOPS are divided equally or according to a policy. For instance, if there are four clients in a group running on the same volume and 1000 IOPS are given per group, each client can be satisfied with 250 IOPS (see the sketch below). Another, more complex, approach is that a client perf monitor classifies clients' workloads and dynamically tunes and reallocates their IOPS based on those workloads.

This seems like a good approach.
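
As a concrete illustration of the equal-split policy described above (my own sketch, not code from the PRs; the QosParams struct and the function names are hypothetical stand-ins):

#include <cstddef>
#include <iostream>

struct QosParams {
  double reservation;  // minimum IOPS guaranteed to the group
  double weight;       // proportional share
  double limit;        // maximum IOPS allowed for the group
};

// Divide the group's allocation evenly across its active clients.
QosParams per_client_share(const QosParams& group, std::size_t active_clients) {
  if (active_clients == 0) {
    return group;  // nothing to divide yet
  }
  double n = static_cast<double>(active_clients);
  return QosParams{group.reservation / n, group.weight / n, group.limit / n};
}

int main() {
  // Example from the thread: 1000 IOPS per group, four clients -> 250 each.
  QosParams group{0.0, 1.0, 1000.0};
  QosParams client = per_client_share(group, 4);
  std::cout << "per-client limit: " << client.limit << " IOPS\n";  // prints 250
}

A policy-based split would only replace the division above, e.g. weighting each client's share by its recent demand as reported by the perf monitor.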

[OSD]
- Multiple OP queue shards per OSD. Previously, they mentioned [1] that it is difficult to provide the right QoS as the number of shards increases, because requests are distributed across shards and each shard sees fewer in-flight requests. To overcome this, they proposed two solutions. One is simply setting the number of shards to '1'. The other is an Outstanding I/O (OIO) throttler that gathers as many requests as possible by inserting a short delay. Since clients cannot distinguish between shards within an OSD, they came up with a shard identifier alongside the OSD ID [2]. For background operations, normalized rho/delta values are calculated based on their average numbers [3].
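
To make the shard-aware tracking in [2] and the rho/delta normalization in [3] a bit more concrete, here is a rough sketch (my illustration only; the types and field names are placeholders, not the PRs' actual code):

#include <cstdint>
#include <map>
#include <tuple>

// Key the client-side dmClock tracker by (OSD id, shard id) rather than by
// OSD id alone, so each op queue shard gets its own accounting.
struct ShardServerId {
  int32_t osd_id;
  uint32_t shard_id;
  bool operator<(const ShardServerId& o) const {
    return std::tie(osd_id, shard_id) < std::tie(o.osd_id, o.shard_id);
  }
};

struct TrackerState {
  uint64_t rho;    // reservation-phase completions since the last request
  uint64_t delta;  // total completions since the last request
};

using ServiceTrackerMap = std::map<ShardServerId, TrackerState>;

// Hypothetical normalization in the spirit of [3]: spread the accumulated
// counts over the number of shards so that no single shard is over-charged.
TrackerState normalize(const TrackerState& s, unsigned num_shards) {
  if (num_shards == 0) {
    return s;
  }
  return TrackerState{s.rho / num_shards, s.delta / num_shards};
}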

Setting shards to 1 is simpler, and has shown the same performance when
we keep the total number of threads constant (i.e. 1 shard x 16 threads,
instead of today's default of 8 x 2 on SSD). We'll propose changing the
defaults to reflect this before Pacific.
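
For reference, the single-shard setup we tested corresponds to something like the following on SSD (option names from memory, so please double-check them against the OSD config reference):

[osd]
# default today is 8 shards x 2 threads per shard on SSD
osd_op_num_shards_ssd = 1
osd_op_num_threads_per_shard_ssd = 16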

[Compatibility]
- One of the feature bits needs to be allotted to QoS for compatibility. Since the bits are a very limited resource, further discussion and confirmation by the maintainers are required [4].

We may be able to piggyback on the SERVER_PACIFIC feature bit to avoid
consuming another one. Ilya, any concern about that from the kernel
client?
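
To illustrate the kind of gating I have in mind (a sketch only; the feature constant and the qos_supported() helper are placeholders, not the messenger's actual API):

#include <cstdint>

// Placeholder bit standing in for whichever feature bit QoS ends up using
// (e.g. piggybacking on the Pacific server feature).
constexpr uint64_t FEATURE_QOS_PLACEHOLDER = 1ULL << 60;

bool qos_supported(uint64_t peer_features) {
  return (peer_features & FEATURE_QOS_PLACEHOLDER) != 0;
}

// Callers would attach the dmClock fields (reservation, weight, limit,
// rho, delta) to a request only when qos_supported() returns true for the
// peer, and fall back to the legacy encoding otherwise.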

[References]
[1] http://bos.itdks.com/f54f1c93811d419398edb8b8c9cb35d0.pdf
[2] https://github.com/ceph/ceph/pull/16369
[3] https://github.com/ceph/ceph/pull/18280
[4] https://github.com/ceph/ceph/pull/17450
[5] https://github.com/ceph/ceph/pull/20235

I think rebasing PR [5] onto master is a high priority, so we can recover the latest state and check feasibility.

Agreed.

Regards,
Josh
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



