Re: [RFC] CephFS dmClock QoS Scheduler

Eric Ivancich <ivancich@xxxxxxxxxx> · Mon, 26 Oct 2020 11:55:14 -0400

Hi Yongseok,
I’m guessing you know this already, but dmClock becomes mClock if rho and delta are both 0. Within the dmClock library, the ServiceTracker class, which is run on the clients, tracks the rho and delta values. But you don’t have to use a ServiceTracker. In fact, in the 3 current uses of the dmClock library in ceph master (osd, crimson osd, rgw), none of them currently use a ServiceTracker, so they’re essentially getting mClock.
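To make the reduction concrete, here is a minimal sketch of per-request tag calculation. It assumes a zero-based convention in which each tag advances by (rho + cost) or (delta + cost) divided by the configured rate, so that rho = delta = 0 gives the plain mClock increment. The names `ClientInfo`, `Tags`, and `request_tags` are hypothetical and are not the dmClock library's API.

```python
from dataclasses import dataclass


@dataclass
class ClientInfo:
    reservation: float  # minimum ops/sec, assumed > 0 in this sketch
    weight: float       # proportional share, assumed > 0
    limit: float        # maximum ops/sec, assumed > 0


@dataclass
class Tags:
    reservation: float
    proportion: float
    limit: float


def request_tags(prev, info, now, rho=0, delta=0, cost=1):
    """Compute the tags for a new request from the previous request's tags.

    Assumption: rho and delta count work done by *other* servers since the
    previous request, so both being 0 yields the single-server mClock case.
    """
    return Tags(
        reservation=max(now, prev.reservation + (rho + cost) / info.reservation),
        proportion=max(now, prev.proportion + (delta + cost) / info.weight),
        limit=max(now, prev.limit + (delta + cost) / info.limit),
    )


info = ClientInfo(reservation=4.0, weight=2.0, limit=8.0)
prev = Tags(reservation=10.0, proportion=10.0, limit=10.0)

# Without a ServiceTracker (rho = delta = 0): plain mClock spacing.
mclock = request_tags(prev, info, now=9.0)

# With feedback from other servers: tags advance further, so this server
# backs off to compensate for service the client received elsewhere.
dmclock = request_tags(prev, info, now=9.0, rho=3, delta=3)
```

With nonzero rho and delta the tags are pushed further into the future, which is how a server accounts for work done elsewhere without talking to the other servers.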

The other thing that’s worth considering is that “server” and “client” can be viewed in an abstract sense. So in the osd, for example, the mClock “clients” are not the true clients, but instead the different classes of operations.

One of the nice things about dmClock is that the “servers” do not need to communicate directly among themselves in order to provide QoS. The “clients” provide extra information to the servers that allows them to compensate for the work of the other servers.

Eric

--
J. Eric Ivancich
he / him / his
Red Hat Storage
Ann Arbor, Michigan, USA

On Oct 26, 2020, at 7:44 AM, Yongseok Oh <yongseok.oh@xxxxxxxxxxxx> wrote:

Hi Josh and maintainers,

We have confirmed again that the dmClock algorithm does not address multiple clients running on the same subvolume. In dmClock, the client tracker monitors how much work has been performed on each server and acts as a sort of scheduler by sending rho/delta values to the servers. Our simple idea was to divide the total QoS IOPS among the clients within a subvolume, either evenly or based on workload dynamics. For example, assume that 1000 IOPS is allocated to a subvolume and that this value is periodically shared among multiple clients. If 100 clients simultaneously issue requests on the same volume, each client can consume 10 IOPS under the mClock scheduler.

Along these lines, we can consider an approach based on client workload metrics, but it is not easy to ensure QoS stability because client workloads change dynamically and the period over which metrics are collected affects allocation accuracy. Alternatively, per-client-session QoS could be considered, but it is difficult to predict and limit the number of sessions in a subvolume.
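As a rough illustration of the two ways of dividing a subvolume's budget discussed above, the following sketch splits a total IOPS value either evenly or in proportion to recently observed per-client demand. The function names and the demand figures are hypothetical, and the sketch ignores the stability and sampling-period problems just mentioned.

```python
def split_evenly(total_iops, num_clients):
    """Give every client the same share of the subvolume's IOPS budget."""
    if num_clients <= 0:
        raise ValueError("num_clients must be positive")
    return total_iops / num_clients


def split_by_demand(total_iops, demand):
    """Share the budget in proportion to each client's observed IOPS.

    `demand` maps a client name to its recently measured IOPS. If no
    client has issued any requests, fall back to an even split.
    """
    if not demand:
        return {}
    total_demand = sum(demand.values())
    if total_demand == 0:
        share = total_iops / len(demand)
        return {client: share for client in demand}
    return {
        client: total_iops * observed / total_demand
        for client, observed in demand.items()
    }


# 1000 IOPS shared by 100 clients: 10 IOPS each.
even_share = split_evenly(1000, 100)

# Hypothetical measurements: client "a" is three times busier than "b".
weighted = split_by_demand(1000, {"a": 300, "b": 100})
```

The demand-based split has to be recomputed every measurement period, which is where the accuracy concern comes from: the allocation always lags the workload by one period.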

For this reason, instead of applying dmClock, the mClock scheduler can be considered a good solution to the noisy neighbor problem. The expected per-subvolume QoS IOPS can be calculated as follows. (This assumes MDS and OSD requests are almost evenly distributed; reservation and weight values are omitted for brevity.)

[MDS]
- Per MDS Limit IOPS * # of MDSs (Per MDS Limit IOPS * 1 when ephemeral random pinning is configured.)
[OSD]
- Per OSD Limit IOPS * # of OSDs

Of course, depending on the workload or server conditions, the above equations may not hold exactly. However, the QoS scheduler can be implemented without client or manager modifications.
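The two estimates above can be written out as a short sketch. The function names and the numbers in the example are hypothetical; as stated, the result is only an upper-bound estimate that assumes requests are spread almost evenly across servers.

```python
def expected_mds_limit(per_mds_limit_iops, num_mds, ephemeral_random_pinning=False):
    """Expected per-subvolume metadata IOPS limit across the MDS cluster.

    With ephemeral random pinning the multiplier is 1 rather than the
    number of MDSs, following the estimate given in this message.
    """
    effective_mds = 1 if ephemeral_random_pinning else num_mds
    return per_mds_limit_iops * effective_mds


def expected_osd_limit(per_osd_limit_iops, num_osds):
    """Expected per-subvolume data IOPS limit across all OSDs."""
    return per_osd_limit_iops * num_osds


# Hypothetical cluster: 3 MDSs limited to 500 IOPS each,
# 12 OSDs limited to 200 IOPS each.
mds_total = expected_mds_limit(500, 3)
mds_pinned = expected_mds_limit(500, 3, ephemeral_random_pinning=True)
osd_total = expected_osd_limit(200, 12)
```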

Are there any other comments from a CephFS point of view?

Finally, we are going to implement a prototype in which the mClock scheduler is applied to the MDS, and then open a pull request to share and discuss it.

Thanks

Yongseok
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
