Re: unified qos model

Hi, Sage,

I have left some comments inline below.

2018-03-16 6:47 GMT+09:00 Sage Weil <sweil@xxxxxxxxxx>:
> I've been thinking a bit about how the three [d]mclock modes (client,
> pool, op type) can be combined.  We'd talked before about a
> hierarchical mode of some sort.  Just thinking a bit about how I'd
> want/expect the various reservations and limits to interact, though, I'm
> not sure how well that will work.
>
> For example, let's look at just the op type vs client modes.  On the op
> front, I would expect 2 categories of work: client work and background
> work (scrub, recovery, snap trimming, etc.).  And I would expect the
> background "knob" to be something like "max of 20% background work" (or
> rather, "reduce overall client throughput by no more than 20%").  I would
> also expect/want something like "reserve at least 10% for background work
> to make forward progress".

As discussed in your comments and in
https://github.com/ceph/ceph/pull/18280, we agree that it should be
structured as follows (a rough sketch follows the list below).

Primary level: op type vs. client modes (e.g. 30% vs. 70%).
Second level:
  - op type: scrub, recovery, snap trimming, etc., within the 30%.
  - client modes: client1 (or pool1), client2, client3, ..., within the 70%.
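
To make the shape concrete, here is a minimal sketch (hypothetical
types, not the actual dmclock API) of how such a two-level share tree
could be described; the names and weights are just the example
numbers above:

#include <string>
#include <vector>

// One node of the QoS hierarchy: the root splits capacity between the two
// primary classes, and each class splits its share among its members.
struct QosNode {
  std::string name;
  double weight;                  // share relative to siblings (0.3 vs 0.7 at the top)
  std::vector<QosNode> children;  // second level: op types or clients/pools
};

QosNode example_tree() {
  return QosNode{"osd", 1.0, {
    {"background", 0.3, {{"scrub", 0.3, {}}, {"recovery", 0.5, {}}, {"snaptrim", 0.2, {}}}},
    {"client", 0.7, {{"client1", 0.4, {}}, {"client2", 0.3, {}}, {"client3", 0.3, {}}}}
  }};
}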

As noted above, we discussed in pull request 18280 that we need to
build a double (two-level) mclock queue to get this hierarchical form.
For exactly this kind of need, the author of mClock has already
published a follow-up paper, "hClock: hierarchical QoS for packet
scheduling in a hypervisor"
(https://www.yumpu.com/en/document/view/51631478/hclock-hierarchical-qos-for-packet-scheduling-in-a-eurosys-2013/3).

As Eric said earlier, if we had mclock in a hierarchical form, I think
the above configuration would be possible. To implement this, we will
need to discuss the two approaches below; a rough sketch of the second
one follows the list.

1. Implement the two-level hierarchical structure within the dmclock
library itself.
2. As in PR 18280, use two layers (external & internal) of dmclock
queues in Ceph.
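
As a rough illustration of option 2 (hypothetical classes, not the
actual Ceph/dmclock code), the outer layer could pick a class
(background vs. client) by weight, and the inner layer would then pick
the next request within that class; in the real implementation the
inner layer would be a dmclock queue keyed by client (or by op type
for background work):

#include <deque>
#include <string>
#include <utility>

// Hypothetical request type; in Ceph this would be the op queue item.
struct Request { std::string cls; std::string desc; };

// Outer layer: weighted selection between the two primary classes.  The inner
// layer is shown as a plain FIFO per class; the real version would use a
// dmclock queue (reservation/weight/limit per client or per op type) instead.
class TwoLevelQueue {
  std::deque<Request> background_q, client_q;
  double bg_weight, cl_weight;          // e.g. 0.3 and 0.7
  double bg_credit = 0, cl_credit = 0;  // weighted round-robin credits
public:
  TwoLevelQueue(double bg_w, double cl_w) : bg_weight(bg_w), cl_weight(cl_w) {}
  void enqueue(Request r) {
    (r.cls == "background" ? background_q : client_q).push_back(std::move(r));
  }
  bool empty() const { return background_q.empty() && client_q.empty(); }
  Request dequeue() {  // precondition: !empty()
    bg_credit += bg_weight;
    cl_credit += cl_weight;
    // serve the class with the larger accumulated credit, falling back to the
    // other class when one of the inner queues is empty
    bool pick_bg = (bg_credit >= cl_credit && !background_q.empty()) || client_q.empty();
    auto& q = pick_bg ? background_q : client_q;
    (pick_bg ? bg_credit : cl_credit) -= 1.0;
    Request r = q.front();
    q.pop_front();
    return r;
  }
};

With both classes backlogged, weights of 0.3/0.7 give roughly a 30/70
split of dequeues over time.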

>
> It's not clear to me that either of those limits "fit" into the mclock
> model of a reservation or priority; they're a bit of a special case.
>
> Similarly, in the pool policy case, I'm struggling to imagine how that
> would be configured.  It doesn't make sense to have a "reservation" for
> pools since we don't have a global view.  We could have a priority
> relative to other pools (there'd need to be a 'default' priority for pools
> that don't have it set).  Or, we could have a similar percentage-style
> reservation like the above.  And these would probably be subservient to
> the client reservations (i.e., come out of the leftover by-priority
> phase).
>
As you said, hierarchy-style and percentage-style QoS settings are not
immediately available in mClock.
The hierarchy style would have to be implemented as a multi-level
mClock, as described above.
In my opinion, the percentage-style QoS configuration would be
possible if we did the total-throughput estimation you mentioned.

For example, suppose we assume a per-OSD throughput depending on the
device type (SSD or HDD), and recompute the total throughput every
time an OSD is added.
If we assume 5 IOPS per SSD-type OSD, a 16-OSD Ceph cluster has about
80 IOPS of total throughput.
If 10% is allocated to background I/O, 8 IOPS are internally set
aside for background I/O in dmClock.
This kind of per-OSD throughput assumption can also be seen in the
QoS configuration of NetApp's SolidFire storage system
(https://www.slideshare.net/NetApp/gain-storage-control-with-sioc-and-take-performance-control-with-qos-from-solidfire-76354056#14).
(In addition, the SolidFire QoS configuration automatically sets QoS
in terms of IOPS per block-size configuration.)
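
A tiny worked example of this percentage-to-reservation translation
(the numbers are just the assumed figures from above, not
measurements):

#include <iostream>

int main() {
  const double iops_per_osd = 5.0;   // assumed per-OSD estimate (SSD case above)
  const int num_osds = 16;
  const double total_iops = iops_per_osd * num_osds;                 // ~80 IOPS cluster-wide
  const double background_fraction = 0.10;                           // "10% for background"
  const double background_iops = total_iops * background_fraction;   // 8 IOPS
  std::cout << "estimated total: " << total_iops << " IOPS, "
            << "background reservation: " << background_iops << " IOPS\n";
  return 0;
}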

> The percentage-style configurables are tricky, though, because we
> (probably?) need to have a model of what 100% is in order to make them
> work?  If we estimate the total throughput then we could model the
> background work as a client with a min reservation and a max, I guess...
>
> Anyway, the whole thing makes me think that we might end up a custom
> scheduler that is an ad hoc combination of mclock and these secondary
> constraints...
>
> Has anyone had any bright ideas here?
>
> sage


