On Mon, Mar 19, 2018 at 2:09 PM, Byung Su Park <pbs1108@xxxxxxxxx> wrote:
> Hi, Sage,
>
> I have left some comments inline below.
>
> 2018-03-16 6:47 GMT+09:00 Sage Weil <sweil@xxxxxxxxxx>:
>> I've been thinking a bit about how the three [d]mclock modes (client,
>> pool, op type) can be combined. We'd talked before about a
>> hierarchical mode of some sort. Just thinking a bit about how I'd
>> want/expect the various reservations and limits to interact, though,
>> I'm not sure how well that will work.
>>
>> For example, let's look at just the op type vs client modes. On the op
>> front, I would expect 2 categories of work: client work and background
>> work (scrub, recovery, snap trimming, etc.). And I would expect the
>> background "knob" to be something like "max of 20% background work"
>> (or rather, "reduce overall client throughput by no more than 20%").
>> I would also expect/want something like "reserve at least 10% for
>> background work to make forward progress".
>
> As discussed in your comments and at
> https://github.com/ceph/ceph/pull/18280, we agree that it should be
> structured as follows:
>
> Primary level: op type vs. client modes (e.g. 30% vs. 70%).
> Second level: op type: scrub, recovery, snap trimming, etc.,
>               within the 30%;
>               client modes: client1 (or pool1), client2,
>               client3, ..., within the 70%.
>
> As above, we discussed in PR 18280 that we need to construct a double
> mclock queue to get it into hierarchical form.
> Also, motivated by this need, the authors of mClock have already
> published a paper, "hClock: hierarchical QoS for packet scheduling in
> a hypervisor"
> (https://www.yumpu.com/en/document/view/51631478/hclock-hierarchical-qos-for-packet-scheduling-in-a-eurosys-2013/3).
>
> As Eric said earlier, if we had mclock in a hierarchical form, I think
> the above configuration would be possible.
> To implement this, we will need to discuss the implementation points
> below:
>
> 1.
> Implement two hierarchical structures within the dmclock library.
> 2. As in the 18280 PR, use two layers (external & internal) of dmclock
> queues in Ceph.

This basically makes sense on its own, but how do we deal with
client-demanded recovery in a multi-level system? That is, if the
client wants object foo, and foo needs to be recovered from another
OSD, how does it move through the mclock hierarchies?

>> It's not clear to me that either of those limits "fit" into the
>> mclock model of a reservation or priority; they're a bit of a special
>> case.
>>
>> Similarly, in the pool policy case, I'm struggling to imagine how
>> that would be configured. It doesn't make sense to have a
>> "reservation" for pools since we don't have a global view. We could
>> have a priority relative to other pools (there'd need to be a
>> 'default' priority for pools that don't have it set). Or, we could
>> have a similar percentage-style reservation like the above. And these
>> would probably be subservient to the client reservations (i.e., come
>> out of the leftover by-priority phase).
>
> As you said, hierarchy-style and percentage-style QoS settings are not
> immediately available in mClock.
> The hierarchy style would have to be implemented as a multi-level
> mClock, as described above.
> In my opinion, the percentage-style QoS configuration would be
> possible if we did the total throughput estimation you described.
>
> For example, suppose we estimate throughput per OSD according to
> whether it is an SSD or HDD, and recalculate the total throughput
> every time an OSD is added.
> If we assume 5 IOPS per SSD-type OSD, a 16-OSD Ceph cluster has a
> total throughput of about 80 IOPS.
> When 10% is allocated to background I/O, 8 IOPS are internally set
> aside for background I/O in dmClock.
> This approach of assuming a total throughput per OSD can be seen in
> the QoS configuration of NetApp's SolidFire storage system.
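As a rough illustration, the back-of-the-envelope estimate above could be sketched as follows. This is hypothetical helper code, not dmclock or Ceph API; the 5 IOPS/SSD figure is the illustrative number from this thread, and the HDD figure is an equally made-up placeholder:

```python
# Hypothetical sketch of the percentage-style capacity estimate
# described above -- not actual dmclock/Ceph code. The per-device
# IOPS figures are illustrative placeholders from this thread.

DEVICE_IOPS = {"ssd": 5, "hdd": 2}  # assumed throughput per OSD type


def cluster_capacity(osd_types):
    """Total estimated IOPS, recomputed whenever an OSD is added or removed."""
    return sum(DEVICE_IOPS[t] for t in osd_types)


def background_reservation(osd_types, background_fraction=0.10):
    """IOPS set aside internally for background work (scrub, recovery, ...)."""
    return cluster_capacity(osd_types) * background_fraction


# 16 SSD OSDs -> 80 IOPS total; reserving 10% leaves 8 IOPS for background I/O.
total = cluster_capacity(["ssd"] * 16)
reserved = background_reservation(["ssd"] * 16)
```

The open question this sketch glosses over is where the per-device IOPS figures come from in the first place.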
> (https://www.slideshare.net/NetApp/gain-storage-control-with-sioc-and-take-performance-control-with-qos-from-solidfire-76354056#14)
> (In addition, SolidFire's QoS configuration automatically sets QoS
> through per-block IOPS settings.)

This also makes sense, but do we have any idea how to properly count up
cluster capacity and model it? I'm pretty sold on building a cost
function based on the bandwidth and number of IOs required by each op,
but measuring that curve on each OSD and converting the individual
values into a total cluster capacity eludes me.
-Greg
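For what it's worth, the cost-function idea above could be sketched as a simple linear model. Both the linear form and the device parameters here are pure assumptions for illustration; the hard part raised above, actually measuring these curves per OSD, is untouched:

```python
# Hypothetical linear op-cost model combining a bandwidth term and a
# per-IO term, as a sketch of the cost-function idea above. The device
# parameters would have to be measured per OSD, which is the open
# problem -- these defaults are made-up placeholders.

def op_cost_seconds(bytes_moved, num_ios,
                    bytes_per_sec=100e6,   # assumed device bandwidth
                    ios_per_sec=200.0):    # assumed device IOPS ceiling
    """Estimated device time (in seconds) consumed by one op."""
    return bytes_moved / bytes_per_sec + num_ios / ios_per_sec


# Example: a 4 MiB read issued as a single IO.
cost = op_cost_seconds(4 * 1024 * 1024, 1)
```

Summing such per-op costs against a measured per-OSD time budget would be one way to express percentage-style reservations without a fixed IOPS figure.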