On Mon, Mar 19, 2018 at 2:09 PM, Byung Su Park <pbs1108@xxxxxxxxx> wrote:
> Hi, Sage,
>
> I have left some comments inline below.
>
> 2018-03-16 6:47 GMT+09:00 Sage Weil <sweil@xxxxxxxxxx>:
>> I've been thinking a bit about how the three [d]mclock modes (client,
>> pool, op type) can be combined. We'd talked before about a
>> hierarchical mode of some sort. Just thinking a bit about how I'd
>> want/expect the various reservations and limits to interact, though,
>> I'm not sure how well that will work.
>>
>> For example, let's look at just the op type vs client modes. On the op
>> front, I would expect 2 categories of work: client work and background
>> work (scrub, recovery, snap trimming, etc.). And I would expect the
>> background "knob" to be something like "max of 20% background work"
>> (or rather, "reduce overall client throughput by no more than 20%").
>> I would also expect/want something like "reserve at least 10% for
>> background work to make forward progress".
>
> As discussed in your comments and at
> https://github.com/ceph/ceph/pull/18280, we agree that it should be
> structured as follows:
>
> Primary level: op type vs. client modes (e.g. 30% vs. 70%).
> Second level: op type: scrub, recovery, snap trimming, etc.,
>               within the 30%;
>               client modes: client1 (or pool1), client2,
>               client3, ..., within the 70%.
>
> As above, we discussed in PR 18280 that we need to construct a double
> mclock queue to get it into hierarchical form.
> Also, motivated by this need, the authors of mClock have already
> published a paper, "hClock: hierarchical QoS for packet scheduling in
> a hypervisor"
> (https://www.yumpu.com/en/document/view/51631478/hclock-hierarchical-qos-for-packet-scheduling-in-a-eurosys-2013/3).
>
> As Eric said earlier, if we had mclock in a hierarchical form, I think
> the above configuration would be possible.
> To implement this, we will need to discuss the implementation points
> below:
>
> 1.
> Implement two hierarchical structures within the dmclock library.
> 2. As in the 18280 PR, use two layers (external & internal) of dmclock
> queues in Ceph.

This basically makes sense on its own, but how do we deal with
client-demanded recovery in a multi-level system? That is, if the
client wants object foo, and foo needs to be recovered from another
OSD, how does it move through the mclock hierarchies?

>> It's not clear to me that either of those limits "fit" into the
>> mclock model of a reservation or priority; they're a bit of a special
>> case.
>>
>> Similarly, in the pool policy case, I'm struggling to imagine how
>> that would be configured. It doesn't make sense to have a
>> "reservation" for pools since we don't have a global view. We could
>> have a priority relative to other pools (there'd need to be a
>> 'default' priority for pools that don't have it set). Or, we could
>> have a similar percentage-style reservation like the above. And these
>> would probably be subservient to the client reservations (i.e., come
>> out of the leftover by-priority phase).
>
> As you said, hierarchy-style and percentage-style QoS settings are not
> immediately available in mClock.
> The hierarchy style would have to be implemented as a multi-level
> mClock, as described above.
> In my opinion, the percentage-style QoS configuration would be
> possible if we did the total throughput estimation you described.
>
> For example, suppose we estimate throughput per OSD according to
> whether it is an SSD or HDD, and recalculate the total throughput
> every time an OSD is added.
> If we assume 5 IOPS per SSD-type OSD, a 16-OSD Ceph cluster has a
> total throughput of about 80 IOPS.
> When 10% is allocated to background I/O, 8 IOPS are internally set
> aside for background I/O in dmClock.
> This approach of assuming a total throughput per OSD can be seen in
> the QoS configuration of NetApp's SolidFire storage system.
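As a rough illustration, the back-of-the-envelope estimate above could be sketched as follows. This is hypothetical helper code, not dmclock or Ceph API; the 5 IOPS/SSD figure is the illustrative number from this thread, and the HDD figure is an equally made-up placeholder:

```python
# Hypothetical sketch of the percentage-style capacity estimate
# described above -- not actual dmclock/Ceph code. The per-device
# IOPS figures are illustrative placeholders from this thread.

DEVICE_IOPS = {"ssd": 5, "hdd": 2}  # assumed throughput per OSD type


def cluster_capacity(osd_types):
    """Total estimated IOPS, recomputed whenever an OSD is added or removed."""
    return sum(DEVICE_IOPS[t] for t in osd_types)


def background_reservation(osd_types, background_fraction=0.10):
    """IOPS set aside internally for background work (scrub, recovery, ...)."""
    return cluster_capacity(osd_types) * background_fraction


# 16 SSD OSDs -> 80 IOPS total; reserving 10% leaves 8 IOPS for background I/O.
total = cluster_capacity(["ssd"] * 16)
reserved = background_reservation(["ssd"] * 16)
```

The open question this sketch glosses over is where the per-device IOPS figures come from in the first place.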
> (https://www.slideshare.net/NetApp/gain-storage-control-with-sioc-and-take-performance-control-with-qos-from-solidfire-76354056#14)
> (In addition, SolidFire's QoS configuration automatically sets QoS
> through per-block IOPS settings.)

This also makes sense, but do we have any idea how to properly count up
cluster capacity and model it? I'm pretty sold on building a cost
function based on the bandwidth and number of IOs required by each op,
but measuring that curve on each OSD and converting the individual
values into a total cluster capacity eludes me.
-Greg
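For what it's worth, the cost-function idea above could be sketched as a simple linear model. Both the linear form and the device parameters here are pure assumptions for illustration; the hard part raised above, actually measuring these curves per OSD, is untouched:

```python
# Hypothetical linear op-cost model combining a bandwidth term and a
# per-IO term, as a sketch of the cost-function idea above. The device
# parameters would have to be measured per OSD, which is the open
# problem -- these defaults are made-up placeholders.

def op_cost_seconds(bytes_moved, num_ios,
                    bytes_per_sec=100e6,   # assumed device bandwidth
                    ios_per_sec=200.0):    # assumed device IOPS ceiling
    """Estimated device time (in seconds) consumed by one op."""
    return bytes_moved / bytes_per_sec + num_ios / ios_per_sec


# Example: a 4 MiB read issued as a single IO.
cost = op_cost_seconds(4 * 1024 * 1024, 1)
```

Summing such per-op costs against a measured per-OSD time budget would be one way to express percentage-style reservations without a fixed IOPS figure.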