Re: RFC: I/O bandwidth controller (was Re: Too many I/O controller patches)

Fernando Luis Vázquez Cao wrote:
> This RFC ended up being a bit longer than I had originally intended, but
> hopefully it will serve as the start of a fruitful discussion.
Thanks for posting this detailed RFC! A few comments below.
> As you pointed out, it seems that there is not much consensus building
> going on, but that does not mean there is a lack of interest. To get the
> ball rolling it is probably a good idea to clarify the state of things
> and try to establish what we are trying to accomplish.
>
> *** State of things in the mainstream kernel
> The kernel has had somewhat advanced I/O control capabilities for quite
> some time now: CFQ. But the current CFQ has some problems:
>   - I/O priority can be set by PID, PGRP, or UID, but...
>   - ...all the processes that fall within the same class/priority are
> scheduled together and arbitrary groupings are not possible.
>   - Buffered I/O is not handled properly.
>   - CFQ's I/O priority is an attribute of a process that affects all
> devices it sends I/O requests to. In other words, with the current
> implementation it is not possible to assign per-device I/O priorities to
> a task.
>
> *** Goals
>   1. Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity).
>   2. Being able to perform I/O bandwidth control independently on each
> device.
>   3. I/O bandwidth shaping.
>   4. Scheduler-independent I/O bandwidth control.
>   5. Usable with stacking devices (md, dm and other devices of that
> ilk).
>   6. I/O tracking (handle buffered and asynchronous I/O properly).
The same goals as above should also apply to I/O operations per second
(i.e., bandwidth understood not only in terms of bytes/sec), plus:
7. Optimal bandwidth usage: allow I/O limits to be exceeded to take
advantage of free/unused I/O resources (i.e. allow "bursts" when the
whole physical bandwidth of a block device is not fully used, and then
"throttle" again when I/O from unlimited cgroups comes into play).
8. "fair throttling": avoid to throttle always the same task within acgroup, but try to distribute the throttling among all the tasksbelonging to the throttle cgroup
> The list of goals above is not exhaustive and it is also likely to
> contain some not-so-nice-to-have features so your feedback would be
> appreciated.
>
> 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity)
>
> We obviously need this because our final goal is to be able to control
> the I/O generated by a Linux container. The good news is that we already
> have the cgroups infrastructure so, regarding this problem, we would
> just have to transform our I/O bandwidth controller into a cgroup
> subsystem.
>
> This seems to be the easiest part, but the current cgroups
> infrastructure has some limitations when it comes to dealing with block
> devices: the impossibility of creating/removing certain control
> structures dynamically and the hardcoding of subsystems (i.e. resource
> controllers). This makes it difficult to handle block devices that can
> be hotplugged and go away at any time (this applies not only to USB
> storage but also to some SATA and SCSI devices). To cope with this
> situation properly we would need hotplug support in cgroups, but, as
> suggested before and discussed in the past (see (0) below), there are
> some limitations.
>
> Even in the non-hotplug case it would be nice if we could treat each
> block I/O device as an independent resource, which means we could do
> things like allocating I/O bandwidth on a per-device basis. As long as
> performance is not compromised too much, adding some kind of basic
> hotplug support to cgroups is probably worth it.
>
> (0) http://lkml.org/lkml/2008/5/21/12
What about using (major,minor) numbers to identify each device and
account I/O statistics? If a device is unplugged we could reset its I/O
statistics and/or remove the I/O limitations for that device from
userspace (i.e. by a daemon), but plugging/unplugging the device would
not be blocked/affected in any case. Or am I oversimplifying the
problem?
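Roughly, what I am picturing is accounting keyed purely by dev_t, so
that nothing needs to pin the device itself. A minimal user-space
sketch (the table layout and the daemon-driven cleanup are my own
assumptions, not from any posted patch):

#include <stdint.h>
#include <string.h>
#include <sys/types.h>          /* dev_t */
#include <sys/sysmacros.h>      /* makedev(), major(), minor() */

#define MAX_DEVICES 64

/*
 * Per-device accounting slot keyed by (major,minor) alone: no reference
 * to the device's kernel structures is held, so hotplug is unaffected.
 */
struct dev_io_stats {
        dev_t    dev;           /* identity, e.g. makedev(8, 0) for sda */
        int      in_use;
        uint64_t read_bytes;
        uint64_t write_bytes;
};

static struct dev_io_stats table[MAX_DEVICES];

static struct dev_io_stats *find_or_add(dev_t dev)
{
        int i, free_slot = -1;

        for (i = 0; i < MAX_DEVICES; i++) {
                if (table[i].in_use && table[i].dev == dev)
                        return &table[i];
                if (!table[i].in_use && free_slot < 0)
                        free_slot = i;
        }
        if (free_slot < 0)
                return NULL;    /* table full; a real one would grow */
        memset(&table[free_slot], 0, sizeof(table[free_slot]));
        table[free_slot].dev = dev;
        table[free_slot].in_use = 1;
        return &table[free_slot];
}

/*
 * On unplug a user-space daemon simply clears the slot; the unplug
 * itself is never blocked waiting on the accounting.
 */
static void forget_device(dev_t dev)
{
        int i;

        for (i = 0; i < MAX_DEVICES; i++)
                if (table[i].in_use && table[i].dev == dev)
                        table[i].in_use = 0;
}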
> 3. & 4. & 5. - I/O bandwidth shaping & general design aspects
>
> The implementation of an I/O scheduling algorithm is to a certain extent
> influenced by what we are trying to achieve in terms of I/O bandwidth
> shaping, but, as discussed below, the required accuracy can determine
> the layer where the I/O controller has to reside. Off the top of my
> head, there are three basic operations we may want to perform:
>   - I/O nice prioritization: ionice-like approach.
>   - Proportional bandwidth scheduling: each process/group of processes
> has a weight that determines the share of bandwidth they receive.
>   - I/O limiting: set an upper limit to the bandwidth a group of tasks
> can use.
Using deadline-based I/O scheduling could be an interesting path to
explore as well, IMHO, to try to guarantee per-cgroup minimum bandwidth
requirements.
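For instance (a toy earliest-deadline-first dispatcher, entirely my own
sketch and not taken from any posted patch): each cgroup's deadline
advances in proportion to the service it receives, scaled by its
guaranteed rate, so a group still running below its guarantee naturally
keeps the earliest deadline and bubbles up to the front of the queue:

#include <stdint.h>

/* Hypothetical per-cgroup dispatch state; names are illustrative. */
struct io_group {
        uint64_t min_rate;      /* guaranteed bytes/sec for this cgroup */
        uint64_t deadline;      /* next service deadline (virtual usecs) */
        int      backlogged;    /* nonzero if requests are queued */
};

/* Earliest-deadline-first among backlogged groups; -1 if all idle. */
static int pick_group(struct io_group *g, int n)
{
        int i, best = -1;

        for (i = 0; i < n; i++) {
                if (!g[i].backlogged)
                        continue;
                if (best < 0 || g[i].deadline < g[best].deadline)
                        best = i;
        }
        return best;
}

/*
 * Charge 'bytes' of service: the deadline moves forward by the time
 * that much I/O is worth at the group's guaranteed rate, so heavy
 * consumers drift to later deadlines while groups short of their
 * minimum keep the earliest deadlines and are dispatched first.
 */
static void charge(struct io_group *g, uint64_t bytes)
{
        g->deadline += bytes * 1000000ULL / g->min_rate;
}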
>
> If we are pursuing an I/O prioritization model à la CFQ the temptation
> is to implement it at the elevator layer or extend any of the existing
> I/O schedulers.
>
> There have been several proposals that extend either the CFQ scheduler
> (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> with these controllers is that they are scheduler dependent, which means
> that they become unusable when we change the scheduler or when we want
> to control stacking devices which define their own make_request_fn
> function (md and dm come to mind). It could be argued that the physical
> devices controlled by a dm or md driver are likely to be fed by
> traditional I/O schedulers such as CFQ, but these I/O schedulers would
> be running independently from each other, each one controlling its own
> device, ignoring the fact that they are part of a stacking device. This
> lack of information at the elevator layer makes it pretty difficult to
> obtain accurate results when using stacking devices. It seems that
> unless we can make the elevator layer aware of the topology of stacking
> devices (possibly by extending the elevator API?) elevator-based
> approaches do not constitute a generic solution. Here onwards, for
> discussion purposes, I will refer to this type of I/O bandwidth
> controllers as elevator-based I/O controllers.
>
> A simple way of solving the problems discussed in the previous paragraph
> is to perform I/O control before the I/O actually enters the block
> layer, either at the pagecache level (when pages are dirtied) or at the
> entry point to the generic block layer (generic_make_request()).
> Andrea's I/O throttling patches stick to the former variant (see (4)
> below) and Tsuruta-san and Takahashi-san's dm-ioband (see (5) below)
> takes the latter approach. The rationale is that by hooking into the
> source of I/O requests we can perform I/O control in a topology-agnostic
> and elevator-agnostic way. I will refer to this new type of I/O
> bandwidth controller as block layer I/O controller.
>
> By residing just above the generic block layer the implementation of a
> block layer I/O controller becomes relatively easy, but by not taking
> into account the characteristics of the underlying devices we might risk
> underutilizing them. For this reason, in some cases it would probably
> make sense to complement a generic I/O controller with an elevator-based
> I/O controller, so that the maximum throughput can be squeezed from the
> physical devices.
>
> (1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
> (2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
> (3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
> (4) Andrea Righi's I/O bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
> (5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
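To illustrate why the block layer approach is naturally topology- and
elevator-agnostic, here is a kernel-flavored sketch (not a working
patch: the io_cg_* names are hypothetical helpers standing in for real
cgroup lookup and a real throttling policy) of a hook sitting at the
entry to generic_make_request():

/*
 * Sketch only. io_cg_from_current(), io_cg_may_submit(), io_cg_wait()
 * and io_cg_charge() are hypothetical helpers; bio->bi_bdev->bd_dev is
 * the (major,minor) identity of the target device.
 */
static void io_controller_hook(struct bio *bio)
{
        struct io_cgroup *cg = io_cg_from_current();
        dev_t dev = bio->bi_bdev->bd_dev;

        /*
         * Because we run before the bio reaches any elevator, the same
         * check covers plain disks and md/dm devices that install
         * their own make_request_fn and never see a conventional
         * I/O scheduler.
         */
        while (!io_cg_may_submit(cg, dev, bio->bi_size))
                io_cg_wait(cg, dev);    /* throttle: sleep until credit */

        io_cg_charge(cg, dev, bio->bi_size);
}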
> 6.- I/O tracking
>
> This is arguably the most important part, since to perform I/O control
> we need to be able to determine where the I/O is coming from.
>
> Reads are trivial because they are served in the context of the task
> that generated the I/O. But most writes are performed by pdflush,
> kswapd, and friends, so performing I/O control just in the synchronous
> I/O path would lead to large inaccuracies. To get this right we would
> need to track ownership all the way up to the pagecache page. In other
> words, it is necessary to track who is dirtying pages so that when they
> are written to disk the right task is charged for that I/O.
>
> Fortunately, such tracking of pages is one of the things the existing
> memory resource controller is doing to control memory usage. This is a
> clever observation which has a useful implication: if the rather
> imbricated tracking and accounting parts of the memory resource
> controller were split, the I/O controller could leverage the existing
> infrastructure to track buffered and asynchronous I/O. This is exactly
> what the bio-cgroup (see (6) below) patches set out to do.
>
> It is also possible to do without I/O tracking. For that we would need
> to hook into the synchronous I/O path and every place in the kernel
> where pages are dirtied (see (4) above for details). However,
> controlling the rate at which a cgroup can generate dirty pages seems
> to be a task that belongs in the memory controller, not the I/O
> controller. As Dave and Paul suggested, it is probably better to
> delegate this to the memory controller. In fact, it seems that
> Yamamoto-san is cooking some patches that implement just that: dirty
> balancing for cgroups (see (7) for details).
>
> Another argument in favor of I/O tracking is that not only block layer
> I/O controllers would benefit from it, but also the existing I/O
> schedulers and the elevator-based I/O controllers proposed by
> Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
> are working on this and hopefully will be sending patches soon).
>
> (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> (7) Yamamoto-san's dirty balancing patches: http://lwn.net/Articles/289237/
>
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and an elevator-based one, and they could certainly
> coexist. All of them need I/O tracking capabilities, so I would like to
> suggest the plan below to get things started:
>
>   - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.
>   - Fix CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable this should suffice to
> convince the respective maintainers and get the I/O tracking patches
> merged.
>   - Implement a block layer resource controller. dm-ioband is a working
> solution and feature rich, but its dependency on the dm infrastructure
> is likely to find opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
>   - If the I/O tracking patches make it into the kernel we could move on
> and try to get the cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
>   - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
>
> This RFC is somewhat vague, but my feeling is that we should build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
>
> I would appreciate your comments and feedback.
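On the tracking point: the flow that matters is captured by two hooks,
one at dirty time and one at writeback time. A deliberately
stripped-down sketch (the real bio-cgroup patches store an id in
struct page_cgroup; the types and helper names below are mine, purely
to show the hand-off):

/* Hypothetical per-page owner record (bio-cgroup keeps an id like this). */
struct pc_track {
        unsigned long cgid;     /* cgroup that dirtied the page */
};

/*
 * 1. Dirtying runs in the context of the task writing the page, so the
 *    current task's cgroup is meaningful here and we capture it.
 */
static void on_page_dirtied(struct pc_track *pc, unsigned long current_cgid)
{
        pc->cgid = current_cgid;
}

/*
 * 2. Writeback runs in pdflush/kswapd context, where the current task
 *    tells us nothing about who generated the I/O; instead of charging
 *    it, propagate the recorded owner into the outgoing bio so the
 *    block layer controller (or CFQ/AS) can bill the right cgroup.
 */
static void on_writeback_bio(const struct pc_track *pc,
                             unsigned long *bio_cgid)
{
        *bio_cgid = pc->cgid;
}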
Very nice RFC.
-Andrea

