Hi Andrea!

On Thu, 2008-08-07 at 09:46 +0200, Andrea Righi wrote:
> Fernando Luis Vázquez Cao wrote:
> > This RFC ended up being a bit longer than I had originally intended, but
> > hopefully it will serve as the start of a fruitful discussion.
>
> Thanks for posting this detailed RFC! A few comments below.
>
> > As you pointed out, it seems that there is not much consensus building
> > going on, but that does not mean there is a lack of interest. To get the
> > ball rolling it is probably a good idea to clarify the state of things
> > and try to establish what we are trying to accomplish.
> >
> > *** State of things in the mainstream kernel
> > The kernel has had somewhat advanced I/O control capabilities for quite
> > some time now: CFQ. But the current CFQ has some problems:
> > - I/O priority can be set by PID, PGRP, or UID, but...
> > - ...all the processes that fall within the same class/priority are
> >   scheduled together and arbitrary groupings are not possible.
> > - Buffered I/O is not handled properly.
> > - CFQ's IO priority is an attribute of a process that affects all
> >   devices it sends I/O requests to. In other words, with the current
> >   implementation it is not possible to assign per-device IO priorities to
> >   a task.
> >
> > *** Goals
> > 1. Cgroups-aware I/O scheduling (being able to define arbitrary
> >    groupings of processes and treat each group as a single scheduling
> >    entity).
> > 2. Being able to perform I/O bandwidth control independently on each
> >    device.
> > 3. I/O bandwidth shaping.
> > 4. Scheduler-independent I/O bandwidth control.
> > 5. Usable with stacking devices (md, dm and other devices of that
> >    ilk).
> > 6. I/O tracking (handle buffered and asynchronous I/O properly).
>
> The same as above also for IO operations/sec (bandwidth intended not only
> in terms of bytes/sec), plus:
>
> 7. Optimal bandwidth usage: allow exceeding the IO limits to take
> advantage of free/unused IO resources (i.e.
> allow "bursts" when the
> whole physical bandwidth for a block device is not fully used and then
> "throttle" again when IO from unlimited cgroups comes into place)
>
> 8. "fair throttling": avoid always throttling the same task within a
> cgroup, but try to distribute the throttling among all the tasks
> belonging to the throttled cgroup

Thank you for the ideas!

By the way, point "3." above (I/O bandwidth shaping) refers to IO
scheduling algorithms in general. When I wrote the RFC I thought that,
once we have the IO tracking and accounting mechanisms in place, choosing
and implementing an algorithm (fair throttling, proportional bandwidth
scheduling, etc.) would be easy in comparison, and that is the reason a
list was not included.

Once I get more feedback from all of you I will resend a more detailed
RFC that will include your suggestions.

> > 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> > groupings of processes and treat each group as a single scheduling
> > identity)
> >
> > We obviously need this because our final goal is to be able to control
> > the IO generated by a Linux container. The good news is that we already
> > have the cgroups infrastructure so, regarding this problem, we would
> > just have to transform our I/O bandwidth controller into a cgroup
> > subsystem.
> >
> > This seems to be the easiest part, but the current cgroups
> > infrastructure has some limitations when it comes to dealing with block
> > devices: impossibility of creating/removing certain control structures
> > dynamically and hardcoding of subsystems (i.e. resource controllers).
> > This makes it difficult to handle block devices that can be hotplugged
> > and go away at any time (this applies not only to usb storage but also
> > to some SATA and SCSI devices). To cope with this situation properly we
> > would need hotplug support in cgroups, but, as suggested before and
> > discussed in the past (see (0) below), there are some limitations.
> >
> > Even in the non-hotplug case it would be nice if we could treat each
> > block I/O device as an independent resource, which means we could do
> > things like allocating I/O bandwidth on a per-device basis. As long as
> > performance is not compromised too much, adding some kind of basic
> > hotplug support to cgroups is probably worth it.
> >
> > (0) http://lkml.org/lkml/2008/5/21/12
>
> What about using major,minor numbers to identify each device and account
> IO statistics? If a device is unplugged we could reset IO statistics
> and/or remove IO limitations for that device from userspace (i.e. by a
> daemon), but plugging/unplugging the device would not be blocked/affected
> in any case. Or am I oversimplifying the problem?

If a resource we want to control (a block device in this case) is
hot-plugged/unplugged, the corresponding cgroup-related structures inside
the kernel need to be allocated/freed dynamically, respectively. The
problem is that this is not always possible. For example, with the
current implementation of cgroups it is not possible to treat each block
device as a different cgroup subsystem/resource controller, because
subsystems are created at compile time.

> > 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
> >
> > The implementation of an I/O scheduling algorithm is to a certain extent
> > influenced by what we are trying to achieve in terms of I/O bandwidth
> > shaping, but, as discussed below, the required accuracy can determine
> > the layer where the I/O controller has to reside. Off the top of my
> > head, there are three basic operations we may want to perform:
> > - I/O nice prioritization: ionice-like approach.
> > - Proportional bandwidth scheduling: each process/group of processes
> >   has a weight that determines the share of bandwidth they receive.
> > - I/O limiting: set an upper limit to the bandwidth a group of tasks
> >   can use.
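
To make the difference between the last two operations concrete, here is
a minimal sketch in Python. It is purely illustrative: the group names,
the weights, the 100 MB/s device capacity and both helper functions are
made up for this example and do not correspond to any existing kernel
code or interface.

```python
def proportional_shares(weights, capacity):
    """Proportional bandwidth scheduling: under contention, split
    `capacity` among the groups in proportion to their weights."""
    total = sum(weights.values())
    return {group: capacity * w / total for group, w in weights.items()}


def apply_limit(demand, limit):
    """I/O limiting: a group never gets more than its upper limit,
    no matter how much bandwidth it asks for."""
    return min(demand, limit)


# Hypothetical groups contending for a device that can do 100 MB/s.
weights = {"db": 3, "backup": 1}
shares = proportional_shares(weights, capacity=100.0)

# Hypothetical hard cap of 25 MB/s on a group that wants 40 MB/s.
capped = apply_limit(demand=40.0, limit=25.0)
```

With these numbers "db" would get three quarters of the device and
"backup" one quarter while both are busy, whereas the limit applies even
when the device is otherwise idle.
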
>
> Using deadline-based IO scheduling could be an interesting path to
> explore as well, IMHO, to try to guarantee per-cgroup minimum bandwidth
> requirements.

Please note that the only thing we can do is to guarantee minimum
bandwidth requirements when there is contention for an IO resource, which
is precisely what a proportional bandwidth scheduler does. Am I missing
something?

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel