RFC: I/O bandwidth controller (was Re: Too many I/O controller patches)

Fernando Luis Vázquez Cao <fernando@xxxxxxxxxxxxx> · Wed, 06 Aug 2008 10:13:09 +0900



On Mon, 2008-08-04 at 10:20 -0700, Dave Hansen wrote:
> On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
> > This series of patches of dm-ioband now includes "The bio tracking mechanism,"
> > which has been posted individually to this mailing list.
> > This makes it easy for anybody to control the I/O bandwidth even when
> > the I/O is one of delayed-write requests.
>
> During the Containers mini-summit at OLS, it was mentioned that there
> are at least *FOUR* of these I/O controllers floating around.  Have you
> talked to the other authors?  (I've cc'd at least one of them).
>
> We obviously can't come to any kind of real consensus with people just
> tossing the same patches back and forth.
>
> -- Dave

Hi Dave,
I have been tracking the memory controller patches for a while, which spurred my interest in cgroups and prompted me to start working on I/O bandwidth controlling mechanisms. This year I have had several opportunities to discuss the design challenges of I/O controllers with the NEC and VALinux Japan teams (CCed), most recently last month during the Linux Foundation Japan Linux Symposium, where we took advantage of Andrew Morton's visit to Japan to do some brainstorming on this topic. I will try to summarize what was discussed there (and in the Linux Storage & Filesystem Workshop earlier this year) and propose a hopefully acceptable way to proceed and try to get things started.
This RFC ended up being a bit longer than I had originally intended, but hopefully it will serve as the start of a fruitful discussion.
As you pointed out, it seems that there is not much consensus building going on, but that does not mean there is a lack of interest. To get the ball rolling it is probably a good idea to clarify the state of things and try to establish what we are trying to accomplish.
*** State of things in the mainstream kernel
The kernel has had somewhat advanced I/O control capabilities for quite some time now: CFQ. But the current CFQ has some problems:
  - I/O priority can be set by PID, PGRP, or UID, but...
  - ...all the processes that fall within the same class/priority are scheduled together and arbitrary groupings are not possible.
  - Buffered I/O is not handled properly.
  - CFQ's I/O priority is an attribute of a process that affects all devices it sends I/O requests to. In other words, with the current implementation it is not possible to assign per-device I/O priorities to a task.
*** Goals
  1. Cgroups-aware I/O scheduling (being able to define arbitrary groupings of processes and treat each group as a single scheduling entity).
  2. Being able to perform I/O bandwidth control independently on each device.
  3. I/O bandwidth shaping.
  4. Scheduler-independent I/O bandwidth control.
  5. Usable with stacking devices (md, dm and other devices of that ilk).
  6. I/O tracking (handle buffered and asynchronous I/O properly).
The list of goals above is not exhaustive and it is also likely to contain some not-so-nice-to-have features, so your feedback would be appreciated.
1. & 2. - Cgroups-aware I/O scheduling (being able to define arbitrary groupings of processes and treat each group as a single scheduling entity) & per-device I/O bandwidth control
We obviously need this because our final goal is to be able to control the I/O generated by a Linux container. The good news is that we already have the cgroups infrastructure so, regarding this problem, we would just have to transform our I/O bandwidth controller into a cgroup subsystem.
This seems to be the easiest part, but the current cgroups infrastructure has some limitations when it comes to dealing with block devices: certain control structures cannot be created or removed dynamically, and subsystems (i.e. resource controllers) are hardcoded. This makes it difficult to handle block devices that can be hotplugged and go away at any time (this applies not only to USB storage but also to some SATA and SCSI devices). To cope with this situation properly we would need hotplug support in cgroups, but, as suggested before and discussed in the past (see (0) below), there are some limitations.
Even in the non-hotplug case it would be nice if we could treat each block I/O device as an independent resource, which means we could do things like allocating I/O bandwidth on a per-device basis. As long as performance is not compromised too much, adding some kind of basic hotplug support to cgroups is probably worth it.
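To illustrate what "each block device as an independent resource" could look like from the controller's point of view, here is a minimal userspace sketch in Python. It is not kernel code and it does not reflect the cgroups API; the class name, the default weight of 100 and the device names are all hypothetical. The only point is that settings are keyed by device, and that the per-device state has to be created and torn down as devices come and go.

```python
class PerDeviceWeights:
    """Toy model of per-group, per-device I/O weights with hotplug."""

    DEFAULT_WEIGHT = 100  # hypothetical default, chosen for illustration

    def __init__(self):
        # device name -> {group name -> weight}
        self._weights = {}

    def device_added(self, device):
        # Hotplug: create the per-device control structure on demand.
        self._weights.setdefault(device, {})

    def device_removed(self, device):
        # Hot-unplug: drop every setting attached to the device.
        self._weights.pop(device, None)

    def set_weight(self, group, device, weight):
        if device not in self._weights:
            raise KeyError(f"unknown device: {device}")
        self._weights[device][group] = weight

    def weight(self, group, device):
        # Groups without an explicit setting get the default weight.
        return self._weights[device].get(group, self.DEFAULT_WEIGHT)


table = PerDeviceWeights()
table.device_added("sda")
table.device_added("sdb")
table.set_weight("db", "sda", 300)   # "db" is favoured on sda only
```

In this model the group "db" has weight 300 on sda but still the default weight on sdb, which is exactly what a per-process priority cannot express.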
(0) http://lkml.org/lkml/2008/5/21/12
3. & 4. & 5. - I/O bandwidth shaping & General design aspects
The implementation of an I/O scheduling algorithm is to a certain extent influenced by what we are trying to achieve in terms of I/O bandwidth shaping, but, as discussed below, the required accuracy can determine the layer where the I/O controller has to reside. Off the top of my head, there are three basic operations we may want to perform:
  - I/O nice prioritization: ionice-like approach.
  - Proportional bandwidth scheduling: each process/group of processes has a weight that determines the share of bandwidth they receive.
  - I/O limiting: set an upper limit to the bandwidth a group of tasks can use.
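To make the proportional bandwidth case concrete, here is a minimal userspace sketch in Python (not kernel code). The function name, the group names and the work-conserving behaviour (the share of idle groups is redistributed among the active ones) are assumptions made for illustration; none of the patch sets discussed here is known to compute shares exactly this way.

```python
def proportional_shares(weights, total_bandwidth, active=None):
    """Split total_bandwidth among groups in proportion to their weights.

    weights:         mapping of group name -> weight
    total_bandwidth: bandwidth to distribute (any unit, e.g. MB/s)
    active:          optional set of groups that currently have I/O pending;
                     groups outside it receive nothing and their share is
                     redistributed (a work-conserving policy, assumed here)
    """
    if active is None:
        groups = dict(weights)
    else:
        groups = {g: w for g, w in weights.items() if g in active}
    total_weight = sum(groups.values())
    if total_weight <= 0:
        raise ValueError("at least one active group needs a positive weight")
    return {g: total_bandwidth * w / total_weight for g, w in groups.items()}


weights = {"db": 60, "web": 30, "batch": 10}
all_busy = proportional_shares(weights, 100.0)
batch_idle = proportional_shares(weights, 100.0, active={"db", "web"})
```

With all three groups busy the 100 units are split 60/30/10. When "batch" is idle, "db" and "web" split the whole device 2:1. I/O limiting differs in that a group stays capped even when the device is otherwise idle.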
If we are pursuing an I/O prioritization model à la CFQ, the temptation is to implement it at the elevator layer or extend any of the existing I/O schedulers.
There have been several proposals that extend either the CFQ scheduler (see (1), (2) below) or the AS scheduler (see (3) below). The problem with these controllers is that they are scheduler dependent, which means that they become unusable when we change the scheduler or when we want to control stacking devices which define their own make_request_fn function (md and dm come to mind). It could be argued that the physical devices controlled by a dm or md driver are likely to be fed by traditional I/O schedulers such as CFQ, but these I/O schedulers would be running independently from each other, each one controlling its own device and ignoring the fact that they are part of a stacking device. This lack of information at the elevator layer makes it pretty difficult to obtain accurate results when using stacking devices. It seems that unless we can make the elevator layer aware of the topology of stacking devices (possibly by extending the elevator API?) elevator-based approaches do not constitute a generic solution. From here onwards, for discussion purposes, I will refer to this type of I/O bandwidth controller as elevator-based I/O controllers.
A simple way of solving the problems discussed in the previous paragraph is to perform I/O control before the I/O actually enters the block layer, either at the pagecache level (when pages are dirtied) or at the entry point to the generic block layer (generic_make_request()). Andrea's I/O throttling patches stick to the former variant (see (4) below) and Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) take the latter approach. The rationale is that by hooking into the source of I/O requests we can perform I/O control in a topology-agnostic and elevator-agnostic way. I will refer to this new type of I/O bandwidth controller as block layer I/O controller.
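As an illustration of control applied at the point where requests are submitted, here is a minimal userspace sketch of a token bucket in Python. A token bucket is just one common way to enforce an upper bandwidth limit; it is used here as an assumption and is not claimed to be what the I/O throttling or dm-ioband patches do. The class name, the rate and the burst size are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Toy bandwidth limiter consulted before a request is submitted."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = 0.0                   # time of the last refill

    def _refill(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def delay_for(self, nbytes, now):
        """Charge nbytes at time `now`.

        Returns how many seconds the submitter should sleep before the
        request is allowed through (0.0 if it is within the limit).
        """
        self._refill(now)
        self.tokens -= nbytes          # may go negative: that is the debt
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate


# One bucket per group: 1 MB/s sustained, 500 KB burst (made-up numbers).
bucket = TokenBucket(rate_bytes_per_sec=1_000_000, burst_bytes=500_000)
first = bucket.delay_for(500_000, now=0.0)    # fits in the burst
second = bucket.delay_for(250_000, now=0.0)   # exceeds it, must wait
third = bucket.delay_for(0, now=0.25)         # debt repaid after the wait
```

Because the check happens before the request reaches any particular queue, the same limiter works regardless of which elevator is in use and whether the target is a physical or a stacking device.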
By residing just above the generic block layer the implementation of a block layer I/O controller becomes relatively easy, but by not taking into account the characteristics of the underlying devices we might risk underutilizing them. For this reason, in some cases it would probably make sense to complement a generic I/O controller with an elevator-based I/O controller, so that the maximum throughput can be squeezed from the physical devices.
(1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
(2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
(3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
(4) Andrea Righi's I/O bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
(5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
6.- I/O tracking
This is arguably the most important part, since to perform I/O control we need to be able to determine where the I/O is coming from.
Reads are trivial because they are served in the context of the task that generated the I/O. But most writes are performed by pdflush, kswapd, and friends, so performing I/O control just in the synchronous I/O path would lead to large inaccuracy. To get this right we would need to track ownership all the way up to the pagecache page. In other words, it is necessary to track who is dirtying pages so that when they are written to disk the right task is charged for that I/O.
Fortunately, such tracking of pages is one of the things the existing memory resource controller is doing to control memory usage. This is a clever observation which has a useful implication: if the rather intertwined tracking and accounting parts of the memory resource controller were split, the I/O controller could leverage the existing infrastructure to track buffered and asynchronous I/O. This is exactly what the bio-cgroup (see (6) below) patches set out to do.
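The idea of charging writeback to whoever dirtied the page can be sketched in a few lines of userspace Python. This is only a model: the class and method names are hypothetical, the "first dirtier owns the page" policy is an assumption, and nothing here mirrors the data structures of the memory controller or of the bio-cgroup patches.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size, for accounting only


class PageOwnerTracker:
    """Toy model: remember who dirtied a page, charge them at writeback."""

    def __init__(self):
        self.owner_of = {}               # page id -> owning group
        self.charged = defaultdict(int)  # group -> bytes written back

    def mark_dirty(self, page, group):
        # Runs in the context of the task that writes to the page cache.
        # Assumed policy: the first group to dirty a clean page owns it.
        self.owner_of.setdefault(page, group)

    def writeback(self, page):
        # Runs in the flusher's context (think pdflush), so the current
        # task is useless for accounting; look up the recorded owner.
        group = self.owner_of.pop(page, None)
        if group is None:
            return None                  # page was not dirty
        self.charged[group] += PAGE_SIZE
        return group


tracker = PageOwnerTracker()
tracker.mark_dirty(1, "A")
tracker.mark_dirty(2, "B")
tracker.mark_dirty(1, "B")   # page 1 is already owned by "A"
owners = [tracker.writeback(p) for p in (1, 2, 3)]
```

Here both pages are written by the flusher, yet group "A" is charged for page 1 and group "B" for page 2; page 3 was never dirtied and charges nobody.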
It is also possible to do without I/O tracking. For that we would need to hook into the synchronous I/O path and every place in the kernel where pages are dirtied (see (4) above for details). However, controlling the rate at which a cgroup can generate dirty pages seems to be a task that belongs in the memory controller, not the I/O controller. As Dave and Paul suggested, it's probably better to delegate this to the memory controller. In fact, it seems that Yamamoto-san is cooking some patches that implement just that: dirty balancing for cgroups (see (7) for details).
Another argument in favor of I/O tracking is that not only block layer I/O controllers would benefit from it, but also the existing I/O schedulers and the elevator-based I/O controllers proposed by Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and I are working on this and hopefully will be sending patches soon).
(6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
(7) Yamamoto-san's dirty balancing patches: http://lwn.net/Articles/289237/
*** How to move on
As discussed before, it probably makes sense to have both a block layer I/O controller and an elevator-based one, and they could certainly coexist. Since all of them need I/O tracking capabilities, I would like to suggest the plan below to get things started:
  - Improve the I/O tracking patches (see (6) above) until they are in mergeable shape.
  - Fix CFQ and AS to use the new I/O tracking functionality to show its benefits. If the performance impact is acceptable this should suffice to convince the respective maintainer and get the I/O tracking patches merged.
  - Implement a block layer resource controller. dm-ioband is a working solution and feature rich but its dependency on the dm infrastructure is likely to find opposition (the dm layer does not handle barriers properly and the maximum size of I/O requests can be limited in some cases). In such a case, we could either try to build a standalone resource controller based on dm-ioband (which would probably hook into generic_make_request) or try to come up with something new.
  - If the I/O tracking patches make it into the kernel we could move on and try to get the cgroup extensions to CFQ and AS mentioned before (see (1), (2), and (3) above for details) merged.
  - Delegate the task of controlling the rate at which a task can generate dirty pages to the memory controller.
This RFC is somewhat vague, but my feeling is that we should build some consensus on the goals and basic design aspects before delving into implementation details.
I would appreciate your comments and feedback.
- Fernando