On Fri, Apr 22, 2011 at 12:21:23PM +0800, Wu Fengguang wrote:

[..]

> > > BTW, I'd like to advocate a balance_dirty_pages() based IO controller :)
> >
> > Actually, implementing throttling in balance_dirty_pages() is not hard. I
> > think it has the following issues.
> >
> > - It controls the IO rate coming into the page cache and does not control
> >   the IO rate at the outgoing devices. So a flusher thread can still throw
> >   lots of writes at a device, completely disrupting read latencies.
> >
> >   If buffered WRITES can disrupt READ latencies unexpectedly, then it kind
> >   of renders the IO controller/throttling useless.
>
> Hmm.. I doubt the IO controller is the right solution to this problem at
> all.
>
> It's such a fundamental problem that it would be Linux's failure to
> recommend that normal users use the IO controller for the sake of good
> read latencies in the presence of heavy writes.

It is, and we have modified CFQ a lot to tackle that, but still... Just do a
"dd if=/dev/zero of=/zerofile bs=1M count=4K" on your root disk, then try to
launch firefox and browse a few websites, and see if you are happy with the
responsiveness of firefox. It took me more than a minute to launch firefox
and be able to type in and load the first website.

But I agree that READ latencies in the presence of WRITES can be a problem
independent of the IO controller.

There is also the cluster case, where IO reaches the storage from multiple
hosts and one does not want a flurry of WRITES from one host to severely
impact the IO of the other hosts. In that case the IO scheduler can't do
much, as it has the view of a single system only.

Secondly, the whole point of the IO controller is that it gives the user
more control over IO instead of living with a default system-wide policy.
For example, an admin might want better latencies for READS and be willing
to give up WRITE throughput for it.
So if the IO controller is properly implemented, he might say: "I am putting
my WRITE-intensive application in a cgroup with a WRITE limit of 20MB/s."
Now the READ latencies in the root cgroup should be better, and maybe
predictable too, as we know the WRITE rate to disk never exceeds 20MB/s.

Also, it is only CFQ which gives READS so much preference over WRITES.
deadline and noop do not, and those are what we typically use on faster
storage. There we might take a bigger hit on READ latencies, depending on
what the storage is and how affected it is by a burst of WRITES.

I guess it boils down to better system control and better predictability.

So I think throttling buffered writes in balance_dirty_pages() is better
than not providing any way to control buffered WRITES at all, but
controlling them at the end device provides much better control over IO and
serves more use cases.

> It actually helps reduce seeks when the flushers submit async write
> requests in bursts (eg. 1 second). It will then kind of optimally "work
> on this bdi area on behalf of this flusher for 1 second, and then on the
> other area for 1 second...". The IO scheduler should have similar
> optimizations, which should generally work better with more clustered
> data supplies from the flushers. (Sorry, I'm not tracking the cfq code,
> so it's all general hypothesis; please correct me...)

Isolation and throughput are orthogonal. If you go for better isolation,
you will essentially pay with reduced throughput, and as a user one can
decide what one's priorities are. I see it as a slider with 100% isolation
on one end and 100% throughput on the other; a user can keep the slider
somewhere in between depending on his/her needs. One of the goals of the IO
controller is to provide that fine-grained control. By implementing
throttling in balance_dirty_pages() we lose that capability, and the
flusher will still submit requests in bursts.
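Just to make the "WRITE limit of 20MB/s" idea concrete: the device-level
limiting is essentially per-cgroup token-bucket accounting on submitted
bytes. Here is a user-space toy of the concept (my own sketch for
illustration; the names are made up and this is not the actual blk-throttle
code):

```python
import time

class WriteThrottle:
    """Toy token-bucket throttle, conceptually similar to a per-cgroup
    device-level WRITE cap (e.g. 20MB/s). Illustration only."""

    def __init__(self, bps):
        self.bps = bps                # allowed bytes per second
        self.tokens = 0.0             # accumulated byte credits
        self.last = time.monotonic()

    def delay_for(self, nr_bytes):
        """Return how long an IO of nr_bytes must wait to stay under bps."""
        now = time.monotonic()
        # Accrue credits for the time elapsed, capping the burst at
        # one second's worth so a long idle period can't defeat the limit.
        self.tokens = min(self.tokens + (now - self.last) * self.bps,
                          self.bps)
        self.last = now
        if nr_bytes <= self.tokens:
            self.tokens -= nr_bytes
            return 0.0
        deficit = nr_bytes - self.tokens
        self.tokens = 0.0
        return deficit / self.bps     # sleep long enough to earn the deficit
```

The point of doing this below the page cache is that the delay is charged to
the bio at the device, not to the task dirtying pages, so flusher bursts are
bounded too.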
The flusher will still pick one inode at a time, so IO is as sequential as
possible, and we will still do the IO-less throttling to reduce seeks. Doing
IO throttling below the page cache also gives us the capability to control
flusher IO bursts, a fine-grained control the user loses if we do the
control while entering the page cache.

> The IO scheduler looks like the right owner to safeguard read latencies.
> Where you already have the commit 365722bb917b08b7 ("cfq-iosched: delay
> async IO dispatch, if sync IO was just done") and friends. They do such a
> good job that if there are continual reads, the async writes will be
> totally starved.
>
> But yeah, that still leaves sporadic reads at the mercy of heavy writes,
> where the default policy will prefer write throughput to read latencies.

Well, there is no default policy as such. CFQ tries to prioritize READs as
much as it can; deadline not as much. So, as I said previously, we really
are not controlling the burst. We are leaving it to the IO scheduler to
handle as per its policy, and we lose isolation between the groups, which
is the primary purpose of the IO controller. IOW, doing throttling below
the page cache allows us much better/smoother control of IO.

> And there is the "no heavy writes to saturate the disk in the long term,
> but still temporal heavy writes created by the bursty flushing" case. In
> this case the device-level throttling has the nice side effect of
> smoothing writes out without performance penalties. However, if it's so
> useful that you regard it as an important target, why not build some
> smoothing logic into the flushers? It has the great prospect of
> benefiting _all_ users _by default_ :)

We have already implemented the control at the lower layers, so we really
don't have to build a secondary control now. The rest of the subsystems
just have to be aware of cgroups and play nicely. At a high level, the
smoothing logic is just another throttling technique.
Whether to throttle the process abruptly or to use a more complex technique
to smooth out the traffic is just a knob. The key question here is where to
put the knob in the stack for the maximum degree of control.

The flusher logic is already complicated. I am not sure what we would gain
by teaching the flushers about IO rates and throttling based on user
policies. We can let the lower layers do it, as long as we can make sure
the flusher is aware of cgroups and can select inodes to flush in such a
manner that it does not get blocked behind slow cgroups and can keep all
the cgroups busy.

The challenge I am facing here is the filesystem dependencies on IO. One
example is that if I throttle fsync IO, it leads to issues with
journalling, and other IO in the filesystem seems to stop.

> > - For the application performance, I thought a better mechanism would
> >   be that we come up with a per-cgroup dirty ratio. This is equivalent
> >   to partitioning the page cache and coming up with the cgroup's share.
> >   Now an application can write to this cache as fast as it wants and is
> >   only throttled by the balance_dirty_pages() rules.
> >
> >   All this IO must be going to some device, and if an admin has put
> >   this cgroup in a low-bandwidth group, then pages from this cgroup
> >   will be written slowly, hence tasks in this group will be blocked for
> >   a longer time.
> >
> >   If we can make this work, then the application can write to the cache
> >   at a higher rate while not creating havoc at the end device.
>
> The memcg dirty ratio is fundamentally different from blkio throttling.
> The former aims to eliminate excessive pageout()s when reclaiming pages
> from the memcg LRU lists. It treats "dirty pages" as the throttle goal,
> and has the side effect of throttling the task at the rate the memcg's
> dirty inodes can be flushed to disk. Its complexity originates from the
> correlation with "how the flusher selects the inodes to writeout".
> Unfortunately the flusher by nature works in a coarse way..
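To illustrate the kind of cgroup awareness I mean in the flusher (this is a
toy of mine, not the actual fs-writeback code; the names are invented): walk
the cgroups round-robin and skip any cgroup whose device-level throttle is
currently congested, so one slow cgroup cannot block writeback for the rest.

```python
from collections import deque

def flush_round_robin(cgroups, can_submit, write_one_inode):
    """Toy cgroup-aware flusher loop (illustration only).

    cgroups: dict mapping cgroup name -> deque of dirty inodes
    can_submit(cg): False while cg's device-level throttle is congested
    write_one_inode(inode): submit writeback for one inode
    Returns the order in which inodes were flushed.
    """
    order = []
    while any(cgroups.values()):
        progress = False
        for cg, dirty in cgroups.items():
            if not dirty:
                continue
            if not can_submit(cg):
                # Don't block behind a slow cgroup; move on and keep
                # the other cgroups' devices busy.
                continue
            write_one_inode(inode := dirty.popleft())
            order.append(inode)
            progress = True
        if not progress:
            break  # everything left is throttled; the flusher would sleep
    return order
```

The real problem is harder (inodes can hold pages from several cgroups, and
journalling serializes IO as noted above), but this is the selection
behaviour I am after.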
The memcg dirty ratio is a different problem, but it needs to work with the
IO controller to solve the whole issue. If IO were just direct IO, with no
page cache in the picture, we wouldn't need memcg. But the moment the page
cache comes into the picture, so does the notion of logically dividing that
page cache among cgroups, and with it the notion of a per-cgroup dirty
ratio: even if the overall cache usage is low, once a cgroup has consumed
its share of dirty pages we need to throttle it and ask the flusher to send
IO to the underlying devices.

The IO controller sits below the page cache. So we need to make sure that
memcg is enhanced to support a per-cgroup dirty ratio, and train the flusher
threads so that they are aware of cgroups and can do writeout in a
per-memcg-aware manner. Greg Thelen is working on putting these two pieces
together.

So the memcg dirty ratio is a different problem, but it is required to make
the IO controller work for buffered WRITES.

> OTOH, blkio-cgroup doesn't need to care about inode selection at all.
> It's enough to account and throttle tasks' dirty rate, and let the
> flusher freely work on whatever dirtied inodes.

That goes back to the model of putting the knob in balance_dirty_pages().
Yes, it simplifies the implementation, but it also takes away the
capability of better control. One would still see bursts of WRITES at the
end devices.

> In this manner, blkio-cgroup dirty rate throttling is more user oriented.
> While memcg dirty pages throttling looks like a complex solution to some
> technical problems (if I understand it right).

If we implement IO throttling in balance_dirty_pages(), then we don't
require the memcg dirty ratio for it to work. But we will still require the
memcg dirty ratio for other reasons:

- Proportional IO control for CFQ.
- memcg's own problem of starting to write out pages from a cgroup earlier.
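Conceptually the per-cgroup dirty ratio just scales the global dirty limit
by the cgroup's share of memory. A minimal sketch of the arithmetic (my own
illustration of the idea, not Greg's patches; all names and the exact policy
are made up):

```python
def cgroup_dirty_limit(global_dirty_limit, cgroup_mem, total_mem,
                       cgroup_dirty_ratio=20):
    """Toy per-cgroup dirty threshold, in pages (illustration only).

    The cgroup gets a slice of the global dirty limit proportional to its
    memory share, further capped by its own dirty_ratio percentage.
    """
    by_share = int(global_dirty_limit * cgroup_mem / total_mem)
    by_ratio = cgroup_mem * cgroup_dirty_ratio // 100
    return min(by_share, by_ratio)

def over_dirty_limit(cgroup_dirty_pages, limit):
    """Once the cgroup exceeds its share of dirty pages, it gets throttled
    in balance_dirty_pages() and the flusher is kicked for its inodes."""
    return cgroup_dirty_pages > limit
```

The point is that the task is then throttled not by a user-supplied number
but by how fast this cgroup's dirty pages can actually reach the (rate
limited) device.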
> The blkio-cgroup dirty throttling code can mainly go to page-writeback.c,
> while the memcg code will mainly go to fs-writeback.c
> (balance_dirty_pages() will also be involved, but that's actually a more
> trivial part).
>
> The correlations seem to be,
>
> - you can get the page tagging functionality from memcg, if doing async
>   write throttling at device level
>
> - the side effect of rate limiting by memcg's dirty pages throttling,
>   which is much less controllable than blkio-cgroup's rate limiting

Well, I thought memcg's per-cgroup dirty ratio and the IO controller's rate
limit would work together. memcg keeps track of each cgroup's share of the
page cache, and when the cache usage exceeds a certain percentage it asks
the flusher to send IO to the device, where the IO controller throttles
that IO. Now, if the rate limit of the cgroup is low, tasks in that cgroup
will be throttled for longer in balance_dirty_pages().

So throttling happens at two layers. The throttling in
balance_dirty_pages() is not actually dependent on user-supplied
parameters; it depends on the cgroup's share of the page cache and on the
effective IO rate the cgroup is getting. The real IO throttling happens at
the device level, which is dependent on the parameters supplied by the user
and which in turn indirectly decides how tasks are throttled in
balance_dirty_pages().

I have yet to look at your implementation of throttling, but keep in mind
that once the IO controller comes into the picture, the
throttling/smoothing mechanism also needs to be able to take direct writes
into account, and we should be able to use the same algorithms for
throttling READS.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html