On Fri, Mar 02, 2012 at 10:33:23AM -0500, Vivek Goyal wrote:
> On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote:
> > Committee members,
> >
> > Please consider inviting me to the Storage, Filesystem, & MM Summit. I
> > am working for one of the kernel teams in SUSE Labs focusing on Network
> > filesystems and block layer.
> >
> > Recently, I have been trying to solve the problem of "throttling
> > buffered writes" to make per-cgroup throttling of IO to the device
> > possible. Currently the block IO controller does not throttle buffered
> > writes. The writes have lost the submitter's context (I/O comes in
> > flusher thread's context) by the time they reach the block IO layer. I
> > looked at the past work: many folks have attempted to solve this problem
> > in the past years, but it remains unsolved so far.
> >
> > First, Andrea Righi tried to solve this by limiting the rate of async
> > writes at the time a task is generating dirty pages in the page cache.
> >
> > Next, Vivek Goyal tried to solve this by throttling writes at the time
> > they are entering the page cache.
> >
> > Both these approaches have limitations and were not considered for
> > merging.
> >
> > I have looked at the possibility of solving this at the filesystem level
> > but the problem with ext* filesystems is that a commit will commit the
> > whole transaction at once (which may contain writes from
> > processes belonging to more than one cgroup). Making filesystems cgroup
> > aware would need a redesign of the journalling layer itself.
> >
> > Dave Chinner thinks this problem should be solved, and is being solved,
> > in a different manner by making the bdi-flusher writeback cgroup aware.
> >
> > Greg Thelen's memcg writeback patchset (already proposed for the LSF/MM
> > summit this year) adds cgroup awareness to writeback. Some aspects of
> > this patchset could be borrowed for solving the problem of throttling
> > buffered writes.
> >
> > As I understand, the topic was discussed during the last Kernel Summit
> > as well, and the idea is to get the IO-less throttling patchset into the
> > kernel, then do per-memcg dirty memory limiting and add some memcg
> > awareness to writeback (Greg Thelen), and then, when these things settle
> > down, think about how to solve this problem, since no one really seems
> > to have a good answer to it.
> >
> > Having worked in the Linux filesystem/storage area for a few years now,
> > and having spent time understanding the various approaches tried and
> > looking at other feasible ways of solving this problem, I look forward
> > to participating in the summit and discussions.
> >
> > So, the topic I would like to discuss is solving the problem of
> > "throttling buffered writes". This could be considered for discussion
> > with the memcg writeback session if that topic has been allocated a slot.
> >
> > I'm aware that this is a late submission and my apologies for not making
> > it earlier. But I want to take my chances and see if it is still
> > possible.
>
> This is an interesting and complicated topic. As you mentioned, we have
> tried to solve it but nothing has been merged yet. Personally, I am still
> interested in having a discussion and seeing if we can come up with a way
> forward.

I'm interested, too. Here is my attempt at the problem a year ago:

blk-cgroup: async write IO controller ("buffered write" would be more precise)
https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d
https://lkml.org/lkml/2011/4/4/205

> Because filesystems are not cgroup aware, throttling IO below the
> filesystem has the danger of IO of faster cgroups being throttled behind
> a slower cgroup (journalling was one example and there could be others).
> Hence, I personally think that this problem should be solved at a higher
> layer, that is, when we are actually writing to the cache. That has the
> disadvantage of still seeing IO spikes at the device but I guess we live
> with that.
> Doing it at a higher layer also allows using the same logic for NFS;
> otherwise NFS buffered writes will continue to be a problem.

Totally agreed.

> In case of the memory controller it just becomes a write to memory issue,
> and I am not sure if the notion of dirty_ratio and dirty_bytes is enough
> or we need to rate limit the write to memory.

In a perfect world, the dirty size and rate may each be balanced around
their targets. Ideally we could independently limit dirty size in memcg
context and limit dirty rate in blkcg. If the user wants to control both
size and rate, they may put tasks into memcg as well as blkcg.

In reality the dirty size limit will impact the dirty rate, because memcg
needs to adjust its tasks' balanced dirty rate to drive the memcg dirty
size to the target, as does the global dirty target.

Compared to the global dirty size balancing, memcg suffers from a unique
problem: given N memcgs each running a dd task, each memcg's dirty size
will drop suddenly every (N/2) seconds, because the flusher writes out
the inodes in coarse time-split round-robin fashion, with a chunk size of
up to (bdi->write_bandwidth/2). That sudden drop of memcg dirty pages may
drive the dirty size far from the target; as a result the memcg will need
to adjust the dirty rate heavily in order to drive the dirty size back to
the target.

So the memcg dirty size balance may create large fluctuations in the
dirty rates, and even long stall times for the memcg tasks. What's more,
due to the uncontrollable way the flusher walks through the dirty pages
and how the dirty pages are distributed among the dirty inodes and
memcgs, the dirty rate will be impacted heavily by the workload and the
behavior of the flusher when enforcing the dirty size target. There is no
satisfactory solution to this so far.

Currently I'm trying to steer away from this and look into improving the
page reclaim so that it can work well with LRU lists where half of the
pages are dirty/writeback.
Then the 20% global dirty limit should be enough to serve most memcg
tasks well, taking into account the unevenly distributed dirty pages
among different memcgs and NUMA zones/nodes. There may still be a few
memcgs that need further dirty throttling, but they likely consist mainly
of heavy dirtiers and can afford less smoothness and longer delays.

In comparison, the dirty rate limit for buffered writes seems less
convoluted to me. It sure has its own problems, so we see several
solutions in circulation, each with its unique trade-offs. But at least
we have relatively simple solutions that work to their design goals.

> Anyway, ideas to have better control of write rates are welcome. We have
> seen issues where a virtual machine cloning operation is going on and
> we also want a small direct write to be on disk and it can take a long
> time with deadline. CFQ should still be fine as direct IO is synchronous
> but deadline treats all WRITEs the same way.
>
> Maybe deadline should be modified to differentiate between SYNC and ASYNC
> IO instead of READ/WRITE. Jens?

In general users definitely need higher priorities for SYNC writes. It
will also enable the "buffered write I/O controller" and "direct write
I/O controller" to co-exist well and operate independently this way: the
direct writes always enjoy higher priority than the flusher, but will be
rate limited by the already upstreamed blk-cgroup I/O controller. The
remaining disk bandwidth will be split among the buffered write tasks by
another I/O controller operating at the balance_dirty_pages() level.

Thanks,
Fengguang
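
For reference, the (N/2)-second drop interval mentioned above follows from
simple arithmetic: a chunk of (bdi->write_bandwidth/2) bytes takes about
0.5 seconds to write at write_bandwidth bytes per second, so with N inodes
served round-robin each one gets its turn roughly every N * 0.5 seconds.
Below is a minimal Python sketch of that sawtooth. It is only an
illustration under assumptions that are not from this thread: one dirty
inode per memcg, each dd task dirtying at exactly write_bandwidth/N, and
the flusher always writing a full chunk at exactly write_bandwidth. The
function names are hypothetical and do not correspond to kernel code.

```python
TURN_SECONDS = 0.5  # time to write a (write_bandwidth / 2) chunk at write_bandwidth


def drop_interval_seconds(n_memcgs):
    """Seconds between two writeout turns for the same memcg's inode."""
    return n_memcgs * TURN_SECONDS


def simulate_dirty_sizes(n_memcgs, write_bandwidth, n_turns):
    """Return the dirty size (bytes) of every memcg after each flusher turn.

    Assumed model: per turn, every memcg dirties write_bandwidth / (2 * N)
    bytes, and the flusher cleans one chunk from a single memcg's inode,
    picking memcgs in round-robin order.
    """
    chunk = write_bandwidth // 2
    dirtied_per_turn = write_bandwidth // (2 * n_memcgs)
    dirty = [chunk] * n_memcgs  # arbitrary starting point: one chunk each
    history = []
    for turn in range(n_turns):
        served = turn % n_memcgs
        for i in range(n_memcgs):
            dirty[i] += dirtied_per_turn
        # Only the served memcg sees its dirty size drop in this turn.
        dirty[served] -= min(chunk, dirty[served])
        history.append(list(dirty))
    return history


if __name__ == "__main__":
    n, bw = 4, 100 * 2**20  # 4 memcgs, 100 MiB/s (made-up numbers)
    print("drop interval:", drop_interval_seconds(n), "s")
    for turn, sizes in enumerate(simulate_dirty_sizes(n, bw, 2 * n)):
        print(turn * TURN_SECONDS, [s // 2**20 for s in sizes])
```

In this model each memcg's dirty size climbs slowly for (N-1) turns and
then loses a whole chunk at once, which is the kind of sudden drop that
forces a heavy dirty rate adjustment.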