On Fri, Mar 02, 2012 at 10:33:23AM -0500, Vivek Goyal wrote:
> On Fri, Mar 02, 2012 at 12:48:43PM +0530, Suresh Jayaraman wrote:
> > Committee members,
> >
> > Please consider inviting me to the Storage, Filesystem, & MM Summit. I am working for one of the kernel teams in SUSE Labs, focusing on network filesystems and the block layer.
> >
> > Recently, I have been trying to solve the problem of "throttling buffered writes" to make per-cgroup throttling of IO to the device possible. Currently the block IO controller does not throttle buffered writes. The writes have lost the submitter's context (the I/O arrives in the flusher thread's context) by the time they reach the block IO layer. I have looked at the past work; many folks have attempted to solve this problem over the years, but it remains unsolved so far.
> >
> > First, Andrea Righi tried to solve this by limiting the rate of async writes at the time a task is generating dirty pages in the page cache.
> >
> > Next, Vivek Goyal tried to solve this by throttling writes at the time they are entering the page cache.
> >
> > Both these approaches have limitations and were not considered for merging.
> >
> > I have looked at the possibility of solving this at the filesystem level, but the problem with the ext* filesystems is that a commit will commit the whole transaction at once (which may contain writes from processes belonging to more than one cgroup). Making filesystems cgroup aware would require a redesign of the journalling layer itself.
> >
> > Dave Chinner thinks this problem should be solved, and is being solved, in a different manner: by making the bdi-flusher writeback cgroup aware.
> >
> > Greg Thelen's memcg writeback patchset (already proposed for the LSF/MM summit this year) adds cgroup awareness to writeback. Some aspects of this patchset could be borrowed for solving the problem of throttling buffered writes.
> >
> > As I understand it, the topic was discussed during the last Kernel Summit as well, and the idea is to get the IO-less throttling patchset into the kernel, then do per-memcg dirty memory limiting and add some memcg awareness to writeback (Greg Thelen), and then, when these things settle down, think about how to solve this problem, since no one really seems to have a good answer to it.
> >
> > Having worked in the Linux filesystem/storage area for a few years now, and having spent time understanding the various approaches tried and other feasible ways of solving this problem, I look forward to participating in the summit and discussions.
> >
> > So, the topic I would like to discuss is solving the problem of "throttling buffered writes". This could be considered for discussion together with the memcg writeback session if that topic has been allocated a slot.
> >
> > I'm aware that this is a late submission and my apologies for not making it earlier. But I want to take my chances and see if it is still possible..
>
> This is an interesting and complicated topic. As you mentioned, we have tried to solve it but nothing has been merged yet. Personally, I am still interested in having a discussion and seeing if we can come up with a way forward.

I'm interested, too.
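To restate the core difficulty in code: the task that dirties the page cache is not the task that later submits the I/O, so per-task accounting done at submission time charges the flusher. A minimal userspace analogy (pthreads, not kernel code; the file name and all identifiers are made up for illustration):

/*
 * Userspace analogy only, not kernel code.  A "writer" thread dirties
 * memory (standing in for the page cache); a separate "flusher" thread
 * later issues the actual write(2).  Any per-task accounting done at
 * I/O submission time would charge the flusher, not the writer.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20)

static char page_cache[BUF_SIZE];	/* stand-in for dirty page cache */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int dirty;

static void *writer(void *arg)		/* imagine this task in blkcg A */
{
	pthread_mutex_lock(&lock);
	memset(page_cache, 'x', sizeof(page_cache));	/* dirty memory, no I/O yet */
	dirty = 1;
	pthread_mutex_unlock(&lock);
	return NULL;
}

static void *flusher(void *arg)		/* analogue of the flusher thread */
{
	int fd = open("/tmp/flusher-demo", O_CREAT | O_WRONLY | O_TRUNC, 0600);

	for (;;) {
		pthread_mutex_lock(&lock);
		if (dirty) {
			/* The I/O is submitted here, in the flusher's context;
			 * at this point the block layer no longer knows which
			 * task dirtied the data. */
			write(fd, page_cache, sizeof(page_cache));
			dirty = 0;
			pthread_mutex_unlock(&lock);
			break;
		}
		pthread_mutex_unlock(&lock);
		usleep(10000);
	}
	close(fd);
	printf("I/O issued by the flusher, not by the dirtying task\n");
	return NULL;
}

int main(void)
{
	pthread_t w, f;

	pthread_create(&f, NULL, flusher, NULL);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(w, NULL);
	pthread_join(f, NULL);
	return 0;
}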
Here is my attempt on the problem a year ago:

  blk-cgroup: async write IO controller ("buffered write" would be more precise)
  https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d
  https://lkml.org/lkml/2011/4/4/205

> Because filesystems are not cgroup aware, throttling IO below the filesystem has the danger of IO from faster cgroups being throttled behind slower cgroups (journalling was one example and there could be others). Hence, I personally think that this problem should be solved at a higher layer, that is, when we are actually writing to the cache. That has the disadvantage of still seeing IO spikes at the device, but I guess we live with that. Doing it at a higher layer also allows the same logic to be used for NFS; otherwise NFS buffered writes will continue to be a problem.

Totally agreed.

> In the case of the memory controller it just becomes a write-to-memory issue, and I am not sure whether the notion of dirty_ratio and dirty_bytes is enough or whether we need to rate limit the writes to memory.

In a perfect world, the dirty size and the dirty rate would each be balanced around their own targets. Ideally we could independently limit the dirty size in the memcg context and limit the dirty rate in blkcg. If the user wants to control both size and rate, he may put tasks into a memcg as well as a blkcg.

In reality the dirty size limit will impact the dirty rate, because a memcg needs to adjust its tasks' balanced dirty rates to drive the memcg dirty size towards the target, as does the global dirty target.

Compared to the global dirty size balancing, memcg suffers from a unique problem: given N memcgs each running a dd task, each memcg's dirty size will drop suddenly every (N/2) seconds. This is because the flusher writes out the inodes in a coarse, time-split round-robin fashion, with a chunk size of up to (bdi->write_bandwidth/2). That sudden drop of memcg dirty pages may drive the dirty size far from the target; as a result the dirty rate has to be adjusted heavily in order to drive the dirty size back to the target.

So the memcg dirty size balancing may create large fluctuations in the dirty rates, and even long stall times for the memcg tasks. What's more, due to the uncontrollable way the flusher walks through the dirty pages, and how the dirty pages are distributed among the dirty inodes and memcgs, the dirty rate will be impacted heavily by the workload and the behavior of the flusher when enforcing the dirty size target.
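To make the sawtooth concrete, here is a toy userspace simulation (not kernel code; the 100 MB/s bandwidth, N=8 and the bw/2 chunk are made-up parameters). Each memcg's dd dirties at its fair share while the flusher serves one inode at a time; after a short start-up transient, memcg 0's dirty size climbs for roughly N/2 seconds and then drops by roughly bw/2:

#include <stdio.h>

int main(void)
{
	enum { N = 8 };			/* memcgs, one dd task each          */
	const double bw = 100.0;	/* disk write bandwidth, MB/s        */
	const double chunk = bw / 2;	/* flusher chunk per inode visit     */
	const double rate = bw / N;	/* each dd's fair share, MB/s        */
	const double dt = 0.1;		/* simulation time step, seconds     */

	double dirty[N] = { 0 };	/* per-memcg dirty size, MB          */
	int cur = 0;			/* inode currently being flushed     */
	double done = 0;		/* progress within the current chunk */

	for (int step = 0; step < (int)(3 * N / dt); step++) {
		/* every dd keeps dirtying at its fair share */
		for (int i = 0; i < N; i++)
			dirty[i] += rate * dt;

		/* the flusher cleans only the inode it is currently serving */
		double io = bw * dt;
		if (io > dirty[cur])
			io = dirty[cur];
		dirty[cur] -= io;
		done += io;
		if (done >= chunk) {	/* chunk finished, move to the next inode */
			done = 0;
			cur = (cur + 1) % N;
		}

		if (step % 5 == 0)	/* print memcg 0 every 0.5 seconds */
			printf("t=%5.1fs  memcg0 dirty = %6.1f MB\n",
			       step * dt, dirty[0]);
	}
	return 0;
}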
There is no satisfactory solution to this so far. Currently I'm trying to steer away from it and look into improving page reclaim so that it works well with LRU lists where half of the pages are dirty or under writeback. Then the 20% global dirty limit should be enough to serve most memcg tasks well, taking into account the unevenly distributed dirty pages among the different memcgs and NUMA zones/nodes. There may still be a few memcgs that need further dirty throttling, but they will likely consist mainly of heavy dirtiers that can afford less smoothness and longer delays.

In comparison, a dirty rate limit for buffered writes seems less convoluted to me. It surely has its own problems, which is why we see several solutions in circulation, each with its own trade-offs. But at least we have relatively simple solutions that work to their design goals.

> Anyway, ideas to have better control of write rates are welcome. We have seen issues where a virtual machine cloning operation is going on while we also want a small direct write to hit the disk, and that can take a long time with deadline. CFQ should still be fine as direct IO is synchronous, but deadline treats all WRITEs the same way.
>
> Maybe deadline should be modified to differentiate between SYNC and ASYNC IO instead of READ/WRITE. Jens?

In general users definitely need higher priorities for SYNC writes. It will also enable the "buffered write I/O controller" and the "direct write I/O controller" to co-exist well and operate independently this way: the direct writes always enjoy higher priority than the flusher, but will be rate limited by the already upstreamed blk-cgroup I/O controller. The remaining disk bandwidth will be split among the buffered write tasks by another I/O controller operating at the balance_dirty_pages() level.

Thanks,
Fengguang
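PS: for concreteness, a minimal userspace sketch of what rate limiting buffered writes at the balance_dirty_pages() level amounts to (not the actual kernel implementation; the 4 MB/s limit, the struct and the helper names are made up for illustration). A task sleeps after dirtying pages so that its long-run dirtying rate stays at or below its configured share:

/* Userspace sketch only: throttle a task's dirtying rate by sleeping
 * in proportion to the pages it has dirtied, in the spirit of
 * balance_dirty_pages().  All names and numbers are made up. */
#include <stdio.h>
#include <time.h>

#define PAGE_SIZE_BYTES	4096UL

struct dirty_throttle {
	unsigned long rate;	/* allowed dirty rate, bytes/second */
	unsigned long dirtied;	/* bytes dirtied since the last pause */
};

/* Called after a task dirties @pages pages in the page cache. */
static void throttle_dirty_pages(struct dirty_throttle *dt, unsigned long pages)
{
	dt->dirtied += pages * PAGE_SIZE_BYTES;

	/* batch up small writes so the pauses stay meaningful (~100ms) */
	if (dt->dirtied < dt->rate / 10)
		return;

	/* sleep so that dirtied / elapsed time ~= allowed rate */
	double pause = (double)dt->dirtied / dt->rate;
	struct timespec ts = {
		.tv_sec = (time_t)pause,
		.tv_nsec = (long)((pause - (time_t)pause) * 1e9),
	};
	nanosleep(&ts, NULL);
	dt->dirtied = 0;
}

int main(void)
{
	/* pretend this task's cgroup is allowed to dirty 4 MB/s */
	struct dirty_throttle dt = { .rate = 4 << 20, .dirtied = 0 };

	for (int i = 0; i < 1000; i++) {
		/* ... copy user data into the page cache here ... */
		throttle_dirty_pages(&dt, 8);	/* dirtied 8 pages (32 KB) */
	}
	printf("dirtied about 32 MB at no more than ~4 MB/s\n");
	return 0;
}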