On Wed, Apr 04, 2012 at 02:35:29PM -0400, Vivek Goyal wrote: > On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote: > > [..] > > The sweet split point would be for balance_dirty_pages() to do cgroup > > aware buffered write throttling and leave other IOs to the current > > blkcg. For this to work well as a total solution for end users, I hope > > we can cooperate and figure out ways for the two throttling entities > > to work well with each other. > > Throttling read + direct IO, higher up has few issues too. Users will Yeah I have a bit worry about high layer throttling, too. Anyway here are the ideas. > not like that a task got blocked as it tried to submit a read from a > throttled group. That's not the same issue I worried about :) Throttling is about inserting small sleep/waits into selected points. For reads, the ideal sleep point is immediately after readahead IO is summited, at the end of __do_page_cache_readahead(). The same should be applicable to direct IO. > Current async behavior works well where we queue up the > bio from the task in throttled group and let task do other things. Same > is true for AIO where we would not like to block in bio submission. For AIO, we'll need to delay the IO completion notification or status update, which may involve computing some delay time and delay the calls to io_complete() with the help of some delayed work queue. There may be more issues to deal with as I didn't look into aio.c carefully. The thing worried me is that in the proportional throttling case, the high level throttling works on the *estimated* task_ratelimit = disk_bandwidth / N, where N is the number of read IO tasks. When N suddenly changes from 2 to 1, it may take 1 second for the estimated task_ratelimit to adapt from disk_bandwidth/2 up to disk_bandwidth, during which time the disk won't get 100% utilized because of the temporally over-throttling of the remaining IO task. This is not a problem when throttling at the block/cfq layer, since it has the full information of pending requests and should not depend on such estimations. The workaround I can think of, is to put the throttled task into a wait queue, and let block layer wake up the waiters when the IO queue runs empty. This should be able to avoid most disk idle time. Thanks, Fengguang _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers