On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:
> On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote:
> > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> > It also means you will have to handle the case of a cgroup over a
> > throttle limit and no inodes on its dirty list. It's not a case of
> > "probably can live with" the resultant mess; the mess will occur,
> > and so handling it needs to be designed in from the start.
>
> This behavior can happen due to shared page accounting. One possible
> way to mitigate this problem is to traverse the LRU list of pages of
> the memcg and find an inode to do the writeback.

Page-LRU-ordered writeback is something we need to avoid. It causes
havoc with IO and allocation patterns. Also, how expensive is such a
walk? If it's a common operation, then it's a non-starter for the
generic writeback code.

BTW, how is "shared page accounting" different to the shared dirty
inode case we've been discussing?

> After yesterday's discussion it looked like people agreed that to
> begin with we keep it simple and maintain the notion of one inode on
> one memcg list. So instead of the inode being on the global bdi
> dirty list, it will be on a per-memcg, per-bdi dirty list.

Good to hear.

> > how metadata IO is going to be handled by
> > IO controllers,
>
> The IO controller provides two mechanisms:
>
> - IO throttling (bytes_per_second, io_per_second interface)
> - Proportional weight disk sharing
>
> In the case of proportional weight disk sharing, we don't run into
> issues of priority inversion, and metadata handling should not be a
> concern.

Though metadata IO will affect how much bandwidth/iops is available
for applications to use.

> For the throttling case, apart from metadata, I found that with
> simple throttling of data I ran into issues with journalling with
> ext4 mounted in ordered mode.
> So it was suggested that WRITE IO throttling should not be done at
> the device level; instead, try to do it in higher layers, possibly
> balance_dirty_pages(), and throttle the process early.

The problem with doing it at the page cache entry level is that cache
hits then get throttled. It's not really an IO controller at that
point, and the impact on application performance could be huge (i.e.
MB/s instead of GB/s).

> So yes, I agree that a little more documentation and more clarity on
> this would be good. All this cgroup-aware writeback is primarily
> being done for CFQ's proportional disk sharing at the moment.
>
> > what kswapd is going to do writeback when the pages
> > it's trying to writeback during a critical low memory event belong
> > to a cgroup that is throttled at the IO level, etc.
>
> Throttling will move up, so kswapd will not be throttled. Even
> today, kswapd is part of the root group and we do not suggest
> throttling the root group.

So once again you have the problem of writeback from kswapd (which is
ugly to begin with) affecting all the groups. Given kswapd likes to
issue what is effectively random IO, this could have a devastating
effect on everything else....

> For the case of proportional disk sharing, we will probably account
> IO to the respective cgroups (pages submitted by kswapd), and that
> should flush to disk fairly fast and should not block for a long
> time, as it is a work-conserving mechanism.

Well, it depends. I can still see how, with proportional IO, kswapd
would get slowed cleaning dirty pages on one memcg when there are
clean pages in another memcg that it could reclaim without doing any
IO. i.e. it has the potential to slow down memory reclaim
significantly. (Note, I'm assuming proportional IO doesn't mean "no
throttling", it just means there is a much lower delay on IO issue.)

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx