On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote:
> On Mon, Apr 11, 2011 at 11:36:30AM +1000, Dave Chinner wrote:
>
> [..]
> > > > > > how metadata IO is going to be handled by
> > > > > > IO controllers,
> > > > >
> > > > > So IO controller provides two mechanisms.
> > > > >
> > > > > - IO throttling(bytes_per_second, io_per_second interface)
> > > > > - Proportional weight disk sharing
> > > > >
> > > > > In case of proportional weight disk sharing, we don't run into issues of
> > > > > priority inversion and metadata handing should not be a concern.
> > > >
> > > > Though metadata IO will affect how much bandwidth/iops is available
> > > > for applications to use.
> > >
> > > I think meta data IO will be accounted to the process submitting the meta
> > > data IO. (IO tracking stuff will be used only for page cache pages during
> > > page dirtying time). So yes, the process doing meta data IO will be
> > > charged for it.
> > >
> > > I think I am missing something here and not understanding your concern
> > > exactly here.
> >
> > XFS can issue thousands of delayed metadata write IO per second from
> > it's writeback threads when it needs to (e.g. tail pushing the
> > journal). Completely unthrottled due to the context they are issued
> > from(*) and can basically consume all the disk iops and bandwidth
> > capacity for seconds at a time.
> >
> > Also, XFS doesn't use the page cache for metadata buffers anymore
> > so page cache accounting, throttling and reclaim mechanisms
> > are never going to work for controlling XFS metadata IO
> >
> >
> > (*) It'll be IO issued by workqueues rather than threads RSN:
> >
> > http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39
> >
> > And this will become _much_ more common in the not-to-distant
> > future.
> > So context passing between threads and to workqueues is
> > something you need to think about sooner rather than later if you
> > want metadata IO to be throttled in any way....
>
> Ok,
>
> So this seems to the similar case as WRITE traffic from flusher threads
> which can disrupt IO on end device even if we have done throttling in
> balance_dirty_pages().
>
> How about doing throttling at two layers. All the data throttling is
> done in higher layers and then also retain the mechanism of throttling
> at end device. That way an admin can put a overall limit on such
> common write traffic. (XFS meta data coming from workqueues, flusher
> thread, kswapd etc).
>
> Anyway, we can't attribute this IO to per process context/group otherwise
> most likely something will get serialized in higher layers.
>
> Right now I am speaking purely from IO throttling point of view and not
> even thinking about CFQ and IO tracking stuff.
>
> This increases the complexity in IO cgroup interface as now we see to have
> four combinations.
>
>   Global Throttling
>       Throttling at lower layers
>       Throttling at higher layers.
>
>   Per device throttling
>       Throttling at lower layers
>       Throttling at higher layers.

Dave,

I wrote the above, but I am not fond of ending up with four combinations
myself. I want to limit it to two: per device throttling or global
throttling.

Here are some more thoughts in general about both the throttling policy
and the proportional policy of the IO controller. For the throttling
policy, I am primarily concerned with how to avoid file system
serialization issues.

Proportional IO (CFQ)
---------------------
- Make writeback cgroup aware. Kernel threads (flusher) which are cgroup
  aware can be marked with a task flag (GROUP_AWARE). If a cgroup aware
  kernel thread throws IO at CFQ, then the IO is accounted to the cgroup
  of the task that originally dirtied the page. Otherwise we use the task
  context to account the IO to.
  So any IO submitted by flusher threads will go to the respective
  cgroups, and a higher weight cgroup should be able to do more WRITES.

  IO submitted by other kernel threads like kjournald, XFS async metadata
  submission, kswapd etc. all goes to the thread context, and that is the
  root group.

- If kswapd is a concern, then either make kswapd cgroup aware or let
  kswapd use the cgroup aware flusher to do IO (Dave Chinner's idea).

Open Issues
-----------
- We do not get isolation for meta data IO. In a virtualized setup, to
  achieve stronger isolation, do not use the host filesystem. Export
  block devices into guests.

IO throttling
-------------

READS
-----
- Do not throttle meta data IO. The filesystem needs to mark READ
  metadata IO so that we can avoid throttling it. This way ordered
  filesystems will not get serialized behind a throttled read in a slow
  group.

  Maybe one can account meta data reads to a group and use that to
  throttle data IO in the same cgroup as a compensation.

WRITES
------
- Throttle tasks. Do not throttle bios. That means that when a task
  submits a direct write, let it go to disk. Do the accounting, and if
  the task is exceeding the IO rate, make it sleep. Something similar to
  balance_dirty_pages().

  That way, direct WRITES should not run into any serialization issues in
  ordered mode.

  We can continue to use the blk_throtl_bio() hook in
  generic_make_request().

- For buffered WRITES, design a throttling hook similar to
  balance_dirty_pages() and throttle tasks according to the rules while
  they are dirtying page cache.

- Do not throttle buffered writes again at the end device, as these have
  already been throttled while writing to the page cache. Also,
  throttling WRITES at the end device will lead to serialization issues
  with file systems in ordered mode.

- The cgroup of an IO is always attributed to the submitting thread. That
  way all meta data writes will go to the root cgroup and remain
  unthrottled. If one is too concerned about lots of meta data IO, then
  one can probably put a throttling rule in the root cgroup.
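
To make the "throttle tasks, not bios" idea in the WRITES list a bit more
concrete, here is a minimal user-space sketch. It is only an illustration
under my own assumptions, not kernel code and not how blk-throttle is
implemented: struct io_group, io_group_init() and io_group_charge() are
hypothetical names, time is passed in as plain milliseconds, and bursts
are not allowed. The write is charged after it has already been sent
down, and the function only tells the caller how long the submitting task
should sleep.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-cgroup write budget; not an existing kernel structure. */
struct io_group {
	uint64_t bps_limit;     /* allowed bytes per second */
	uint64_t window_start;  /* start of the accounting window, in ms */
	uint64_t bytes_charged; /* bytes charged since window_start */
};

static void io_group_init(struct io_group *g, uint64_t bps_limit,
			  uint64_t now_ms)
{
	assert(bps_limit > 0);
	g->bps_limit = bps_limit;
	g->window_start = now_ms;
	g->bytes_charged = 0;
}

/*
 * Charge a write that has already been dispatched to the device and
 * return how many milliseconds the submitting task should sleep to fall
 * back to its configured rate. The bio itself is never held back, so
 * nothing in the filesystem can queue up behind it.
 */
static uint64_t io_group_charge(struct io_group *g, uint64_t bytes,
				uint64_t now_ms)
{
	uint64_t elapsed_ms = now_ms - g->window_start;
	/* Time the group needs to "pay for" what it was charged so far. */
	uint64_t owed_ms = g->bytes_charged * 1000 / g->bps_limit;

	/*
	 * Old window is fully paid off: start a fresh one so that idle
	 * time does not build up as unlimited burst credit.
	 */
	if (owed_ms <= elapsed_ms) {
		g->window_start = now_ms;
		g->bytes_charged = 0;
		elapsed_ms = 0;
	}

	g->bytes_charged += bytes;
	owed_ms = g->bytes_charged * 1000 / g->bps_limit;

	return owed_ms > elapsed_ms ? owed_ms - elapsed_ms : 0;
}
```

With a limit of 1 MiB/s, a task that submits a 1 MiB write at time 0 is
told to sleep 1000 ms. If it ignores that and submits another 512 KiB at
500 ms, it again gets 1000 ms, because it now owes 1500 ms worth of IO
and only 500 ms have passed. In a real implementation the sleep would
happen in the task's own context after submission, similar in spirit to
balance_dirty_pages().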

Open Issues
-----------
- IO spikes at end devices

  Because buffered writes are controlled at page dirtying time, we can
  have a spike of IO later at the end device when the flusher thread
  decides to do writeback.

  I am not sure how to solve this issue. Part of the problem can be
  handled by using a per cgroup dirty ratio and keeping each cgroup's
  ratio low so that we don't build up huge dirty caches. This can lead to
  a performance drop for applications. So this is a performance vs
  isolation trade off, and the user chooses one.

  This issue exists in a virtualized environment only if the host file
  system is used. The best way to achieve maximum isolation would be to
  export block devices into the guest and then perform throttling per
  block device.

- Poor isolation for meta data.

  We can't account and throttle meta data in each cgroup, otherwise we
  would again run into file system serialization issues in ordered mode.
  So this is a trade off of using file systems. You primarily get
  throttling for data IO and not for meta data IO.

  Again, export block devices into virtual machines, create file systems
  on them, and do not use the host filesystem, and one can achieve very
  good isolation.

Thoughts?

Thanks
Vivek