On Mon, Apr 11, 2011 at 11:36:30AM +1000, Dave Chinner wrote:

[..]

> > > > > how metadata IO is going to be handled by
> > > > > IO controllers,
> > > >
> > > > So IO controller provides two mechanisms.
> > > >
> > > > - IO throttling (bytes_per_second, io_per_second interface)
> > > > - Proportional weight disk sharing
> > > >
> > > > In case of proportional weight disk sharing, we don't run into issues of
> > > > priority inversion, and metadata handling should not be a concern.
> > >
> > > Though metadata IO will affect how much bandwidth/iops is available
> > > for applications to use.
> >
> > I think metadata IO will be accounted to the process submitting the
> > metadata IO. (The IO tracking stuff will be used only for page cache
> > pages at page dirtying time.) So yes, the process doing metadata IO
> > will be charged for it.
> >
> > I think I am missing something here and not understanding your concern
> > exactly.
>
> XFS can issue thousands of delayed metadata write IOs per second from
> its writeback threads when it needs to (e.g. tail pushing the
> journal). Completely unthrottled due to the context they are issued
> from (*), and can basically consume all the disk iops and bandwidth
> capacity for seconds at a time.
>
> Also, XFS doesn't use the page cache for metadata buffers anymore,
> so page cache accounting, throttling and reclaim mechanisms
> are never going to work for controlling XFS metadata IO.
>
> (*) It'll be IO issued by workqueues rather than threads RSN:
>
> http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39
>
> And this will become _much_ more common in the not-too-distant
> future. So context passing between threads and to workqueues is
> something you need to think about sooner rather than later if you
> want metadata IO to be throttled in any way....

Ok, so this seems to be a case similar to the WRITE traffic from flusher
threads, which can disrupt IO on the end device even if we have done
throttling in balance_dirty_pages().

How about doing throttling at two layers? All the data throttling is done
in the higher layers, and we also retain the mechanism of throttling at
the end device. That way an admin can put an overall limit on such common
write traffic (XFS metadata coming from workqueues, flusher threads,
kswapd etc.). Anyway, we can't attribute this IO to a per-process
context/group, otherwise most likely something will get serialized in the
higher layers.

Right now I am speaking purely from an IO throttling point of view and am
not even thinking about CFQ and the IO tracking stuff.

This increases the complexity of the IO cgroup interface, as now we seem
to have four combinations.

  Global throttling
	Throttling at lower layers
	Throttling at higher layers

  Per-device throttling
	Throttling at lower layers
	Throttling at higher layers

Thanks
Vivek
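To make the above a bit more concrete, here is a rough sketch of what the
four combinations could look like as cgroup knobs, modeled on the existing
per-device blkio.throttle.* format. Only the blkio.throttle.*_device files
shown below exist today; the "upper" and "global" names are made up purely
for illustration and are not a proposal for the final interface:

  # mount the blkio controller and create a test group
  # (the mount point is just an example)
  mount -t cgroup -o blkio none /cgroup/blkio
  mkdir /cgroup/blkio/test1

  # existing interface: per-device throttling enforced at the block layer,
  # format is "<major>:<minor> <limit>"
  echo "8:16 1048576" > /cgroup/blkio/test1/blkio.throttle.write_bps_device
  echo "8:16 100"     > /cgroup/blkio/test1/blkio.throttle.write_iops_device

  # hypothetical: per-device limit enforced in the higher layers
  # (e.g. in balance_dirty_pages()); name made up for illustration
  echo "8:16 1048576" > /cgroup/blkio/test1/blkio.throttle.upper.write_bps_device

  # hypothetical: global limits (all devices), enforced at the lower and
  # at the higher layer respectively; names made up for illustration
  echo "2097152" > /cgroup/blkio/test1/blkio.throttle.global.write_bps
  echo "2097152" > /cgroup/blkio/test1/blkio.throttle.upper.global.write_bps

The point is only that "global vs. per-device" and "lower vs. higher layer"
are orthogonal axes, so every limit potentially needs both variants.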