On Thu, Apr 21, 2011 at 02:44:33AM +0800, Vivek Goyal wrote:
> On Wed, Apr 20, 2011 at 09:16:38AM +0800, Wu Fengguang wrote:
> > On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote:
> > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > > > [snip]
> > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounted
> > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > > > not be done at device level; instead try to do it in higher layers,
> > > > > > > > > > > > > > possibly balance_dirty_pages(), and throttle the process early.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > > > cache hits then get throttled. It's not really an IO controller at
> > > > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > > >
> > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > > > if the page being asked for is in cache or not and charge for IO accordingly?
> > > > > > > > > > >
> > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > > >
> > > > > > > > > > I'm confused. Where is the throttling at cache hits?
> > > > > > > > > >
> > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > > > and __do_fault() will explicitly call
> > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > > >
> > > > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > > > throttling also into higher layers. Then we don't want to throttle reads
> > > > > > > > > which are already in cache.
> > > > > > > > >
> > > > > > > > > Currently the throttling hook is in generic_make_request() and it kicks in
> > > > > > > > > only if data is not present in the page cache and actual disk IO is initiated.
> > > > > > > >
> > > > > > > > You can always throttle in readpage(). It's not much higher than
> > > > > > > > generic_make_request(), but basically as high as it can get, I suspect
> > > > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > > > faults, splice, read, ...).
> > > > > > >
> > > > > > > Yep, I was wondering what I'd gain by moving READ throttling up.
> > > > > > > The only thing generic_make_request() does not catch is network file
> > > > > > > systems. I think for that I can introduce another hook, say in NFS, and
> > > > > > > I might be all set.
> > > > > >
> > > > > > Basically all data reads go through the readahead layer, and the
> > > > > > __do_page_cache_readahead() function.
> > > > > >
> > > > > > Just one more option for your tradeoffs :)
> > > > >
> > > > > But this does not cover direct IO?
> > > >
> > > > Yes, sorry!
> > > >
> > > > > But I guess if I split the hook into two parts (one in the direct IO path
> > > > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > > > to mark meta data READS. I will look into it.
> > > >
> > > > Right, and the hooks should be trivial to add.
> > > >
> > > > The readahead code is typically invoked in three ways:
> > > >
> > > > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > > >
> > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> > > >   => page_cache_async_readahead()
> > > >
> > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > > >
> > > > ext3/4 also call into readahead on readdir().
> > >
> > > So this will be called even for meta data READS. Then there is no
> > > advantage in moving the throttle hook out of generic_make_request()?
> > > Instead what I will need is to ask file systems to mark meta data
> > > IO so that I can avoid throttling it.
> >
> > Do you want to avoid meta data itself, or to avoid overall performance
> > being impacted as a result of meta data read throttling?
>
> I wanted to avoid throttling metadata because it might lead to reduced
> overall performance due to dependencies in the file system layer.

You can get meta data "throttling" and performance at the same time.
See the ideas below.

> > Either way, you have the freedom to test whether the passed filp is a
> > normal file or a directory "file", and do conditional throttling.
>
> Ok, will look into it. That will probably take care of READS. What
> about WRITES and meta data? Is it safe to assume that any meta data
> write will come in some journalling thread context and not in user
> process context?

It's very possible to throttle meta data READS/WRITES, as long as they
can be attributed to the original task (assuming task oriented
throttling instead of bio/request oriented).

The trick is to separate the concepts of THROTTLING and ACCOUNTING.
You can ACCOUNT data and meta data reads/writes to the right task, and
only THROTTLE the task when it's doing data reads/writes.
FYI I played the same trick for balance_dirty_pages_ratelimited() for
another reason: _accurate_ accounting of dirtied pages.

That trick should play well with most applications that do interleaved
data and meta data reads/writes. For the special case of "find", which
does pure meta data reads, we can still throttle it by playing another
trick: THROTTLE meta data reads/writes with a much higher threshold
than that of data. So normal applications will almost always be
throttled at data accesses, while "find" will be throttled at meta data
accesses.

For a real example of how it works, you can check this patch (plus the
attached one):

	writeback: IO-less balance_dirty_pages()
	http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556

There, tsk->nr_dirtied does the dirty ACCOUNTING and tsk->nr_dirtied_pause
is the threshold for THROTTLING. When

	tsk->nr_dirtied > tsk->nr_dirtied_pause

the task will voluntarily enter balance_dirty_pages() to take a nap
(pause time will be proportional to tsk->nr_dirtied), and when
finished, start a new account-and-throttle period by resetting
tsk->nr_dirtied and possibly adjusting tsk->nr_dirtied_pause for a more
reasonable pause time at the next sleep.

BTW, I'd like to advocate a balance_dirty_pages() based IO controller :)
As you may have noticed, it's not all that hard: the main functions
blkcg_update_bandwidth()/blkcg_update_dirty_ratelimit() can fit nicely
in one screen!

	writeback: async write IO controllers
	http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=1a58ad99ce1f6a9df6618a4b92fa4859cc3e7e90;hp=5b6fcb3125ea52ff04a2fad27a51307842deb1a0

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html