Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

Vivek Goyal <vgoyal@xxxxxxxxxx> · Wed, 20 Apr 2011 14:44:33 -0400

On Wed, Apr 20, 2011 at 09:16:38AM +0800, Wu Fengguang wrote:
> On Wed, Apr 20, 2011 at 01:05:43AM +0800, Vivek Goyal wrote:
> > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > > [snip]
> > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounuted
> > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > > 
> > > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > > cache hits then get throttled. It's not really a an IO controller at
> > > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > > 
> > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > > 
> > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > > 
> > > > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > > > 
> > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > > and __do_fault() will explicitly call
> > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > > 
> > > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > > which are already in cache.
> > > > > > > > 
> > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > > faults, splice, read, ...).
> > > > > > 
> > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > > The only thing generic_make_request() does not catch is network file
> > > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > > I might be all set.
> > > > > 
> > > > > Basically all data reads go through the readahead layer, and the
> > > > > __do_page_cache_readahead() function.
> > > > > 
> > > > > Just one more option for your tradeoffs :)
> > > > 
> > > > But this does not cover direct IO?
> > > 
> > > Yes, sorry!
> > > 
> > > > But I guess if I split the hook into two parts (one in direct IO path
> > > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > > to mark meta data READS. I will look into it.
> > > 
> > > Right, and the hooks should be trivial to add.
> > > 
> > > The readahead code is typically invoked in three ways:
> > > 
> > > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > > 
> > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> > >   => page_cache_async_readahead()
> > > 
> > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > > 
> > > ext3/4 also call into readahead on readdir().
> > 
> > So this will be called for even meta data READS. Then there is no
> > advantage of moving the throttle hook out of generic_make_request()?
> > Instead what I will need is that ask file systems to mark meta data
> > IO so that I can avoid throttling.
> 
> Do you want to avoid meta data itself, or to avoid overall performance
> being impacted as a result of meta data read throttling?

I wanted to avoid throttling metadata beacause it might lead to reduced
overall performance due to dependencies in file system layer.

> 
> Either way, you have the freedom to test whether the passed filp is a
> normal file or a directory "file", and do conditional throttling.

Ok, will look into it. That will probably take care of READS. What 
about WRITES and meta data. Is it safe to assume that any meta data
write will come in some jounalling thread context and not in user 
process context?

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html