Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

Jan Kara <jack@xxxxxxx> · Wed, 20 Apr 2011 16:42:36 +0200



On Wed 20-04-11 19:19:57, Wu Fengguang wrote:
> On Wed, Apr 20, 2011 at 06:56:06PM +0800, Jan Kara wrote:
> > On Wed 20-04-11 09:21:31, Wu Fengguang wrote:
> > > On Wed, Apr 20, 2011 at 04:58:21AM +0800, Jan Kara wrote:
> > > > On Tue 19-04-11 13:05:43, Vivek Goyal wrote:
> > > > > On Wed, Apr 20, 2011 at 12:58:38AM +0800, Wu Fengguang wrote:
> > > > > > On Tue, Apr 19, 2011 at 11:31:06PM +0800, Vivek Goyal wrote:
> > > > > > > On Tue, Apr 19, 2011 at 11:22:40PM +0800, Wu Fengguang wrote:
> > > > > > > > On Tue, Apr 19, 2011 at 11:11:11PM +0800, Vivek Goyal wrote:
> > > > > > > > > On Tue, Apr 19, 2011 at 04:48:32PM +0200, Jan Kara wrote:
> > > > > > > > > > On Tue 19-04-11 10:34:23, Vivek Goyal wrote:
> > > > > > > > > > > On Tue, Apr 19, 2011 at 10:17:17PM +0800, Wu Fengguang wrote:
> > > > > > > > > > > > [snip]
> > > > > > > > > > > > > > > > For throttling case, apart from metadata, I found that with simple
> > > > > > > > > > > > > > > > > throttling of data I ran into issues with journalling with ext4 mounted
> > > > > > > > > > > > > > > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > > > > > > > > > > > > > > not be done at device level instead try to do it in higher layers,
> > > > > > > > > > > > > > > > possibly balance_dirty_pages() and throttle process early.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > The problem with doing it at the page cache entry level is that
> > > > > > > > > > > > > > > > cache hits then get throttled. It's not really an IO controller at
> > > > > > > > > > > > > > > that point, and the impact on application performance could be huge
> > > > > > > > > > > > > > > (i.e. MB/s instead of GB/s).
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Agreed that throttling cache hits is not a good idea. Can we determine
> > > > > > > > > > > > > > if page being asked for is in cache or not and charge for IO accordingly.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > You'd need hooks in find_or_create_page(), though you have no
> > > > > > > > > > > > > context of whether a read or a write is in progress at that point.
> > > > > > > > > > > > 
> > > > > > > > > > > > I'm confused.  Where is the throttling at cache hits?
> > > > > > > > > > > > 
> > > > > > > > > > > > The balance_dirty_pages() throttling kicks in at write() syscall and
> > > > > > > > > > > > page fault time. For example, generic_perform_write(), do_wp_page()
> > > > > > > > > > > > and __do_fault() will explicitly call
> > > > > > > > > > > > balance_dirty_pages_ratelimited() to do the write throttling.
> > > > > > > > > > > 
> > > > > > > > > > > This comment was in the context of what if we move block IO controller read
> > > > > > > > > > > throttling also in higher layers. Then we don't want to throttle reads
> > > > > > > > > > > which are already in cache.
> > > > > > > > > > > 
> > > > > > > > > > > Currently throttling hook is in generic_make_request() and it kicks in
> > > > > > > > > > > only if data is not present in page cache and actual disk IO is initiated.
> > > > > > > > > >   You can always throttle in readpage(). It's not much higher than
> > > > > > > > > > generic_make_request() but basically as high as it can get I suspect
> > > > > > > > > > (otherwise you'd have to deal with lots of different code paths like page
> > > > > > > > > > faults, splice, read, ...).
> > > > > > > > > 
> > > > > > > > > Yep, I was thinking that what do I gain by moving READ throttling up. 
> > > > > > > > > The only thing generic_make_request() does not catch is network file
> > > > > > > > > systems. I think for that I can introduce another hook say in NFS and
> > > > > > > > > I might be all set.
> > > > > > > > 
> > > > > > > > Basically all data reads go through the readahead layer, and the
> > > > > > > > __do_page_cache_readahead() function.
> > > > > > > > 
> > > > > > > > Just one more option for your tradeoffs :)
> > > > > > > 
> > > > > > > But this does not cover direct IO?
> > > > > > 
> > > > > > Yes, sorry!
> > > > > > 
> > > > > > > But I guess if I split the hook into two parts (one in direct IO path
> > > > > > > and one in __do_page_cache_readahead()), then filesystems don't have
> > > > > > > to mark meta data READS. I will look into it.
> > > > > > 
> > > > > > Right, and the hooks should be trivial to add.
> > > > > > 
> > > > > > The readahead code is typically invoked in three ways:
> > > > > > 
> > > > > > - sync readahead, on page cache miss, => page_cache_sync_readahead()
> > > > > > 
> > > > > > - async readahead, on hitting PG_readahead (tagged on one page per readahead window),
> > > > > >   => page_cache_async_readahead()
> > > > > > 
> > > > > > - user space readahead, fadvise(WILLNEED), => force_page_cache_readahead()
> > > > > > 
> > > > > > ext3/4 also call into readahead on readdir().
> > > > > 
> > > > > So this will be called for even meta data READS. Then there is no
> > > > > advantage of moving the throttle hook out of generic_make_request()?
> > > >   No, generally it won't. I think Fengguang was wrong - only ext2 carries
> > > > directories in page cache and thus uses readahead code. All other
> > > > filesystems handle directories specially and don't use readpage for them.
> > > 
> > > So ext2 is implicitly using readahead? ext3/4 behave differently in that
> > > ext4_readdir() has an explicit call to page_cache_sync_readahead(),
> > > passing the blockdev mapping as the page cache container.
> >   Yes, ext2 implicitly uses readahead because it uses read_mapping_page()
> > for directory inodes. I forgot that ext3/4 call
> > page_cache_sync_readahead() so you were right that they actually use it for
> > the device inode. I'm sorry for the noise.
> 
> Never mind.  However I cannot find readahead calls in the
> read_mapping_page() call chain. ext2 readdir() may not be doing
> readahead at all...
> 
>         read_mapping_page()
>           read_cache_page()
>             read_cache_page_async()
>               do_read_cache_page()
>                 __read_cache_page()
  Right, I've now checked the real code and it would have to use
read_cache_pages() to have some readahead. I'm not completely sure where
I got the idea that ext2 performs directory readahead - some papers about
ext2 I found on the Internet say so and I believe Andrew mentioned it as
well. But I cannot find a kernel where this would happen... So thanks for
correcting me :).
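
To illustrate the point made earlier in the thread, that a hook at the
readpage() level is only reached on a page cache miss and so never
throttles cache hits, here is a minimal user-space sketch. It is not kernel
code: struct toy_cache, toy_read(), toy_readpage() and the io_charged
counter are all hypothetical names made up for this illustration, and the
"throttling" is reduced to counting the pages that would be charged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 8

/* Hypothetical toy page cache: one "cached" flag per page index. */
struct toy_cache {
	bool cached[TOY_NPAGES];
	unsigned long lookups;     /* every read request, hit or miss */
	unsigned long io_charged;  /* pages that reached the throttle hook */
};

/*
 * Stand-in for a readpage()-level hook. It is only called on a cache
 * miss, so this is where a real controller could charge or delay the
 * caller for the disk IO about to be issued.
 */
static void toy_readpage(struct toy_cache *c, size_t index)
{
	c->io_charged++;
	c->cached[index] = true;   /* the page is now in the cache */
}

/* Stand-in for the read path: cache hits return without being charged. */
static void toy_read(struct toy_cache *c, size_t index)
{
	assert(index < TOY_NPAGES);
	c->lookups++;
	if (!c->cached[index])
		toy_readpage(c, index);
}
```

Reading the same page repeatedly charges it once; only the first access
goes through the hook. Placing the charge in toy_read() instead would
model throttling at page cache entry, where hits get charged as well.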

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR