Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

Vivek Goyal <vgoyal@xxxxxxxxxx> · Fri, 15 Apr 2011 23:06:02 -0400

On Fri, Apr 15, 2011 at 05:07:50PM -0400, Vivek Goyal wrote:
> On Mon, Apr 11, 2011 at 11:36:30AM +1000, Dave Chinner wrote:
> 
> [..]
> > > > > > how metadata IO is going to be handled by
> > > > > > IO controllers,
> > > > > 
> > > > > So IO controller provides two mechanisms.
> > > > > 
> > > > > - IO throttling(bytes_per_second, io_per_second interface)
> > > > > - Proportional weight disk sharing
> > > > > 
> > > > > In case of proportional weight disk sharing, we don't run into issues of
> > > > > > priority inversion and metadata handling should not be a concern.
> > > > 
> > > > Though metadata IO will affect how much bandwidth/iops is available
> > > > for applications to use.
> > > 
> > > I think meta data IO will be accounted to the process submitting the meta
> > > data IO. (IO tracking stuff will be used only for page cache pages during
> > > page dirtying time). So yes, the process doing meta data IO will be
> > > charged for it. 
> > > 
> > > I think I am missing something here and not understanding your concern
> > > exactly here.
> > 
> > XFS can issue thousands of delayed metadata write IO per second from
> > > its writeback threads when it needs to (e.g. tail pushing the
> > journal).  Completely unthrottled due to the context they are issued
> > from(*) and can basically consume all the disk iops and bandwidth
> > capacity for seconds at a time. 
> > 
> > Also, XFS doesn't use the page cache for metadata buffers anymore
> > so page cache accounting, throttling and reclaim mechanisms
> > > are never going to work for controlling XFS metadata IO.
> > 
> > 
> > (*) It'll be IO issued by workqueues rather than threads RSN:
> > 
> > http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39
> > 
> > > And this will become _much_ more common in the not-too-distant
> > future. So context passing between threads and to workqueues is
> > something you need to think about sooner rather than later if you
> > want metadata IO to be throttled in any way....
> 
> Ok,
> 
> > So this seems to be a similar case to WRITE traffic from flusher threads,
> > which can disrupt IO on the end device even if we have done throttling in
> > balance_dirty_pages().
> 
> > How about doing throttling at two layers? All the data throttling is
> > done in higher layers, and we also retain the mechanism of throttling
> > at the end device. That way an admin can put an overall limit on such
> > common write traffic (XFS meta data coming from workqueues, flusher
> > thread, kswapd etc).
> 
> Anyway, we can't attribute this IO to per process context/group otherwise
> most likely something will get serialized in higher layers.
>  
> Right now I am speaking purely from IO throttling point of view and not
> even thinking about CFQ and IO tracking stuff.
> 
> > This increases the complexity of the IO cgroup interface, as now we seem
> > to have four combinations.
> > 
> >   Global throttling
> >       Throttling at lower layers
> >       Throttling at higher layers
> > 
> >   Per device throttling
> >       Throttling at lower layers
> >       Throttling at higher layers

Dave, 

I wrote the above, but I myself am not fond of having four combinations.
I want to limit it to two: per device throttling or global throttling.
Here are some more thoughts in general about both the throttling policy
and the proportional policy of the IO controller. For the throttling
policy, I am primarily concerned with how to avoid file system
serialization issues.

Proportional IO (CFQ)
---------------------
- Make writeback cgroup aware. Kernel threads (flusher) which are
  cgroup aware can be marked with a task flag (GROUP_AWARE). If a
  cgroup aware kernel thread throws IO at CFQ, then the IO is accounted
  to the cgroup of the task that originally dirtied the page. Otherwise
  we use the task context to account the IO.

  So any IO submitted by flusher threads will go to the respective cgroups,
  and a higher weight cgroup should be able to do more WRITES.

  IO submitted by other kernel threads like kjournald, XFS async metadata
  submission, kswapd etc all goes to the thread context, which is the root
  group.

- If kswapd is a concern then either make kswapd cgroup aware or let
  kswapd use cgroup aware flusher to do IO (Dave Chinner's idea).
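
The accounting rule from the first bullet can be written down as a small
userspace model. This is only a sketch of the proposal: struct task,
struct page, io_charge_cgroup() and the treatment of a missing page are
all hypothetical names and assumptions, not existing kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code: cgroup ids are ints, 0 is root. */
enum { ROOT_CGROUP = 0 };

struct task {
    int  cgroup;       /* cgroup the thread itself belongs to */
    bool group_aware;  /* models the proposed GROUP_AWARE task flag */
};

struct page {
    int dirtier_cgroup; /* cgroup of the task that dirtied the page */
};

/*
 * Pick the cgroup to charge for IO submitted by `submitter`.
 * A cgroup aware thread (e.g. flusher) charges the original dirtier;
 * every other thread charges its own cgroup.
 * Assumption: `pg` is NULL when no dirtier was tracked (e.g. metadata),
 * in which case we fall back to the submitter's context.
 */
int io_charge_cgroup(const struct task *submitter, const struct page *pg)
{
    assert(submitter != NULL);
    if (submitter->group_aware && pg != NULL)
        return pg->dirtier_cgroup;
    return submitter->cgroup;
}
```

With this rule a flusher thread living in the root group charges a page
dirtied in cgroup 7 to cgroup 7, while kjournald submitting the same page
would charge the root group.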

Open Issues
-----------
- We do not get isolation for meta data IO. In virtualized setup, to
  achieve stronger isolation do not use host filesystem. Export block
  devices into guests.

IO throttling
-------------

READS
-----
- Do not throttle meta data IO. The filesystem needs to mark READ metadata
  IO so that we can avoid throttling it. This way ordered filesystems
  will not get serialized behind a throttled read in a slow group.

  Maybe one can account meta data reads to a group and use that to
  throttle data IO in the same cgroup as compensation.
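
A minimal sketch of that compensation idea, as a userspace model. The
struct, the per-window byte budget and the is_metadata argument (standing
in for the mark set by the filesystem) are assumptions made for
illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cgroup read accounting for one time window. */
struct read_group {
    uint64_t budget_bytes;   /* bytes allowed in the current window */
    uint64_t charged_bytes;  /* bytes charged so far in the window */
};

/*
 * Returns true if the read has to wait. Metadata reads never wait, but
 * they are still charged, so they eat into the budget left for data.
 */
bool read_should_throttle(struct read_group *g, uint64_t bytes,
                          bool is_metadata)
{
    if (!is_metadata && g->charged_bytes + bytes > g->budget_bytes)
        return true;            /* data read over budget: not charged yet */

    g->charged_bytes += bytes;  /* metadata is charged but never delayed */
    return false;
}
```

A group that has spent its budget on metadata reads still gets every
metadata read through immediately; only its data reads are held back
until the next window.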

WRITES
------
- Throttle tasks. Do not throttle bios. That means that when a task
  submits a direct write, let it go to disk. Do the accounting, and if the
  task is exceeding the IO rate, make it sleep. Something similar to
  balance_dirty_pages().

  That way, any direct WRITES should not run into any serialization issues
  in ordered mode. We can continue to use the blk_throtl_bio() hook in
  generic_make_request().

- For buffered WRITES, design a throttling hook similar to
  balance_dirty_pages() and throttle tasks according to the rules while
  they are dirtying page cache.

- Do not throttle buffered writes again at the end device as these have
  been throttled already while writing to page cache. Also, throttling
  WRITES at the end device will lead to serialization issues with file
  systems in ordered mode.

- The cgroup of an IO is always that of the submitting thread. That way
  all meta data writes will go in the root cgroup and remain unthrottled.
  If one is too concerned with lots of meta data IO, then one can probably
  put a throttling rule in the root cgroup.
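
The "throttle tasks, not bios" rule can be sketched as follows. This is a
userspace model under stated assumptions: struct task_io_account and
account_write_and_get_sleep_ms() are hypothetical, time is passed in as
milliseconds, and overflow of the byte counter is ignored for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-task write accounting. */
struct task_io_account {
    uint64_t rate_bps;         /* allowed bytes per second */
    uint64_t bytes_done;       /* bytes submitted since window_start_ms */
    uint64_t window_start_ms;  /* start of the accounting window */
};

/*
 * Account a write that has already been sent to disk and return how many
 * milliseconds the submitting task should sleep to stay within its rate.
 * The bio itself is never held back, only the task is delayed.
 */
uint64_t account_write_and_get_sleep_ms(struct task_io_account *a,
                                        uint64_t bytes, uint64_t now_ms)
{
    assert(a->rate_bps > 0);
    assert(now_ms >= a->window_start_ms);

    a->bytes_done += bytes;

    /* Time the window must span for bytes_done to fit the rate. */
    uint64_t needed_ms = a->bytes_done * 1000 / a->rate_bps;
    uint64_t elapsed_ms = now_ms - a->window_start_ms;

    return needed_ms > elapsed_ms ? needed_ms - elapsed_ms : 0;
}
```

Because the write has already reached the disk by the time the task
sleeps, nothing that a journal commit depends on is left sitting in a
throttle queue.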

Open Issues
-----------
- IO spikes at end devices

  Because buffered writes are controlled at page dirtying time, we can
  have a spike of IO later at the end device when the flusher thread
  decides to do writeback.

  I am not sure how to solve this issue. Part of the problem can be
  handled by using a per cgroup dirty ratio and keeping each cgroup's
  ratio low so that we don't build up huge dirty caches. This can lead
  to a performance drop for applications. So this is a performance vs
  isolation trade off, and the user chooses one.

  This issue exists in virtualized environment only if host file system
  is used. The best way to achieve maximum isolation would be to export
  block devices into guest and then perform throttling per block device.

- Poor isolation for meta data.

  We can't account and throttle meta data in each cgroup, otherwise we
  would again run into file system serialization issues in ordered
  mode. So this is a trade off of using file systems. You primarily get
  throttling for data IO and not meta data IO.

  Again, export block devices into virtual machines, create file systems
  on those and do not use the host filesystem, and one can achieve very
  good isolation.
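
For the IO spike issue above, the effect of a per cgroup dirty ratio can
be estimated with simple arithmetic. The sketch below assumes that a
cgroup can never hold more dirty page cache than dirty_ratio percent of
its memory limit, and that the device drains the backlog at a constant
bandwidth. Both functions and their parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on dirty page cache a cgroup can build up, in bytes.
 * Divides first to avoid overflow, so the result rounds down.
 */
uint64_t max_dirty_bytes(uint64_t cgroup_mem_bytes, unsigned dirty_ratio_pct)
{
    assert(dirty_ratio_pct <= 100);
    return cgroup_mem_bytes / 100 * dirty_ratio_pct;
}

/* Seconds the end device needs to absorb that backlog, rounded up. */
uint64_t worst_case_drain_secs(uint64_t dirty_bytes, uint64_t device_bps)
{
    assert(device_bps > 0);
    return (dirty_bytes + device_bps - 1) / device_bps;
}
```

Under these assumptions a lower ratio shrinks the worst case burst
linearly, which is the isolation side of the trade off; the cost is that
the task gets throttled earlier while dirtying pages.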

Thoughts?

Thanks
Vivek