On Mon, Apr 11, 2011 at 11:36:30AM +1000, Dave Chinner wrote:

[..]

> > > > > how metadata IO is going to be handled by
> > > > > IO controllers,
> > > >
> > > > So IO controller provides two mechanisms.
> > > >
> > > > - IO throttling (bytes_per_second, io_per_second interface)
> > > > - Proportional weight disk sharing
> > > >
> > > > In case of proportional weight disk sharing, we don't run into issues of
> > > > priority inversion, and metadata handling should not be a concern.
> > >
> > > Though metadata IO will affect how much bandwidth/iops is available
> > > for applications to use.
> >
> > I think metadata IO will be accounted to the process submitting the
> > metadata IO. (The IO tracking stuff will be used only for page cache
> > pages at page dirtying time.) So yes, the process doing metadata IO
> > will be charged for it.
> >
> > I think I am missing something here and not understanding your concern
> > exactly.
>
> XFS can issue thousands of delayed metadata write IOs per second from
> its writeback threads when it needs to (e.g. tail pushing the
> journal). Completely unthrottled due to the context they are issued
> from (*), and can basically consume all the disk iops and bandwidth
> capacity for seconds at a time.
>
> Also, XFS doesn't use the page cache for metadata buffers anymore,
> so page cache accounting, throttling and reclaim mechanisms
> are never going to work for controlling XFS metadata IO.
>
> (*) It'll be IO issued by workqueues rather than threads RSN:
>
> http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39
>
> And this will become _much_ more common in the not-too-distant
> future. So context passing between threads and to workqueues is
> something you need to think about sooner rather than later if you
> want metadata IO to be throttled in any way....

Ok, so this seems to be a case similar to the WRITE traffic from flusher
threads, which can disrupt IO on the end device even if we have done
throttling in balance_dirty_pages().

How about doing throttling at two layers? All the data throttling is done
in the higher layers, and we also retain the mechanism of throttling at
the end device. That way an admin can put an overall limit on such common
write traffic (XFS metadata coming from workqueues, flusher threads,
kswapd etc.). Anyway, we can't attribute this IO to a per-process
context/group, otherwise most likely something will get serialized in the
higher layers.

Right now I am speaking purely from an IO throttling point of view and am
not even thinking about CFQ and the IO tracking stuff.

This increases the complexity of the IO cgroup interface, as now we seem
to have four combinations.

  Global throttling
	Throttling at lower layers
	Throttling at higher layers

  Per-device throttling
	Throttling at lower layers
	Throttling at higher layers

Thanks
Vivek
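To make the above a bit more concrete, here is a rough sketch of what the
four combinations could look like as cgroup knobs, modeled on the existing
per-device blkio.throttle.* format. Only the blkio.throttle.*_device files
shown below exist today; the "upper" and "global" names are made up purely
for illustration and are not a proposal for the final interface:

  # mount the blkio controller and create a test group
  # (the mount point is just an example)
  mount -t cgroup -o blkio none /cgroup/blkio
  mkdir /cgroup/blkio/test1

  # existing interface: per-device throttling enforced at the block layer,
  # format is "<major>:<minor> <limit>"
  echo "8:16 1048576" > /cgroup/blkio/test1/blkio.throttle.write_bps_device
  echo "8:16 100"     > /cgroup/blkio/test1/blkio.throttle.write_iops_device

  # hypothetical: per-device limit enforced in the higher layers
  # (e.g. in balance_dirty_pages()); name made up for illustration
  echo "8:16 1048576" > /cgroup/blkio/test1/blkio.throttle.upper.write_bps_device

  # hypothetical: global limits (all devices), enforced at the lower and
  # at the higher layer respectively; names made up for illustration
  echo "2097152" > /cgroup/blkio/test1/blkio.throttle.global.write_bps
  echo "2097152" > /cgroup/blkio/test1/blkio.throttle.upper.global.write_bps

The point is only that "global vs. per-device" and "lower vs. higher layer"
are orthogonal axes, so every limit potentially needs both variants.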