Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

On Thu, Apr 07, 2011 at 01:55:37PM -0400, Vivek Goyal wrote:
> On Thu, Apr 07, 2011 at 09:50:39AM +1000, Dave Chinner wrote:
> > On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:
> > > On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote:
> > > > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> > > > It also means you will have handle the case of a cgroup over a
> > > > throttle limit and no inodes on its dirty list. It's not a case of
> > > > "probably can live with" the resultant mess, the mess will occur and
> > > > so handling it needs to be designed in from the start.
> > > 
> > > This behavior can happen due to shared page accounting. One possible
> > > way to mitigate this problem is to traverse through the LRU list of pages
> > > of the memcg and find an inode to do the writeback. 
> > 
> > Page LRU ordered writeback is something we need to avoid. It causes
> > havoc with IO and allocation patterns. Also, how expensive is such a
> > walk? If it's a common operation, then it's a non-starter for the
> > generic writeback code.
> > 
> 
> Agreed that LRU ordered writeback needs to be avoided as it is going to be
> expensive. That's why the notion of keeping inodes on per-memcg lists and
> doing per-inode writeback. The LRU walk is only a backup plan for the case
> where there is no inode to do IO against due to the shared inode accounting issues. 

This shouldn't be hidden inside memcg reclaim - memcg reclaim should
do exactly what the MM subsystem normally does without memcg being in
the picture.  That is, you need to convince the MM guys to change
the way reclaim does writeback from the LRU. We've been asking them
to do this for years....
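
For reference, a minimal user-space sketch of the per-memcg, per-bdi dirty
inode list idea quoted above. All structure and function names are invented
for illustration; the real kernel data structures and locking are very
different:

/* Sketch only: one dirty inode list per (memcg, bdi) pair instead of a
 * single global per-bdi list. Names are made up for illustration. */
#include <stdio.h>
#include <stdlib.h>

struct inode_node {
        unsigned long ino;              /* inode number */
        struct inode_node *next;        /* next dirty inode on this list */
};

struct memcg_bdi {
        struct inode_node *dirty;       /* dirty inodes charged to this memcg on this bdi */
};

/* An inode goes on exactly one memcg's dirty list (the first dirtier). */
static void sketch_mark_inode_dirty(struct memcg_bdi *mb, unsigned long ino)
{
        struct inode_node *n = malloc(sizeof(*n));

        if (!n)
                return;
        n->ino = ino;
        n->next = mb->dirty;
        mb->dirty = n;
}

/* Per-memcg writeback walks only that memcg's inodes, so IO stays
 * inode-ordered rather than page-LRU-ordered. */
static void sketch_writeback(struct memcg_bdi *mb)
{
        struct inode_node *n;

        for (n = mb->dirty; n; n = n->next)
                printf("writeback inode %lu\n", n->ino);
}

int main(void)
{
        struct memcg_bdi mb = { NULL };

        sketch_mark_inode_dirty(&mb, 42);
        sketch_mark_inode_dirty(&mb, 7);
        sketch_writeback(&mb);
        return 0;
}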

> Do you have ideas on a better way to handle it? The other proposal, of
> maintaining a memcg_mapping list that tracks which inodes this cgroup
> has dirtied, has been deemed complex and was more or less rejected, at
> least for the first step.

Fix the mm subsystem to DTRT first?

> 
> > BTW, how is "shared page accounting" different to the shared dirty
> > inode case we've been discussing?
> 
> IIUC, there are two problems.
> 
> - Issues because of shared page accounting
> - Issues because of shared inode accounting.
> 
> So with shared page accounting, if two processes do IO to the same page, the
> IO gets charged to the cgroup that first touched the page. So if a cgroup is
> writing to lots of shared pages, that IO will be charged to the other cgroup
> that brought the pages into memory to begin with and will drive its dirty
> ratio up. So this seems to be a case of weaker isolation for shared pages,
> and we have to live with it.
> 
> Similarly, if an inode is shared, the inode gets put on the list of the memcg
> that dirtied it first. So if two cgroups are dirtying pages on the same inode,
> the pages get charged to their respective cgroups but the inode will be on
> only one memcg's list, and once writeback is performed it can happen that a
> cgroup is over its background limit but there are no inodes to do writeback against.
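
To make the shared accounting behaviour described above concrete, here is a
toy first-touch charging model; all names are invented for the example and
none of this is kernel code:

/* Toy model of first-touch accounting: a shared page is charged to
 * whichever cgroup touched it first, so a second cgroup writing to the
 * same page drives up the first cgroup's dirty count. */
#include <stdio.h>

struct cgroup { const char *name; unsigned long nr_dirty; };
struct page   { struct cgroup *owner; };   /* hypothetical: set on first touch */

static void dirty_page(struct page *pg, struct cgroup *cg)
{
        if (!pg->owner)
                pg->owner = cg;         /* first toucher becomes the owner */
        pg->owner->nr_dirty++;          /* later writers still charge the owner */
}

int main(void)
{
        struct cgroup a = { "A", 0 }, b = { "B", 0 };
        struct page shared = { NULL };

        dirty_page(&shared, &a);        /* A brings the page in */
        dirty_page(&shared, &b);        /* B's write is charged to A */
        printf("A dirty=%lu B dirty=%lu\n", a.nr_dirty, b.nr_dirty); /* A=2 B=0 */
        return 0;
}
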
> 
> > 
> > > After yesterday's discussion it looked like people agreed that, to
> > > begin with, we keep it simple and maintain the notion of one inode on
> > > one memcg list. So instead of the inode being on the global bdi dirty
> > > list, it will be on a per-memcg, per-bdi dirty list.
> > 
> > Good to hear.
> > 
> > > > how metadata IO is going to be handled by
> > > > IO controllers,
> > > 
> > > So the IO controller provides two mechanisms.
> > > 
> > > - IO throttling(bytes_per_second, io_per_second interface)
> > > - Proportional weight disk sharing
> > > 
> > > In case of proportional weight disk sharing, we don't run into issues of
> > > priority inversion, and metadata handling should not be a concern.
> > 
> > Though metadata IO will affect how much bandwidth/iops is available
> > for applications to use.
> 
> I think metadata IO will be accounted to the process submitting the metadata
> IO. (The IO tracking stuff will be used only for page cache pages at page
> dirtying time.) So yes, the process doing metadata IO will be
> charged for it.
> 
> I think I am missing something here and am not understanding your concern
> exactly.
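
As a rough illustration of the second mechanism Vivek lists above
(proportional weight disk sharing): each busy group gets disk time in
proportion to its weight. The weights and group names below are invented for
the example; this is not the CFQ implementation:

/* Toy model of proportional weight disk-time sharing among cgroups.
 * Weights and names are invented; real CFQ group scheduling differs. */
#include <stdio.h>

struct group { const char *name; unsigned int weight; };

int main(void)
{
        struct group groups[] = { { "db", 500 }, { "backup", 100 } };
        unsigned int total = 0;
        int i;

        for (i = 0; i < 2; i++)
                total += groups[i].weight;

        /* Each active group gets disk time in proportion to its weight;
         * if a group is idle, the others absorb its share (work conserving). */
        for (i = 0; i < 2; i++)
                printf("%s: %u%% of disk time when all groups are busy\n",
                       groups[i].name, 100 * groups[i].weight / total);
        return 0;
}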

XFS can issue thousands of delayed metadata write IOs per second from
its writeback threads when it needs to (e.g. tail pushing the
journal). They are completely unthrottled due to the context they are issued
from (*) and can basically consume all the disk iops and bandwidth
capacity for seconds at a time.

Also, XFS doesn't use the page cache for metadata buffers anymore,
so page cache accounting, throttling and reclaim mechanisms
are never going to work for controlling XFS metadata IO.


(*) It'll be IO issued by workqueues rather than threads RSN:

http://git.kernel.org/?p=linux/kernel/git/dgc/xfsdev.git;a=shortlog;h=refs/heads/xfs-for-2.6.39

And this will become _much_ more common in the not-too-distant
future. So context passing between threads and to workqueues is
something you need to think about sooner rather than later if you
want metadata IO to be throttled in any way....
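
A minimal sketch of what the context passing mentioned above could look like:
capture the submitter's IO context when the work is queued so that IO issued
later from a worker can still be attributed. Everything here is hypothetical,
not the actual XFS or workqueue code:

/* Illustrative only: attach the originating cgroup to a deferred work
 * item so IO issued later from a worker can still be attributed and
 * throttled. Names and structures are made up. */
#include <stdio.h>

struct io_context { const char *cgroup; };

struct work_item {
        struct io_context *ctx;                 /* captured at submission time */
        void (*fn)(struct work_item *w);
};

static void issue_metadata_io(struct work_item *w)
{
        /* The worker has no process context of its own; it uses the
         * context captured when the work was queued. */
        printf("metadata IO charged to cgroup %s\n", w->ctx->cgroup);
}

int main(void)
{
        struct io_context ctx = { "xfs-app-cgroup" };
        struct work_item w = { &ctx, issue_metadata_io };

        w.fn(&w);                               /* stand-in for the workqueue running the item */
        return 0;
}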

> > > For the throttling case, apart from metadata, I found that with simple
> > > throttling of data I ran into issues with journalling with ext4 mounted
> > > in ordered mode. So it was suggested that WRITE IO throttling should
> > > not be done at the device level; instead, try to do it in higher layers,
> > > possibly balance_dirty_pages(), and throttle the process early.
> > 
> > The problem with doing it at the page cache entry level is that
> > cache hits then get throttled. It's not really an IO controller at
> > that point, and the impact on application performance could be huge
> > (i.e. MB/s instead of GB/s).
> 
> Agreed that throttling cache hits is not a good idea. Can we determine
> whether the page being asked for is in the cache or not and charge for IO
> accordingly?

You'd need hooks in find_or_create_page(), though you have no
context of whether a read or a write is in progress at that point.
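
A sketch of the "charge only on a cache miss" idea, under the assumption that
such a hook could be added at page cache lookup time; the kernel has no such
hook today and, as noted above, the read/write context is not available there:

/* Sketch of charging IO only when the lookup misses the page cache.
 * Purely illustrative; names and the fake cache are invented. */
#include <stdbool.h>
#include <stdio.h>

struct cgroup { const char *name; unsigned long charged_io; };

static bool page_in_cache(unsigned long index)
{
        return index % 2 == 0;          /* fake cache: even indexes count as hits */
}

static void lookup_page(unsigned long index, struct cgroup *cg)
{
        if (page_in_cache(index))
                return;                 /* cache hit: no IO, so no charge */
        cg->charged_io++;               /* miss: real IO will happen, charge it */
}

int main(void)
{
        struct cgroup cg = { "app", 0 };

        lookup_page(0, &cg);            /* hit  */
        lookup_page(1, &cg);            /* miss */
        printf("%s charged for %lu IOs\n", cg.name, cg.charged_io);
        return 0;
}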

> > > So yes, I agree that a little more documentation and more clarity on this
> > > would be good. All this cgroup aware writeback is primarily being done
> > > for CFQ's proportional disk sharing at the moment.
> > > 
> > > > what kswapd is going to do writeback when the pages
> > > > it's trying to writeback during a critical low memory event belong
> > > > to a cgroup that is throttled at the IO level, etc.
> > > 
> > > Throttling will move up so kswapd will not be throttled. Even today,
> > > kswapd is part of the root group and we do not suggest throttling the root group.
> > 
> > So once again you have the problem of writeback from kswapd (which
> > is ugly to begin with) affecting all the groups. Given kswapd likes
> > to issue what is effectively random IO, this could have a devastating
> > effect on everything else....
> 
> Implementing throttling at a higher layer has the problem of IO spikes
> at the end devices when the flusher or kswapd decides to do a bunch of
> IO. I really don't have a good answer for that. Doing throttling at the
> device level runs into issues with journalling. So I guess the issue of
> IO spikes is a lesser concern compared to the issue of choking the filesystem.
> 
> The following two things might help a bit with IO spikes, though.
> 
> - Keep the per-cgroup background dirty ratio low so that the flusher tries to
>   flush out pages sooner rather than later.

Which has major performance impacts.
> 
> - All the IO coming from the flusher/kswapd will be going into the root group
>   from a throttling perspective. We can still throttle it to
>   some reasonable value to reduce the impact of IO spikes.

Don't do writeback from kswapd at all? Push it all to the flusher
thread which has a context to work from?
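
For the idea quoted above of throttling root-group flusher/kswapd IO to some
reasonable value, a trivial token-bucket sketch; the rate, bucket depth and
names are all invented for the example:

/* Trivial token-bucket sketch for capping root-group writeback IO.
 * Purely illustrative; the limits, names and units are invented. */
#include <stdio.h>

struct throttle {
        unsigned long tokens;           /* IO tokens currently available */
        unsigned long rate;             /* tokens refilled per "tick" */
        unsigned long max_tokens;       /* bucket depth */
};

static void refill(struct throttle *t)
{
        t->tokens += t->rate;
        if (t->tokens > t->max_tokens)
                t->tokens = t->max_tokens;
}

/* Returns 1 if the IO may be issued now, 0 if it must wait for a refill. */
static int may_issue_io(struct throttle *t)
{
        if (t->tokens == 0)
                return 0;
        t->tokens--;
        return 1;
}

int main(void)
{
        struct throttle root = { .tokens = 2, .rate = 2, .max_tokens = 4 };
        int i;

        for (i = 0; i < 6; i++) {
                while (!may_issue_io(&root))
                        refill(&root);  /* stand-in for sleeping until the next tick */
                printf("issued IO %d\n", i);
        }
        return 0;
}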

> > > For the case of proportional disk sharing, we will probably account the
> > > IO to the respective cgroups (pages submitted by kswapd), and that should
> > > flush to disk fairly fast and should not block for a long time as it is a
> > > work conserving mechanism.
> > 
> > Well, it depends. I can still see how, with proportional IO, kswapd
> > would get slowed cleaning dirty pages on one memcg when there are
> > clean pages in another memcg that it could reclaim without doing any
> > IO, i.e. it has the potential to slow down memory reclaim significantly.
> > (Note, I'm assuming proportional IO doesn't mean "no throttling"; it
> > just means there is a much lower delay on IO issue.)
> 
> Proportional IO can delay submitting an IO only if there is IO happening
> in other groups. So IO can still be throttled, and the limits are decided
> by a group's fair share. But if other groups are not doing IO and not
> using their fair share, then the group doing IO gets a bigger share.
> 
> So yes, if heavy IO is happening at the disk while kswapd is also trying
> to reclaim memory, then IO submitted by kswapd can be delayed and
> this can slow down reclaim. (Does kswapd have to block after submitting
> IO from a memcg? Can't it just move on to the next memcg and either free
> pages if they are not dirty, or also submit IO from the next memcg?)

No idea - you'll need to engage the mm guys to get help there.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

