On Thu, Apr 07, 2011 at 09:50:39AM +1000, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 11:37:15AM -0400, Vivek Goyal wrote:
> > On Wed, Apr 06, 2011 at 08:56:40AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
> > > It also means you will have to handle the case of a cgroup over a
> > > throttle limit and no inodes on its dirty list. It's not a case of
> > > "probably can live with" the resultant mess, the mess will occur and
> > > so handling it needs to be designed in from the start.
> >
> > This behavior can happen due to shared page accounting. One possible
> > way to mitigate this problem is to traverse through the LRU list of
> > pages of the memcg and find an inode to do the writeback.
>
> Page LRU ordered writeback is something we need to avoid. It causes
> havoc with IO and allocation patterns. Also, how expensive is such a
> walk? If it's a common operation, then it's a non-starter for the
> generic writeback code.

Agreed that LRU ordered writeback needs to be avoided as it is going to
be expensive. That's why the plan is to keep inodes on per-memcg lists
and do per-inode writeback.

Walking the memcg LRU would only be a backup plan for the case where
there is no inode to do IO against, due to shared inode accounting
issues. Do you have ideas on a better way to handle it?

The other proposal, maintaining a list of memcg_mapping objects which
track which inodes a cgroup has dirtied, has been deemed complex and
was more or less rejected, at least for the first step.

> BTW, how is "shared page accounting" different to the shared dirty
> inode case we've been discussing?

IIUC, there are two problems:

- Issues because of shared page accounting.
- Issues because of shared inode accounting.

With shared page accounting, if two processes do IO to the same page,
the IO gets charged to the cgroup that first touched the page. So if a
cgroup is writing to lots of shared pages, the IO will be charged to
the other cgroup that brought those pages into memory to begin with,
and will drive that cgroup's dirty ratio up. This is a case of weaker
isolation for shared pages, and we have to live with it.

Similarly, if an inode is shared, the inode gets put on the list of the
memcg that dirtied it first. So if two cgroups are dirtying pages of
the same inode, the pages are charged to the respective cgroups, but
the inode sits on only one memcg's list. Once writeback is performed,
it can happen that a cgroup is over its background limit but has no
inodes to write back.

> > After yesterday's discussion it looked like people agreed that to
> > begin with keep it simple and maintain the notion of one inode on
> > one memcg list. So instead of inode being on global bdi dirty list
> > it will be on per memcg per bdi dirty list.
>
> Good to hear.
>
> > > how metadata IO is going to be handled by
> > > IO controllers,
> >
> > So IO controller provides two mechanisms.
> >
> > - IO throttling (bytes_per_second, io_per_second interface)
> > - Proportional weight disk sharing
> >
> > In case of proportional weight disk sharing, we don't run into issues of
> > priority inversion and metadata handling should not be a concern.
>
> Though metadata IO will affect how much bandwidth/iops is available
> for applications to use.

I think metadata IO will be accounted to the process submitting it (the
IO tracking machinery is used only for page cache pages, at page
dirtying time). So yes, the process doing metadata IO will be charged
for it. I suspect I am missing something and not understanding your
exact concern here.
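Coming back to the shared inode accounting case above, here is a
minimal user-space sketch of the "one inode on one memcg dirty list"
idea and of how a cgroup can end up over its background limit with
nothing on its list. This is only an illustration under my
assumptions; the names (memcg_stub, inode_stub, account_dirty_page)
are hypothetical and are not the kernel interfaces being discussed.

/*
 * Toy sketch of "one inode on one memcg dirty list". All structures
 * and names are hypothetical illustrations, not kernel code.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct memcg_stub;

struct inode_stub {
        int ino;
        struct memcg_stub *dirty_owner; /* memcg that dirtied it first */
        struct inode_stub *next;        /* link on that memcg's dirty list */
};

struct memcg_stub {
        const char *name;
        long nr_dirty;                  /* pages charged to this memcg */
        long background_thresh;
        struct inode_stub *dirty_inodes;/* per-memcg (per-bdi) dirty list */
};

/* Called (conceptually) each time @memcg dirties a page of @inode. */
static void account_dirty_page(struct inode_stub *inode,
                               struct memcg_stub *memcg)
{
        memcg->nr_dirty++;              /* page charged to the dirtier */
        if (!inode->dirty_owner) {
                /* inode goes to whichever memcg dirtied it first */
                inode->dirty_owner = memcg;
                inode->next = memcg->dirty_inodes;
                memcg->dirty_inodes = inode;
        }
        /*
         * If another memcg dirties this inode later, its pages are still
         * charged to it, but the inode stays on the first memcg's list --
         * the shared inode case discussed above.
         */
}

static bool memcg_over_bg_thresh(struct memcg_stub *memcg)
{
        return memcg->nr_dirty > memcg->background_thresh;
}

int main(void)
{
        struct memcg_stub a = { "A", 0, 4, NULL }, b = { "B", 0, 4, NULL };
        struct inode_stub shared = { 1, NULL, NULL };
        int i;

        account_dirty_page(&shared, &a);        /* A touches it first */
        for (i = 0; i < 8; i++)
                account_dirty_page(&shared, &b);/* B does the bulk of the IO */

        printf("B over background limit: %d, B has inodes to write: %d\n",
               memcg_over_bg_thresh(&b), b.dirty_inodes != NULL);
        return 0;
}

Running it prints that cgroup B is over its limit while its dirty list
is empty, which is exactly the case the backup plan (LRU walk or
something better) has to handle.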
> > For throttling case, apart from metadata, I found that with simple
> > throttling of data I ran into issues with journalling with ext4 mounted
> > in ordered mode. So it was suggested that WRITE IO throttling should
> > not be done at device level; instead try to do it in higher layers,
> > possibly balance_dirty_pages(), and throttle the process early.
>
> The problem with doing it at the page cache entry level is that
> cache hits then get throttled. It's not really an IO controller at
> that point, and the impact on application performance could be huge
> (i.e. MB/s instead of GB/s).

Agreed that throttling cache hits is not a good idea. Can we determine
whether the page being asked for is already in the cache and charge
for IO accordingly?

> > So yes, I agree that a little more documentation and more clarity on
> > this would be good. All this cgroup aware writeback is primarily being
> > done for CFQ's proportional disk sharing at the moment.
>
> > > what kswapd is going to do writeback when the pages
> > > it's trying to writeback during a critical low memory event belong
> > > to a cgroup that is throttled at the IO level, etc.
> >
> > Throttling will move up so kswapd will not be throttled. Even today,
> > kswapd is part of root group and we do not suggest throttling root group.
>
> So once again you have the problem of writeback from kswapd (which
> is ugly to begin with) affecting all the groups. Given kswapd likes
> to issue what is effectively random IO, this could have devastating
> effect on everything else....

Implementing throttling at a higher layer has the problem of IO spikes
at the end device when the flusher or kswapd decides to do a bunch of
IO. I really don't have a good answer for that. Doing throttling at
the device level runs into issues with journalling, so I guess the
issue of IO spikes is a lesser concern compared to the issue of
choking the filesystem.

The following two things might help a bit with IO spikes, though:

- Keep the per-cgroup background dirty ratio low so that the flusher
  tries to flush out pages sooner rather than later.

- All the IO coming from the flusher/kswapd will go to the root group
  from a throttling perspective. We can throttle that again to some
  reasonable value to reduce the impact of IO spikes.

Ideas to handle this better?

> > For the case of proportional disk sharing, we will probably account
> > IO to respective cgroups (pages submitted by kswapd), and that should
> > flush to disk fairly fast and should not block for a long time as it
> > is a work conserving mechanism.
>
> Well, it depends. I can still see how, with proportional IO, kswapd
> would get slowed cleaning dirty pages on one memcg when there are
> clean pages in another memcg that it could reclaim without doing any
> IO. i.e. it has potential to slow down memory reclaim significantly.
> (Note, I'm assuming proportional IO doesn't mean "no throttling", it
> just means there is a much lower delay on IO issue.)

Proportional IO can delay submitting an IO only if there is IO
happening in other groups. So IO can still be throttled, and the
limits are decided by the fair share of a group. But if other groups
are not doing IO and not using their fair share, then the group doing
IO gets a bigger share.

So yes, if heavy IO is happening at the disk while kswapd is also
trying to reclaim memory, then IO submitted by kswapd can be delayed
and this can slow down reclaim. (Does kswapd have to block after
submitting IO from a memcg? Can't it just move on to the next memcg
and either free pages there if they are not dirty, or also submit IO
from that memcg?)
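To illustrate the work-conserving point, here is a small stand-alone C
sketch of a proportional share split in which only groups that
currently have IO queued take part, so an idle group's share is
redistributed instead of being reserved. Again, this is just an
illustration with made-up names (io_group, print_shares); it is not
CFQ's actual group scheduling code.

/*
 * Sketch of a proportional, work-conserving share calculation.
 * Only groups with queued IO ("active") take part in the split,
 * so a lone active group gets the whole disk.
 */
#include <stdio.h>

struct io_group {
        const char *name;
        unsigned int weight;    /* configured proportional weight */
        int has_queued_io;      /* is the group currently backlogged? */
};

static void print_shares(struct io_group *grps, int nr)
{
        unsigned int active_weight = 0;
        int i;

        /* Sum the weights of the groups that actually have IO queued. */
        for (i = 0; i < nr; i++)
                if (grps[i].has_queued_io)
                        active_weight += grps[i].weight;

        for (i = 0; i < nr; i++) {
                double share = 0.0;

                if (grps[i].has_queued_io && active_weight)
                        share = 100.0 * grps[i].weight / active_weight;
                printf("%-8s weight=%u active=%d share=%.1f%%\n",
                       grps[i].name, grps[i].weight,
                       grps[i].has_queued_io, share);
        }
}

int main(void)
{
        struct io_group grps[] = {
                { "root",  1000, 1 },   /* kswapd/flusher IO lands here */
                { "grp-a",  500, 1 },
                { "grp-b",  500, 0 },   /* idle: its share is redistributed */
        };

        print_shares(grps, 3);
        return 0;
}

With the sample values the idle group gets 0% and its weight is split
between root and the active group; as soon as it queues IO again it
gets its proportional share back. That is why kswapd's IO is only
delayed when other groups are actively using the disk.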
Thanks
Vivek