Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
> On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
> > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
> > > There
> > > is no context (memcg or otherwise) given to the bdi flusher.  After
> > > the bdi flusher checks system-wide background limits, it uses the
> > > over_bg_limit list to find (and rotate) an over-limit memcg.  Using
> > > that memcg, the per-memcg, per-bdi dirty inode list is walked to
> > > find inode pages to write back.  Once the memcg's dirty memory usage
> > > drops below the memcg threshold, the memcg is removed from the global
> > > over_bg_limit list.
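
[As I read it, the flow described above is roughly the following. This
is a minimal sketch only; over_bg_limit_list_rotate(),
memcg_bdi_dirty_inodes(), writeback_inodes_list(), memcg_dirty_usage()
and memcg_bg_thresh() are invented names standing in for whatever the
real patches use:]

static void bdi_flush_over_bg_limit(struct backing_dev_info *bdi)
{
	struct mem_cgroup *memcg;

	while (global_dirty_over_background_limit()) {
		/* take the head of the over-limit list and rotate it */
		memcg = over_bg_limit_list_rotate();
		if (!memcg)
			break;

		/* walk the per-memcg, per-bdi dirty inode list */
		writeback_inodes_list(memcg_bdi_dirty_inodes(memcg, bdi));

		/* back under its threshold? drop it from the global list */
		if (memcg_dirty_usage(memcg) < memcg_bg_thresh(memcg))
			over_bg_limit_list_del(memcg);
	}
}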
> > 
> > If you want controlled hand-off of writeback, you need to pass the
> > memcg that triggered the throttling directly to the bdi. You already
> > know what both the bdi and memcg that need writeback are. Yes, this
> > needs concurrency at the BDI flush level to handle it, but see my
> > previous email in this thread for that....
> > 
> 
> Even with the memcg being passed around, I don't think we get rid of
> the global list lock.

You need to - we're getting rid of global lists and locks from
writeback for scalability reasons, so any new functionality needs to
avoid global locks for the same reason.

> The reason being that inodes are not exclusive to
> the memory cgroups. Multiple memory cgroups might be writing to the
> same inode. So the inode still remains on the global list, and the
> memory cgroups will just hold pointers to it.

So two dirty inode lists that have to be kept in sync? That doesn't
sound particularly appealing. Nor does it scale to an inode being
dirty in multiple cgroups.

Besides, if you've got multiple memory groups dirtying the same
inode, then you cannot expect isolation between groups. I'd consider
this a broken configuration - how often does it actually happen, and
what is the use case for supporting it?

Worse, the implication is that we'd have to break up contiguous
IOs in the writeback path simply because two sequential pages are
associated with different groups. That's really nasty, and exactly
the opposite of all the write combining we try to do throughout the
writeback path. Supporting this would also be a mess, as we'd have
to touch quite a lot of filesystem code (i.e. .writepage(s)
implementations) to do it.
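
To make that concrete, per-page group ownership would turn the page
clustering loop in writeback into something like the following. Pure
illustration - page_memcg() and submit_range() are stand-ins, not
real interfaces:

	struct mem_cgroup *memcg = NULL;
	int start = 0, i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page = pages[i];

		if (page_memcg(page) != memcg) {
			/*
			 * Adjacent dirty pages owned by different
			 * groups: what should have been a single
			 * contiguous IO must be split here.
			 */
			if (i > start)
				submit_range(inode, pages, start, i - 1);
			start = i;
			memcg = page_memcg(page);
		}
	}
	submit_range(inode, pages, start, nr_pages - 1);

Every filesystem's writeback implementation would need this kind of
check, which is exactly the sort of churn we should avoid.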

> So to start writeback on an inode
> you will still have to take the global lock, IIUC.

Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
in that cgroup, and go from there? I mean, really, all that
cgroup-aware writeback needs is just a new container for managing
dirty inodes in the writeback path and a method for selecting that
container for writeback, right?
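
Something like the following is all I'm suggesting - a sketch only,
with the structure and field names invented for illustration:

/* one container per (memcg, bdi) pair holding that group's dirty inodes */
struct memcg_bdi {
	struct mem_cgroup	*memcg;
	struct list_head	b_dirty;	/* dirty inodes */
	struct list_head	bdi_node;	/* on bdi->dirty_cgroups */
};

/* added to struct backing_dev_info: */
	struct list_head	dirty_cgroups;	/* containers with dirty inodes */

/* selecting a container for writeback is then just: */
static struct memcg_bdi *bdi_next_dirty_cgroup(struct backing_dev_info *bdi)
{
	if (list_empty(&bdi->dirty_cgroups))
		return NULL;
	return list_first_entry(&bdi->dirty_cgroups,
				struct memcg_bdi, bdi_node);
}

The flusher picks a container, writes back its inodes, and rotates or
removes it - no global list, no global lock.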

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx