Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 31 Mar 2011 14:00:33 +1100

On Wed, Mar 30, 2011 at 03:49:17PM -0700, Chad Talbott wrote:
> On Wed, Mar 30, 2011 at 3:20 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Wed, Mar 30, 2011 at 11:37:57AM -0400, Vivek Goyal wrote:
> >> We are planning to track the IO context of original submitter of IO
> >> by storing that information in page_cgroup. So that is not the problem.
> >>
> >> The problem google guys are trying to raise is that can a single flusher
> >> thread keep all the groups on bdi busy in such a way so that higher
> >> prio group can get more IO done.
> >
> > Which has nothing to do with IO-less dirty throttling at all!
> 
> Not quite.  Pre IO-less dirty throttling, any thread which was
> dirtying did the writeback itself.  Because there's no shortage of
> threads to do the work, the IO scheduler sees a bunch of threads doing
> writes against a given BDI and schedules them against each other.
> This is how async IO isolation works for us.

And it's precisely this behaviour that makes foreground throttling a
scalability limitation, both from a list/lock contention POV and
from an IO optimisation POV.

> >> So the concern they raised that is single flusher thread per device
> >> is enough to keep faster cgroup full at the bdi and hence get the
> >> service differentiation.
> >
> > I think there's much bigger problems than that.
> 
> We seem to be agreeing that it's a complicated problem.  That's why I
> think async write isolation needs some design-level discussion.

From my perspective, we've still got a significant amount of work
to get writeback into a scalable form for current generation
machines, let alone future machines.

Fixing the writeback code is a slow process because of all the
subtle interactions with different filesystems and different
workloads, which is made more complex by the fact that many filesystems
implement their own writeback paths and have their own writeback
semantics. We need to make the right decision on what IO to issue,
not just issue lots of IO and hope it all turns out OK in the end.
If we can't get that decision matrix right for the simple case of a
global context, then we have no hope of extending it to cgroup-aware
writeback.

IOWs, we need to get writeback working in a scalable manner before
we complicate it immensely with all this cgroup and isolation
madness. Hence I think trying to make writeback cgroup-aware is
probably 6-12 months premature at this point and trying to do it now
will only serve to make it harder to get the common, simple cases
working as we desire them to...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx