Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)

On Wed, Mar 30, 2011 at 11:37:57AM -0400, Vivek Goyal wrote:
> On Wed, Mar 30, 2011 at 03:18:02PM +1100, Dave Chinner wrote:
> > On Tue, Mar 29, 2011 at 10:35:13AM -0700, Chad Talbott wrote:
> > > I'd like to propose a discussion topic:
> > > 
> > > IO-less Dirty Throttling Considered Harmful...
> > > 
> > > to isolation and cgroup IO schedulers in general.
> > 
> > Why is that, exactly? The current writeback infrastructure isn't
> > cgroup aware at all, so isn't that the problem you need to solve
> > first?  i.e. how to delegate page cache writeback from
> > one context to another and account for it correctly?
> > 
> > Once you solve that problem, triggering cgroup specific writeback
> > from the throttling code is the same regardless of whether we
> > are doing IO directly from the throttling code or via a separate
> > flusher thread. Hence I don't really understand why you think
> > IO-less throttling is really a problem.
> 
> Dave,
> 
> We are planning to track the IO context of original submitter of IO
> by storing that information in page_cgroup. So that is not the problem.
> 
> The problem the Google guys are trying to raise is whether a single flusher
> thread can keep all the groups on a bdi busy in such a way that a higher
> prio group can get more IO done.

Which has nothing to do with IO-less dirty throttling at all!

> It should not happen that the flusher
> thread gets blocked somewhere (e.g. trying to get request descriptors on
> the request queue)

A major design principle of the bdi-flusher threads is that they
are supposed to block when the request queue gets full - that's how
we got rid of all the congestion garbage from the writeback
stack.
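
To make the back-pressure point concrete, here's a toy userspace sketch -
nothing to do with the real block layer, all names invented: a submitter
grabs one of a fixed pool of request slots and simply sleeps when none are
free, which is exactly the behaviour the flusher is expected to see.

/*
 * Toy illustration only: a submitter blocks on a bounded pool of
 * request slots, the same back-pressure the flusher thread is meant
 * to feel when the request queue fills.  Build with: gcc -pthread.
 */
#include <semaphore.h>
#include <stdio.h>

#define NR_REQUEST_SLOTS 128		/* stand-in for the queue depth */

static sem_t request_slots;

static void submit_io(int i)
{
	/* Sleeps here whenever all slots are in flight. */
	sem_wait(&request_slots);
	printf("request %d dispatched\n", i);
	/* A completion would sem_post(&request_slots) and wake a sleeper. */
}

int main(void)
{
	int i;

	sem_init(&request_slots, 0, NR_REQUEST_SLOTS);

	/* Four submissions against 128 slots never block; push past the
	 * pool size without completions and submit_io() parks in
	 * sem_wait() until a slot is returned. */
	for (i = 0; i < 4; i++)
		submit_io(i);
	return 0;
}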

There are plans to move the bdi-flusher threads to work queues, and
once that is done all your concerns about blocking and parallelism
are pretty much gone because it's trivial to have multiple writeback
works in progress at once on the same bdi with that infrastructure.
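
As a purely illustrative sketch of why that helps - the wb_demo_* names are
made up and this is not the writeback code - here's the shape of several
independent work items queued on one workqueue and allowed to run
concurrently, so one blocked item doesn't stall the rest:

/*
 * Illustrative module-style sketch only - the wb_demo_* names are made
 * up and this is not the writeback code.  It just shows the property
 * that matters: independent work items on one workqueue can be in
 * flight at the same time.
 */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>

static struct workqueue_struct *wb_demo_wq;

struct wb_demo_work {
	struct work_struct work;
	int id;
};

static struct wb_demo_work wb_demo_works[2];

static void wb_demo_fn(struct work_struct *work)
{
	struct wb_demo_work *w = container_of(work, struct wb_demo_work, work);

	/* Pretend to do writeback; blocking here only stalls this item. */
	pr_info("wb_demo: work %d running\n", w->id);
	msleep(10);
}

static int __init wb_demo_init(void)
{
	int i;

	/* max_active > 1: items may run concurrently on this workqueue. */
	wb_demo_wq = alloc_workqueue("wb_demo", WQ_UNBOUND | WQ_MEM_RECLAIM, 2);
	if (!wb_demo_wq)
		return -ENOMEM;

	for (i = 0; i < 2; i++) {
		wb_demo_works[i].id = i;
		INIT_WORK(&wb_demo_works[i].work, wb_demo_fn);
		queue_work(wb_demo_wq, &wb_demo_works[i].work);
	}
	return 0;
}

static void __exit wb_demo_exit(void)
{
	flush_workqueue(wb_demo_wq);
	destroy_workqueue(wb_demo_wq);
}

module_init(wb_demo_init);
module_exit(wb_demo_exit);
MODULE_LICENSE("GPL");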

> or that it tries to dispatch too much IO from an inode which
> primarily contains pages from a low prio cgroup, so the high prio cgroup's
> task does not get enough pages dispatched to the device and hence gets
> no prio over the low prio group.

That's a writeback scheduling issue independent of how we throttle,
and something we don't do at all right now. Our only decision on
what to write back is based on how long ago the inode was dirtied.
You need to completely rework the dirty inode tracking if you want
to efficiently prioritise writeback between different groups.
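
To spell that out with a toy userspace sketch - invented struct and field
names, not the real inode or writeback structures - the only selection key
the current tracking gives you is the dirtied time, so group priority never
enters the picture:

/*
 * Userspace sketch only - invented struct and field names, not the
 * real inode or writeback structures.  The point: the only ordering
 * key available today is when the inode was dirtied, so the cgroup
 * never influences which inode gets written next.
 */
#include <stdio.h>

struct demo_inode {
	unsigned long ino;
	unsigned long dirtied_when;	/* when the inode was first dirtied */
	int cgroup_id;			/* present, but never consulted below */
};

/* Current rule in miniature: pick the oldest-dirtied inode, nothing else. */
static struct demo_inode *pick_next(struct demo_inode *inodes, int n)
{
	struct demo_inode *oldest = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (!oldest || inodes[i].dirtied_when < oldest->dirtied_when)
			oldest = &inodes[i];
	}
	return oldest;
}

int main(void)
{
	struct demo_inode inodes[] = {
		{ .ino = 10, .dirtied_when = 100, .cgroup_id = 1 },
		{ .ino = 11, .dirtied_when =  50, .cgroup_id = 2 },
		{ .ino = 12, .dirtied_when =  75, .cgroup_id = 1 },
	};
	struct demo_inode *next = pick_next(inodes, 3);

	/* Prints ino 11: age decides, cgroup_id is irrelevant. */
	printf("next: ino %lu (cgroup %d)\n", next->ino, next->cgroup_id);
	return 0;
}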

Given that filesystems don't all use the VFS dirty inode tracking
infrastructure and specific filesystems have different ideas of the
order of writeback, you've got a really difficult problem there.
e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
purposes which will completely screw any sort of prioritised
writeback. Remember the ext3 "fsync = global sync" latency problems?

> Currently we can also do some IO in the context of the writing process,
> hence a faster group can try to dispatch its own pages to the bdi for
> writeout. With IO-less throttling, that notion will disappear.

We'll still do exactly the same amount of throttling - what we write
back is still the same decision, just made in a different place with
a different trigger.

> So the concern they raised is whether a single flusher thread per device
> is enough to keep the faster cgroup busy at the bdi and hence get the
> service differentiation.

I think there's much bigger problems than that.

> My take on this is that on a slow SATA device it might be enough, as long
> as we make sure that the flusher thread does not block on individual groups

I don't think you can ever guarantee that - e.g. delayed allocation
will need metadata to be read from disk to perform the allocation,
so preventing blocking is impossible. Besides, see above about using
work queues rather than threads for flushing.

> and also try to select inodes intelligently (in a cgroup aware manner).

Such selection algorithms would need to be able to handle hundreds
of thousands of newly dirtied inodes per second so sorting and
selecting them efficiently will be a major issue...
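
Just to illustrate the scale argument - a userspace sketch of one possible
shape, not something being proposed here, and all names are invented: if
newly dirtied inodes land on a per-group FIFO they arrive in time order
anyway, so insertion stays O(1) and selection only has to look at the group
heads rather than sort anything.

/*
 * Userspace sketch of one possible shape, not something proposed in
 * this mail; all names are invented.  Newly dirtied inodes are
 * appended to a per-group FIFO (they already arrive in time order),
 * so insertion is O(1) and selection only walks the group heads -
 * nothing is ever sorted.
 */
#include <stdio.h>

struct demo_inode {
	unsigned long ino;
	unsigned long dirtied_when;
	struct demo_inode *next;
};

struct demo_group {
	int prio;			/* higher value = higher priority */
	struct demo_inode *head, *tail;	/* FIFO, oldest at the head */
};

/* O(1) append: no sorting, because inodes are dirtied in time order. */
static void group_dirty_inode(struct demo_group *g, struct demo_inode *inode)
{
	inode->next = NULL;
	if (g->tail)
		g->tail->next = inode;
	else
		g->head = inode;
	g->tail = inode;
}

/* Selection cost scales with the number of groups, not dirtied inodes. */
static struct demo_inode *pick_next_inode(struct demo_group *groups, int n)
{
	struct demo_group *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (!groups[i].head)
			continue;
		if (!best || groups[i].prio > best->prio)
			best = &groups[i];
	}
	return best ? best->head : NULL;
}

int main(void)
{
	struct demo_group groups[2] = { { .prio = 1 }, { .prio = 10 } };
	struct demo_inode a = { .ino = 1, .dirtied_when = 5 };
	struct demo_inode b = { .ino = 2, .dirtied_when = 9 };

	group_dirty_inode(&groups[0], &a);	/* low prio group */
	group_dirty_inode(&groups[1], &b);	/* high prio group */

	/* Picks ino 2: the high prio group's oldest inode. */
	printf("next: ino %lu\n", pick_next_inode(groups, 2)->ino);
	return 0;
}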

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx