On Thu, Mar 31, 2011 at 10:34:03AM -0400, Chris Mason wrote:
> Excerpts from Vivek Goyal's message of 2011-03-31 10:16:37 -0400:
> > On Thu, Mar 31, 2011 at 09:20:02AM +1100, Dave Chinner wrote:
> >
> > [..]
> > > > It should not happen that the flusher
> > > > thread gets blocked somewhere (trying to get request descriptors on
> > > > the request queue)
> > >
> > > A major design principle of the bdi-flusher threads is that they
> > > are supposed to block when the request queue gets full - that's how
> > > we got rid of all the congestion garbage from the writeback
> > > stack.
> >
> > Instead of blocking flusher threads, can they voluntarily stop submitting
> > more IO when they realize too much IO is in progress? We already keep
> > stats of how much IO is under writeback on the bdi (BDI_WRITEBACK), and
> > the flusher thread can use that?
>
> We could, but the difficult part is keeping the hardware saturated as
> requests complete.  The voluntarily stopping part is pretty much the
> same thing the congestion code was trying to do.

And it was the bit that was causing most problems. IMO, we don't
want to go back to that single threaded mechanism, especially as we
have no shortage of cores and threads available...

> > > There are plans to move the bdi-flusher threads to work queues, and
> > > once that is done all your concerns about blocking and parallelism
> > > are pretty much gone because it's trivial to have multiple writeback
> > > works in progress at once on the same bdi with that infrastructure.
> >
> > Will this essentially not nullify the advantage of IO-less throttling?
> > I thought that we did not want to have multiple threads doing writeback
> > at the same time, to avoid extra seeks and achieve better throughput.
>
> Work queues alone are probably not appropriate, at least for spinning
> storage.  They will introduce seeks into what would have been
> sequential writes.  I had to make the btrfs worker thread pools after
> having a lot of trouble cramming writeback into work queues.

That was before the cmwq infrastructure, right? cmwq changes the
behaviour of workqueues in such a way that they can simply be
thought of as having a thread pool of a specific size....

As a strict translation of the existing one-flusher-thread-per-bdi
setup, only allowing one work at a time to be issued (i.e. a
workqueue concurrency of 1) would give the same behaviour without
all the thread management issues. i.e. regardless of the writeback
parallelism mechanism, we have the same issue of managing writeback
to minimise seeking. cmwq just makes the implementation far
simpler, IMO.

As to whether that causes seeks or not, that depends on how we are
driving the concurrent works/threads. If we drive a concurrent work
per dirty cgroup that needs writing back, then we achieve the
concurrency needed to make the IO scheduler appropriately throttle
the IO. For the case of no cgroups, we still only have a single
writeback work in progress at a time and behaviour is no different
to the current setup. Hence I don't see any particular problem with
using workqueues to achieve the necessary writeback parallelism
that cgroup-aware throttling requires....
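To make that concrete, here is a rough sketch of what the workqueue
side could look like. None of this is real tree code - wb_work,
wb_workfn() and bdi_writeback_wq are invented names, and a real
conversion would want per-bdi concurrency control rather than a
single global max_active - it is only meant to show how little
infrastructure cmwq leaves us to write:

  #include <linux/init.h>
  #include <linux/slab.h>
  #include <linux/workqueue.h>
  #include <linux/backing-dev.h>

  /* one writeback work item: just a bdi for now */
  struct wb_work {
          struct work_struct      work;
          struct backing_dev_info *bdi;
  };

  static struct workqueue_struct *bdi_writeback_wq;

  static void wb_workfn(struct work_struct *work)
  {
          struct wb_work *ww = container_of(work, struct wb_work, work);

          /*
           * Walk ww->bdi's dirty inodes and write them back. Per-cgroup
           * filtering would hook in here once the work item carries a
           * cgroup reference.
           */
          kfree(ww);
  }

  static int __init bdi_writeback_wq_init(void)
  {
          /*
           * max_active = 1 is the strict translation of "one flusher at
           * a time" - only a single writeback work ever runs, so no
           * extra seeking. Raising max_active and queueing one wb_work
           * per dirty cgroup is all it takes to get the concurrency
           * that cgroup-aware throttling needs.
           */
          bdi_writeback_wq = alloc_workqueue("bdi-writeback", WQ_UNBOUND, 1);
          return bdi_writeback_wq ? 0 : -ENOMEM;
  }

  /* kicking writeback for a bdi then becomes: */
  static int wb_queue_work(struct backing_dev_info *bdi)
  {
          struct wb_work *ww = kzalloc(sizeof(*ww), GFP_NOFS);

          if (!ww)
                  return -ENOMEM;
          ww->bdi = bdi;
          INIT_WORK(&ww->work, wb_workfn);
          queue_work(bdi_writeback_wq, &ww->work);
          return 0;
  }

All the thread pool management is cmwq's problem; the writeback code
only decides which work items to queue, which is exactly where the
per-cgroup policy would live.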
> > > > or it tries to dispatch too much IO from an inode which
> > > > primarily contains pages from a low prio cgroup, and the high prio
> > > > cgroup task does not get enough pages dispatched to the device,
> > > > hence not getting any prio over the low prio group.
> > >
> > > That's a writeback scheduling issue independent of how we throttle,
> > > and something we don't do at all right now. Our only decision on
> > > what to write back is based on how long ago the inode was dirtied.
> > > You need to completely rework the dirty inode tracking if you want
> > > to efficiently prioritise writeback between different groups.
> > >
> > > Given that filesystems don't all use the VFS dirty inode tracking
> > > infrastructure and specific filesystems have different ideas of the
> > > order of writeback, you've got a really difficult problem there.
> > > e.g. ext3/4 and btrfs use ordered writeback for filesystem integrity
> > > purposes which will completely screw any sort of prioritised
> > > writeback. Remember the ext3 "fsync = global sync" latency problems?
> >
> > Ok, so if one issues an fsync when the filesystem is mounted in
> > "data=ordered" mode, we will flush all the writes to disk before
> > committing metadata.
> >
> > I have no knowledge of filesystem code so here comes a stupid question.
> > Do multiple fsyncs get completely serialized, or can they progress in
> > parallel? IOW, if an fsync is in progress and we slow down the writeback
> > of that inode's pages, can other fsyncs still make progress without
> > getting stuck behind the previous fsync?
>
> An fsync has two basic parts:
>
> 1) write the file data pages
> 2a) flush data=ordered in reiserfs/ext3/4
> 2b) do the real transaction commit
>
> We can do part one in parallel across any number of writers.  For part
> two, there is only one running transaction.  If the FS is smart, the
> commit will only force down the transaction that last modified the
> file.  50 procs running fsync may only need to trigger one commit.

Right. However, the real issue here, I think, is that the IO comes
from a thread that is not associated with writeback, nor is it in
any way cgroup aware. IOWs, getting the right context to each block
being written back will be complex and filesystem specific.

The other thing that concerns me is how metadata IO is accounted and
throttled. Doing stuff like creating lots of small files will
generate as much or more metadata IO than data IO, and none of that
will be associated with a cgroup. Indeed, in XFS metadata doesn't
even use the pagecache anymore, and it's written back by a thread
(soon to be a workqueue) deep inside XFS's journalling subsystem, so
it's pretty much impossible to associate that IO with any specific
cgroup. What happens to that IO? Blocking it arbitrarily can have
the same effect as blocking transaction completion - it can cause
the filesystem to completely stop....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx