Re: IO less throttling and cgroup aware writeback (Was: Re: [Lsf] Preliminary Agenda and Activities for LSF)

On Fri, Apr 01, 2011 at 03:36:05PM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2011 at 09:34:24PM -0400, Vivek Goyal wrote:
> > On Fri, Apr 01, 2011 at 09:14:25AM +1100, Dave Chinner wrote:
> > 
> > [..]
> > > > An fsync has two basic parts
> > > > 
> > > > 1) write the file data pages
> > > > 2a) flush data=ordered in reiserfs/ext34
> > > > 2b) do the real transaction commit
> > > > 
> > > > 
> > > > We can do part one in parallel across any number of writers.  For part
> > > > two, there is only one running transaction.  If the FS is smart, the
> > > > commit will only force down the transaction that last modified the
> > > > file. 50 procs running fsync may only need to trigger one commit.
> > > 
> > > Right. However, the real issue here, I think, is that the IO comes
> > > from a thread that is neither associated with writeback nor in any
> > > way cgroup aware. IOWs, getting the right context to each block being
> > > written back will be complex and filesystem specific.
> > > 
> > > The other thing that concerns me is how metadata IO is accounted and
> > > throttled. Doing stuff like creating lots of small files will
> > > generate as much or more metadata IO than data IO, and none of that
> > > will be associated with a cgroup. Indeed, in XFS metadata doesn't
> > > even use the pagecache anymore, and it's written back by a thread
> > > (soon to be a workqueue) deep inside XFS's journalling subsystem, so
> > > it's pretty much impossible to associate that IO with any specific
> > > cgroup.
> > > 
> > > What happens to that IO? Blocking it arbitrarily can have the same
> > > effect as blocking transaction completion - it can cause the
> > > filesystem to completely stop....
> > 
> > Dave,
> > 
> > As of today, the cgroup/context of IO is decided from the IO submitting
> > thread context. So any IO submitted by kernel threads (flusher, kjournald,
> > workqueue threads) goes to the root group, which should remain
> > unthrottled. (It is not a good idea to put throttling rules on the
> > root group.)
> > 
> > Now any metadata operation happening in process context will still
> > be subject to throttling (are there any?).
> 
> Certainly - almost all metadata _reads_ will occur in process
> context, though for XFS _most_ writes occur in kernel thread context.
> That being said, we can still get kernel threads hung up on metadata
> read IO that has been throttled in process context.
> 
> e.g. a process is creating a new inode, which causes allocation to
> occur, which triggers a read of a free space btree block, which gets
> throttled.  Writeback comes along, tries to do delayed allocation,
> gets hung up trying to allocate out of the same AG that is locked by
> the process creating a new inode. A single allocation can lock
> multiple AGs, and so if we get enough backed-up allocations this can
> cause all AGs in the filesystem to become locked. At this point no
> new allocation can complete until the throttled IO is submitted,
> completed and the allocation is committed and the AG unlocked....
> 
> > If that's a concern,
> > can the filesystem mark such bios (REQ_META?) so that the throttling
> > logic can let them pass through?
> 
> We already tag most metadata IO in this way.
> 
> However, you can't just not throttle metadata IO. e.g. a process
> doing a directory traversal (e.g. a find) will issue hundreds of IOs
> per second so if you don't throttle them it will adversely affect
> the throughput of other groups that you are trying to guarantee a
> certain throughput or iops rate for. Indeed, not throttling metadata
> writes will seriously harm throughput for controlled cgroups when
> the log fills up and the filesystem pushes out thousands of metadata
> IOs in a very short period of time.
> 
> Yet if we combine that with the problem that anywhere you delay
> metadata IO for arbitrarily long periods of time (read or write) via
> priority based mechanisms, you risk causing a train-smash of blocked
> processes all waiting for the throttled IO to complete. And that will
> seriously harm throughput for controlled cgroups because they can't
> make any modifications to the filesystem.
> 
> I'm not sure if there is any middle ground here - I can't see any at
> this point...

This is indeed a tricky situation, especially the case of writes
getting blocked behind throttled reads. I think virtual machines are
the best use case here: one can avoid using the host's filesystem
entirely, and with it all of these serialization issues.

Or we can advise against setting very low limits on any cgroup. That
way, even if things get serialized once in a while, the backlog will
clear quickly. It hurts scalability and performance, though.

Or modify filesystems so they can mark *selective* metadata IO as
REQ_NOTHROTTLE. If the filesystem can determine that a write depends on
a metadata read, it marks that read as REQ_NOTHROTTLE; in the example
above, that would be the read of the free space btree block needed to
allocate the new inode. (A rough sketch of this idea follows below.)

Or live with reduced isolation by not throttling metadata IO at all.
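
To make the REQ_NOTHROTTLE idea concrete, a rough kernel-style sketch.
The flag, the bit chosen for it, and both helpers are hypothetical;
only REQ_META, bio->bi_rw and submit_bio() exist as used here:

/* Hypothetical flag -- assume a free bit in bio->bi_rw for this sketch. */
#define REQ_NOTHROTTLE	(1UL << 30)

/*
 * Filesystem side (hypothetical helper): submit a metadata read that a
 * pending write depends on, e.g. the free space btree block needed to
 * complete a delayed allocation.
 */
static void fs_submit_dependent_meta_read(struct bio *bio)
{
	bio->bi_rw |= REQ_META | REQ_NOTHROTTLE;
	submit_bio(READ, bio);
}

/*
 * Throttling side (hypothetical hook in the blk-throttle submit path):
 * dependent metadata reads bypass the throttle queues so that writeback
 * is never stuck behind a throttled read.
 */
static bool throtl_bio_bypasses_limits(struct bio *bio)
{
	return (bio->bi_rw & REQ_NOTHROTTLE) != 0;
}

Every read tagged this way escapes the limits, so the tagging has to
stay narrow (only reads that a write genuinely depends on), or we are
back to not throttling metadata at all.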
 
> 
> > Determining the cgroup/context from the submitting process has the
> > issue that writeback IO is not throttled, and we are looking for a
> > way to control buffered writes as well. If we start determining the
> > cgroup from information stored in page_cgroup, then we are more
> > likely to run into priority inversion issues (a filesystem in
> > ordered mode flushing data before committing metadata changes). So
> > should we instead throttle buffered writes when the page cache is
> > being dirtied, and not when these writes are being written back to
> > the device?
> 
> I'm not sure what you mean by this paragraph - AFAICT, this
> is exactly the way we throttle buffered writes right now.

Actually I was referring to throttling in terms of an IO rate
(bytes_per_second or io_per_second). The notion of dirty_ratio or
dirty_bytes is not, by itself, sufficient for that kind of throttling.
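
To be explicit about what I mean by an IO rate, here is a minimal
userspace token-bucket sketch of bytes_per_second throttling. This is
illustrative only; blk-throttle's real implementation keeps per-cgroup
state and dispatch queues:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct throttle {
	uint64_t bps;		/* configured bytes_per_second limit */
	double tokens;		/* accumulated byte credit */
	struct timespec last;	/* time of last refill */
};

static void refill(struct throttle *t)
{
	struct timespec now;
	clock_gettime(CLOCK_MONOTONIC, &now);
	double dt = (now.tv_sec - t->last.tv_sec) +
		    (now.tv_nsec - t->last.tv_nsec) / 1e9;
	t->tokens += dt * (double)t->bps;
	if (t->tokens > (double)t->bps)	/* cap the burst at one second's worth */
		t->tokens = (double)t->bps;
	t->last = now;
}

/* Returns 1 if an IO of 'bytes' may be dispatched now, 0 if it must wait. */
static int may_dispatch(struct throttle *t, uint64_t bytes)
{
	refill(t);
	if (t->tokens >= (double)bytes) {
		t->tokens -= (double)bytes;
		return 1;
	}
	return 0;
}

int main(void)
{
	/* 1 MB/s limit; bucket starts full so the first IO goes through. */
	struct throttle t = { .bps = 1 << 20, .tokens = 1 << 20 };
	clock_gettime(CLOCK_MONOTONIC, &t.last);
	printf("4k IO dispatched: %d\n", may_dispatch(&t, 4096));
	return 0;
}

dirty_ratio/dirty_bytes only bound how much dirty data can accumulate;
they say nothing about the rate at which IO is dispatched, which is
what the above enforces.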

Thanks
Vivek

