Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

On Fri, Apr 22, 2011 at 01:20:40AM +0800, Vivek Goyal wrote:
> On Thu, Apr 21, 2011 at 11:06:18PM +0800, Wu Fengguang wrote:
> 
> [..]
> > 
> > You can get meta data "throttling" and performance at the same time.
> > See the ideas below.
> > 
> > > > 
> > > > Either way, you have the freedom to test whether the passed filp is a
> > > > normal file or a directory "file", and do conditional throttling.
> > > 
> > > Ok, will look into it. That will probably take care of READS. What
> > > about WRITES and meta data? Is it safe to assume that any meta data
> > > write will come in some journalling thread context and not in user
> > > process context?
> > 
> > It's very possible to throttle meta data READS/WRITES, as long as they
> > can be attributed to the original task (assuming task oriented throttling
> > instead of bio/request oriented).
> 
> Even in bio oriented throttling we attribute the bio to a task and hence
> to the group (at least as of today). So from that perspective, it should
> not make much difference.

OK, good to learn about that :)

> > 
> > The trick is to separate the concepts of THROTTLING and ACCOUNTING.
> > You can ACCOUNT data and meta data reads/writes to the right task, and
> > only THROTTLE the task when it's doing data reads/writes.
> 
> Agreed. I mentioned this idea in one of the earlier mails too: account meta
> data but do not throttle it, and use that meta data accounting to throttle
> data for longer periods of time.

That's great.
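To make the split concrete, here is a minimal user-space model of the
idea; the names and numbers are made up for illustration and none of
this is the actual kernel code:

/*
 * ACCOUNT everything (data + meta data), but THROTTLE only at data
 * accesses.
 */
#include <stdbool.h>
#include <stdio.h>

struct task_io_account {
	unsigned long nr_dirtied;	/* ACCOUNTING: data + meta data pages */
	unsigned long pause_thresh;	/* THROTTLING threshold */
};

/* always account, whether the pages are data or meta data */
static void account_dirty(struct task_io_account *t, unsigned long pages)
{
	t->nr_dirtied += pages;
}

/* only data accesses are allowed to trigger the throttle */
static void maybe_throttle(struct task_io_account *t, bool is_metadata)
{
	if (is_metadata)
		return;
	if (t->nr_dirtied > t->pause_thresh) {
		printf("pause, %lu pages dirtied this period\n", t->nr_dirtied);
		t->nr_dirtied = 0;	/* start a new period */
	}
}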

> To implement this, I need to know whether an IO is regular IO or
> meta data IO, and it looks like one of the ways will be for filesystems
> to mark that info in the bio for meta data requests.

OK.
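For the marking itself, something along these lines should be enough on
the filesystem side, assuming a REQ_META-style request flag is what
gets used; the helper names here are only a sketch:

/* Filesystem side: tag meta data submissions so they can be told apart. */
static void fs_submit_meta_bio(int rw, struct bio *bio)
{
	submit_bio(rw | REQ_META, bio);
}

/* Throttling side: test the flag on the incoming bio. */
static bool bio_is_meta(struct bio *bio)
{
	return bio->bi_rw & REQ_META;
}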

> > 
> > FYI I played the same trick for balance_dirty_pages_ratelimited() for
> > another reason: _accurate_ accounting of dirtied pages.
> > 
> > That trick should play well with most applications that do interleaved
> > data and meta data reads/writes. For the special case of "find", which
> > does pure meta data reads, we can still throttle it by playing another
> > trick: THROTTLE meta data reads/writes with a much higher threshold
> > than that of data. So normal applications will almost always be
> > throttled at data accesses while "find" will be throttled at meta data
> > accesses.
> 
> Ok, that makes sense. If an application is doing only meta data
> transactions, then try to limit it after some high limit.
> 
> I am not very sure whether it will run into issues with file system
> dependencies and hence priority inversion.

It should be safe at least for task-context reads. For meta data writes, we
may also differentiate task-context DIRTY, kernel-context DIRTY and
WRITEOUT. We should still be able to throttle task-context meta data
DIRTY, probably not kernel-context DIRTY, and never WRITEOUT.
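
A tiny model of that policy, with made-up thresholds and names, just to
show where the distinctions would plug in:

#include <stdbool.h>

#define DATA_THRESH	128			/* pages; illustrative */
#define META_THRESH	(32 * DATA_THRESH)	/* much higher for meta data */

struct dirty_account {
	unsigned long data_dirtied;
	unsigned long meta_dirtied;
};

static bool should_throttle(const struct dirty_account *acct,
			    bool is_metadata, bool task_context)
{
	if (!task_context)	/* kernel-context DIRTY / WRITEOUT: never */
		return false;
	if (is_metadata)	/* "find"-style workloads hit this eventually */
		return acct->meta_dirtied > META_THRESH;
	return acct->data_dirtied > DATA_THRESH;	/* normal apps hit this */
}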

> > For a real example of how it works, you can check this patch (plus the
> > attached one)
> 
> Ok, I will go through the patches for more details.

Thanks!
FYI this document describes the basic ideas in the first 14 pages.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf

> > 
> > writeback: IO-less balance_dirty_pages()
> > http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=commitdiff;h=e0de5e9961eeb992f305e877c5ef944fcd7a4269;hp=992851d56d79d227beaba1e4dcc657cbcf815556
> > 
> > Where tsk->nr_dirtied does dirty ACCOUNTING and tsk->nr_dirtied_pause
> > is the threshold for THROTTLING. When
> > 
> >         tsk->nr_dirtied > tsk->nr_dirtied_pause
> > 
> > The task will voluntarily enter balance_dirty_pages() for taking a
> > nap (pause time will be proportional to tsk->nr_dirtied), and when
> > finished, start a new account-and-throttle period by resetting
> > tsk->nr_dirtied and possibly adjust tsk->nr_dirtied_pause for a more
> > reasonable pause time at next sleep.
> > 
> > BTW, I'd like to advocate balance_dirty_pages() based IO controller :)
> > 
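
(For reference, the account-and-throttle period quoted above boils down
to roughly the following stand-alone model; the pause formula and
numbers are illustrative, not the ones in the patch.)

#include <stdio.h>
#include <unistd.h>

struct task_dirty {
	unsigned long nr_dirtied;
	unsigned long nr_dirtied_pause;
};

static void fake_balance_dirty_pages(struct task_dirty *t)
{
	/* pause proportional to the pages dirtied in this period */
	unsigned long pause_ms = t->nr_dirtied;	/* 1ms per page, made up */

	usleep(pause_ms * 1000);
	t->nr_dirtied = 0;			/* start a new period */
	/* a real implementation would also adapt nr_dirtied_pause here */
}

static void dirty_pages(struct task_dirty *t, unsigned long pages)
{
	t->nr_dirtied += pages;			/* ACCOUNTING */
	if (t->nr_dirtied > t->nr_dirtied_pause)
		fake_balance_dirty_pages(t);	/* THROTTLING */
}

int main(void)
{
	struct task_dirty t = { .nr_dirtied = 0, .nr_dirtied_pause = 128 };

	for (int i = 0; i < 1000; i++)
		dirty_pages(&t, 4);
	return 0;
}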
> 
> Actually implementing throttling in balance_dirty_pages() is not hard. I
> think it has the following issues.
> 
> - One controls the IO rate coming into the page cache and does not control
>   the IO rate at the outgoing devices. So a flusher thread can still throw
>   lots of writes at a device and completely disrupt read latencies.
> 
>   If buffered WRITES can disrupt READ latencies unexpectedly, then it kind
>   of renders IO controller/throttling useless.

Hmm.. I doubt the IO controller is the right solution to this problem at all.

Read latency under heavy writes is such a fundamental problem that it
would be a failure on Linux's part to tell normal users to resort to
the IO controller just for the sake of good read latencies in the
presence of heavy writes.

It actually helps reduce seeks when the flushers submit async write
requests in bursts (eg. 1 second worth at a time). The disk will then,
kind of optimally, "work on this bdi area on behalf of this flusher for
1 second, and then on the other area for 1 second...". The IO scheduler
should have similar optimizations, which should generally work better
with more clustered data supplies from the flushers. (Sorry, I'm not
tracking the cfq code, so this is all general hypothesis; please
correct me...)

The IO scheduler looks like the right place to safeguard read latencies.
There you already have commit 365722bb917b08b7 ("cfq-iosched:
delay async IO dispatch, if sync IO was just done") and friends.
They do such a good job that if there are continual reads, the async
writes will be totally starved.

But yeah that still leaves sporadic reads at the mercy of heavy
writes, where the default policy will prefer write throughput to read
latencies.

And there is the case of "no heavy writes saturating the disk in the
long term, but still transient heavy writes created by the bursty
flushing". In this case device-level throttling has the nice side
effect of smoothing writes out without performance penalties. However,
if it's so useful that you regard it as an important target, why not
build some smoothing logic into the flushers? It has the great prospect
of benefiting _all_ users _by default_ :)
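
Purely as an illustration of what such smoothing could look like (this
is not a proposal for the actual flusher code), the idea is just to
trickle a big burst out in smaller chunks; chunk size and delay are
made up:

#include <unistd.h>

#define CHUNK_PAGES	256	/* pages per write burst, illustrative */

static void smooth_writeback(unsigned long nr_pages,
			     void (*writeout)(unsigned long pages))
{
	while (nr_pages) {
		unsigned long chunk = nr_pages < CHUNK_PAGES ?
				      nr_pages : CHUNK_PAGES;

		writeout(chunk);	/* submit a modest burst */
		nr_pages -= chunk;
		usleep(10 * 1000);	/* brief gap before the next burst */
	}
}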

> - For the application performance, I thought a better mechanism would be
>   that we come up with a per cgroup dirty ratio. This is equivalent to
>   partitioning the page cache and coming up with each cgroup's share. Now
>   an application can write to this cache as fast as it wants and is only
>   throttled by the balance_dirty_pages() rules.
> 
>   All this IO must be going to some device, and if an admin has put this cgroup
>   in a low bandwidth group, then pages from this cgroup will be written out
>   slowly, hence tasks in this group will be blocked for a longer time.
> 
>  If we can make this work, then an application can write to the cache at a
>  higher rate and at the same time not create havoc at the end device.

The memcg dirty ratio is fundamentally different from blkio
throttling. The former aims to eliminate excessive pageout()s when
reclaiming pages from the memcg LRU lists. It treats "dirty pages" as
the throttle goal, and has the side effect of throttling the task at
the rate the memcg's dirty inodes can be flushed to disk. Its
complexity originates from the correlation with "how the flusher
selects the inodes to write out". Unfortunately the flusher by nature
works in a coarse way..
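
A minimal model of that "dirty pages as throttle goal" idea, with
assumed structures and names, would be no more than:

/* Per-cgroup dirty limit check; everything here is illustrative. */
struct memcg_dirty {
	unsigned long pages_limit;	/* the group's share of the cache */
	unsigned long nr_dirty;		/* dirty pages charged to the group */
};

static int memcg_over_dirty_limit(const struct memcg_dirty *cg)
{
	/* the caller would pause the task / kick the flusher when true */
	return cg->nr_dirty > cg->pages_limit;
}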

OTOH, blkio-cgroup doesn't need to care about inode selection at all.
It's enough to account and throttle the tasks' dirty rate, and let the
flusher freely work on whatever inodes are dirtied.

In this manner, blkio-cgroup dirty rate throttling is more user
oriented, while memcg dirty pages throttling looks like a complex
solution to some technical problems (if I understand it right).

The blkio-cgroup dirty throttling code can mainly go to
page-writeback.c, while the memcg code will mainly go to
fs-writeback.c (balance_dirty_pages() will also be involved, but
that's actually a more trivial part).
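
To illustrate the difference in shape, a rough stand-alone model of
dirty *rate* throttling hooked at balance_dirty_pages() time could look
like the following; all names and the rate math are assumptions, not
code from any patch:

#include <time.h>
#include <unistd.h>

struct group_dirty_rate {
	unsigned long rate_limit;	/* pages per second allowed */
	unsigned long dirtied;		/* pages dirtied since period start */
	time_t period_start;
};

static void group_dirty_throttle(struct group_dirty_rate *g,
				 unsigned long pages)
{
	time_t now = time(NULL);
	double elapsed = difftime(now, g->period_start);

	g->dirtied += pages;
	if (elapsed >= 1.0) {		/* start a new 1s period */
		g->dirtied = pages;
		g->period_start = now;
		return;
	}
	if (g->dirtied > g->rate_limit) {
		/* over budget: sleep out the rest of the period */
		usleep((useconds_t)((1.0 - elapsed) * 1e6));
		g->dirtied = 0;
		g->period_start = time(NULL);
	}
}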

The correlations seem to be:

- you can get the page tagging functionality from memcg, if doing
  async write throttling at the device level

- memcg's dirty pages throttling has a rate limiting side effect,
  which is much less controllable than blkio-cgroup's rate limiting

Thanks,
Fengguang

