Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)

On Fri, Apr 22, 2011 at 12:21:23PM +0800, Wu Fengguang wrote:

[..]
> > > BTW, I'd like to advocate balance_dirty_pages() based IO controller :)
> > > 
> > 
> > Actually implementing throttling in balance_dirty_pages() is not hard. I
> > think it has following issues.
> > 
> > - One controls the IO rate coming into the page cache and does not control
> >   the IO rate at the outgoing devices. So a flusher thread can still throw
> >   lots of writes at a device and completely disrupting read latencies.
> > 
> >   If buffered WRITES can disrupt READ latencies unexpectedly, then it kind
> >   of renders IO controller/throttling useless.
> 
> Hmm..I doubt IO controller is the right solution to this problem at all.
> 
> It's such a fundamental problem that it would be Linux's failure to
> recommend normal users to use IO controller for the sake of good read
> latencies in the presence of heavy writes.

It is, and we have modified CFQ a lot to tackle it, but still...

Just do a "dd if=/dev/zero of=/zerofile bs=1M count=4K" on your root
disk, then try to launch firefox and browse a few websites, and see if
you are happy with firefox's responsiveness. It took me more than a
minute to launch firefox and be able to type in and load the first
website.

But I agree that READ latencies in the presence of WRITES can be a
problem independent of the IO controller.

There is also the cluster case, where IO comes to shared storage from
multiple hosts, and one probably does not want a flurry of WRITES from
one host to severely impact the IO of the other hosts. In that case the
IO scheduler can't do much, as it has a view of only a single system.

Secondly, the whole point of an IO controller is that it gives the user
more control over IO instead of a fixed system-wide policy. For
example, an admin might want better latencies for READS and be willing
to give up WRITE throughput. With a properly implemented IO controller,
he can put his WRITE-intensive application in a cgroup with a WRITE
limit of 20MB/s. Now the READ latencies in the root cgroup should be
better, and maybe predictable too, as we know the WRITE rate to disk
never exceeds 20MB/s.

Also, it is only CFQ which gives READS so much preference over WRITES.
deadline and noop, which we typically use on faster storage, do not.
There we might take a bigger hit on READ latencies, depending on the
storage and how badly it is affected by a burst of WRITES.

I guess it boils down to better system control and better predictability.

So I think throttling buffered writes in balance_dirty_pages() is
better than not providing any way to control buffered WRITES at all,
but controlling them at the end device gives much better control over
IO and serves more use cases.

> 
> It actually helps reducing seeks when the flushers submit async write
> requests in bursts (eg. 1 second). It will then kind of optimally
> "work on this bdi area on behalf of this flusher for 1 second, and
> then to the other area for 1 second...". The IO scheduler should have
> similar optimizations, which should generally work better with more
> clustered data supplies from the flushers. (Sorry I'm not tracking the
> cfq code, so it's all general hypothesis and please correct me...)
> 

Isolation and throughput are orthogonal. You go for better isolation
and you essentially pay with reduced throughput. As a user, one can
decide one's priorities. I see it as a slider: on one end 100%
isolation, on the other 100% throughput, and a user can place it
anywhere in between depending on his/her needs. One of the goals of the
IO controller is to provide that fine-grained control. By implementing
throttling in balance_dirty_pages() we really lose that capability.

Also, the flusher will still submit requests in bursts and will still
pick one inode at a time so that IO is as sequential as possible. We
will still do the IO-less throttling to reduce seeks. Doing the IO
throttling below the page cache additionally gives us the capability to
control the flusher's IO bursts, a fine-grained control which is lost
if we do the control while entering the page cache.

> The IO scheduler looks like the right owner to safeguard read latencies.
> Where you already have the commit 365722bb917b08b7 ("cfq-iosched:
> delay async IO dispatch, if sync IO was just done") and friends.
> They do such a good job that if there are continual reads, the async
> writes will be totally starved.
> 
> But yeah that still leaves sporadic reads at the mercy of heavy
> writes, where the default policy will prefer write throughput to read
> latencies.

Well, there is no single default policy as such. CFQ prioritizes READs
as much as it can; deadline does so much less. So as I said previously,
we really are not controlling the burst. We are leaving it to the IO
scheduler to handle as per its policy, and we lose isolation between
the groups, which is the primary purpose of the IO controller.

IOW, doing throttling below page cache allows us much better/smoother
control of IO.

> 
> And there is the "no heavy writes to saturate the disk in long term,
> but still temporal heavy writes created by the bursty flushing" case.
> In this case the device level throttling has the nice side effect of
> smoothing writes out without performance penalties. However, if it's
> so useful so that you regard it as an important target, why not build
> some smoothing logic into the flushers? It has the great prospect of
> benefiting _all_ users _by default_ :)

We have already implemented the control at the lower layers, so we
really don't have to build a secondary control now. It is just that the
rest of the subsystems have to be aware of cgroups and play nicely.

At a high level, smoothing logic is just another throttling technique:
throttle the process abruptly, or apply a more complex technique to
smooth out the traffic. It is just a knob. The key question is where in
the stack to put the knob for the maximum degree of control.

flusher logic is already complicated. I am not sure what we will gain
by teaching the flushers about IO rates and throttling based on user
policies. We can let the lower layers do it, as long as we make sure
the flusher is aware of cgroups and can select inodes to flush in such
a manner that it does not get blocked behind slow cgroups and can keep
all the cgroups busy.

The challenge I am facing here is the filesystem's dependencies on IO.
One example is that if I throttle fsync IO, it leads to issues with
journalling, and other IO in the filesystem appears to stall.

> 
> > - For the application performance, I thought a better mechanism would be
> >   that we come up with per cgroup dirty ratio. This is equivalent to
> >   partitioning the page cache and coming up with cgroup's share. Now
> >   an application can write to this cache as fast as it want and is only
> >   throttled either by balance_dirty_pages() rules.
> > 
> >   All this IO must be going to some device and if an admin has put this cgroup
> >   in a low bandwidth group, then pages from this cgroup will be written
> >   slowly hence tasks in this group will be blocked for longer time.
> > 
> >  If we can make this work, then application can write to cache at higher
> >  rate at the same time not create a havoc at the end device.  
> 
> The memcg dirty ratio is fundamentally different from blkio
> throttling. The former aims to eliminate excessive pageout()s when
> reclaiming pages from the memcg LRU lists. It treats "dirty pages" as
> throttle goal, and has the side effect throttling the task at the rate
> the memcg's dirty inodes can be flushed to disk. Its complexity
> originates from the correlation with "how the flusher selects the
> inodes to writeout". Unfortunately the flusher by nature works in a
> coarse way..

memcg dirty ratio is a different problem, but it needs to work with the
IO controller to solve the whole issue. If all IO were direct IO, with
no page cache in the picture, we would not need memcg. But the moment
the page cache comes into the picture, so does the notion of logically
dividing that cache among cgroups, and with it a per-cgroup dirty
ratio: even if overall cache usage is low, once a cgroup has consumed
its share of dirty pages we need to throttle it and ask the flusher to
send IO to the underlying devices.

The IO controller sits below the page cache. So we need to make sure
that memcg is enhanced to support a per-cgroup dirty ratio, and train
the flusher threads so that they are aware of cgroups and can do
writeout in a per-memcg-aware manner. Greg Thelen is working on putting
these two pieces together.

So memcg dirty ratio is a different problem but is required to make IO
controller work for buffered WRITES.

> 
> OTOH, blkio-cgroup don't need to care about inode selection at all.
> It's enough to account and throttle tasks' dirty rate, and let the
> flusher freely work on whatever dirtied inodes.

That goes back to the model of putting the knob in
balance_dirty_pages(). Yes, it simplifies the implementation, but it
also takes away the capability for better control. One would still see
bursts of WRITES at the end devices.

> 
> In this manner, blkio-cgroup dirty rate throttling is more user
> oriented. While memcg dirty pages throttling looks like a complex
> solution to some technical problems (if me understand it right).

If we implement IO throttling in balance_dirty_pages(), then we don't
require the memcg dirty ratio for it to work. But we will still require
the memcg dirty ratio for other reasons:

- Proportional IO control for CFQ
- memcg's own problem of needing to start writing out a cgroup's pages
  earlier.

> 
> The blkio-cgroup dirty throttling code can mainly go to
> page-writeback.c, while the memcg code will mainly go to
> fs-writeback.c (balance_dirty_pages() will also be involved, but
> that's actually a more trivial part).
> 
> The correlations seem to be,
> 
> - you can get the page tagging functionality from memcg, if doing
>   async write throttling at device level
> 
> - the side effect of rate limiting by memcg's dirty pages throttling,
>   which is much less controllable than blkio-cgroup's rate limiting

Well, I thought memcg's per-cgroup dirty ratio and the IO controller's
rate limit would work together. memcg keeps track of each cgroup's
share of the page cache, and when cache usage exceeds a certain
percentage it asks the flusher to send IO to the device; the IO
controller then throttles that IO. Now if the cgroup's rate limit is
low, the tasks of that cgroup will be throttled longer in
balance_dirty_pages().

So throttling happens at two layers. One throttling is in
balance_dirty_pages(), which is not directly driven by user-supplied
parameters; it depends on this cgroup's share of the page cache and on
the effective IO rate the cgroup is getting. The real IO throttling
happens at the device level, which is driven by user-supplied
parameters and which in turn indirectly decides how tasks are throttled
in balance_dirty_pages().

I have yet to look at your implementation of throttling, but keep in
mind that once the IO controller comes into the picture, the
throttling/smoothing mechanism also needs to account for direct writes,
and we should be able to use the same algorithms for throttling READS.

Thanks
Vivek

