Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.

On Thu, Feb 02, 2012 at 11:39:53AM +0100, Jan Kara wrote:
> On Thu 02-02-12 15:52:34, Wu Fengguang wrote:
> > On Thu, Feb 02, 2012 at 02:33:45PM +0800, Wu Fengguang wrote:
> > > Hi Greg,
> > > 
> > > On Wed, Feb 01, 2012 at 12:24:25PM -0800, Greg Thelen wrote:
> > > > On Tue, Jan 31, 2012 at 4:55 PM, KAMEZAWA Hiroyuki
> > > > <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> > > > > 4. dirty ratio
> > > > >   In the last year, patches were posted but not merged. I'd like to
> > > > >   hear about work in this area.
> > > > 
> > > > I would like to attend to discuss this topic.  I have not had much
> > > > time to work on this recently, but should be able to focus on it
> > > > more soon.  The IO-less writeback changes require some redesign and
> > > > may allow for a simpler implementation of
> > > > mem_cgroup_balance_dirty_pages().  Maintaining per-container dirty
> > > > page counts, ratios, and limits is fairly easy, but integration with
> > > > writeback is the challenge.  My big questions for the writeback
> > > > people are:
> > > > 1. how to compute the per-container pause based on bdi bandwidth and
> > > > cgroup dirty page usage.
> > > > 2. how to ensure that writeback will engage even if the system and
> > > > bdi are below their respective background dirty ratios, yet a memcg
> > > > is above its background dirty limit.
> > > 
> > > The solution to (1) and (2) would be something like this:
> > > 
> > > --- linux-next.orig/mm/page-writeback.c	2012-02-02 14:13:45.000000000 +0800
> > > +++ linux-next/mm/page-writeback.c	2012-02-02 14:24:11.000000000 +0800
> > > @@ -654,6 +654,17 @@ static unsigned long bdi_position_ratio(
> > >  	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > >  	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > >  
> > > +	if (memcg) {
> > > +		long long f;
> > > +		x = div_s64((memcg_setpoint - memcg_dirty) << RATELIMIT_CALC_SHIFT,
> > > +			    memcg_limit - memcg_setpoint + 1);
> > > +		f = x;
> > > +		f = f * x >> RATELIMIT_CALC_SHIFT;
> > > +		f = f * x >> RATELIMIT_CALC_SHIFT;
> > > +		f += 1 << RATELIMIT_CALC_SHIFT;
> > > +		pos_ratio = pos_ratio * f >> RATELIMIT_CALC_SHIFT;
> > > +	}
> > > +
> > >  	/*
> > >  	 * We have computed basic pos_ratio above based on global situation. If
> > >  	 * the bdi is over/under its share of dirty pages, we want to scale
> > > @@ -1202,6 +1213,8 @@ static void balance_dirty_pages(struct a
> > >  		freerun = dirty_freerun_ceiling(dirty_thresh,
> > >  						background_thresh);
> > >  		if (nr_dirty <= freerun) {
> > > +			if (memcg && memcg_dirty > memcg_freerun)
> > > +				goto start_writeback;
> > >  			current->dirty_paused_when = now;
> > >  			current->nr_dirtied = 0;
> > >  			current->nr_dirtied_pause =
> > > @@ -1209,6 +1222,7 @@ static void balance_dirty_pages(struct a
> > >  			break;
> > >  		}
> > >  
> > > +start_writeback:
> > >  		if (unlikely(!writeback_in_progress(bdi)))
> > >  			bdi_start_background_writeback(bdi);
> > >  
> > > 
> > > That makes the minimal change to enforce the per-memcg dirty ratio.
> > > It could result in a less stable control system, but should still
> > > be able to balance things out.
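
For reference, the added hunk reuses the cubic control curve that the
global pos_ratio just above it is computed from, only driven by the
memcg's own dirty numbers:

    x = (memcg_setpoint - memcg_dirty) / (memcg_limit - memcg_setpoint)
    f = 1 + x^3

f equals 1 when the memcg sits exactly at its setpoint (no correction),
rises above 1 when it is below the setpoint, and drops toward 0 as
memcg_dirty approaches memcg_limit, squeezing the dirtier's rate to zero.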
> > 
> > Unfortunately the memcg partitioning could fundamentally make the
> > dirty throttling more bumpy.
> > 
> > Imagine 10 memcgs each with
> > 
> > - memcg_dirty_limit=50MB
> > - 1 dd dirty task
> > 
> > The flusher thread will be working on 10 inodes in turn, each time
> > grabbing the next inode and taking ~0.5s to write ~50MB of its dirty
> > pages to disk. So each inode will be flushed every ~5s.
> > 
> > Without the memcg dirty ratio, the dd tasks will be throttled quite
> > smoothly.  However with memcg, each memcg will be limited to 50MB of
> > dirty pages, and the dirty number will drop quickly from 50MB to 0
> > every 5 seconds.
> >
> > As a result, the small partitions of dirty pages will transmit the
> > flusher's bumpy writeout (which is necessary for performance) to the
> > dd tasks, making their progress equally bumpy. The dd tasks will be
> > blocked for seconds at a time.
> > 
> > So I cannot help thinking: can the problem be solved at the root?
> > The basic scheme could be: when reclaiming from a memcg zone, if any
> > PG_writeback/PG_dirty pages are encountered, mark them PG_reclaim,
> > move them to the global zone and de-account them from the memcg.
> > 
> > In this way, we can avoid dirty/writeback pages hurting the (possibly
> > small) memcg zones. The aggressively dirtying tasks will be throttled
> > by the global 20% limit and the memcg page reclaim can go on smoothly.
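
A minimal sketch of that scheme in kernel-style C (untested; the
memcg_move_page_to_root() helper is a hypothetical name, not an
existing API):

	/*
	 * On encountering a dirty/writeback page during memcg reclaim:
	 * tag it so it is reclaimed right after writeout completes, and
	 * transfer its accounting to the root so the (small) memcg LRU
	 * is no longer held hostage by it.
	 */
	if (PageDirty(page) || PageWriteback(page)) {
		SetPageReclaim(page);
		memcg_move_page_to_root(page);	/* hypothetical helper */
	}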
>   If I remember Google's usecase right, their ultimate goal is to partition
> the machine so that processes in memcg A get, say, 1/4 of the available
> disk bandwidth and processes in memcg B get 1/2 of the disk bandwidth.
> 
> Now you can do the bandwidth limiting in CFQ, but it doesn't really work
> for buffered writes because these are done by the flusher thread, which
> ignores memcg boundaries. So they introduce knowledge of memcgs into the
> flusher thread so that writeback done by the flusher thread reflects the
> configured proportions.

Actually, the dirty rate can be controlled independently of the number of dirty pages:

blk-cgroup: async write IO controller 
https://github.com/fengguang/linux/commit/99b1ca4549a79af736ab03247805f6a9fc31ca2d
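
In a nutshell (a simplified sketch, not the code in that commit; the
rate parameter and function name below are illustrative):

	/*
	 * Throttle a buffered writer purely by rate: sleep long enough
	 * that pages_dirtied / elapsed matches the configured rate,
	 * independent of how many dirty pages are outstanding.
	 */
	static void throttle_async_write(unsigned long pages_dirtied,
					 unsigned long rate)	/* pages/s */
	{
		unsigned long pause = HZ * pages_dirtied / max(rate, 1UL);

		__set_current_state(TASK_KILLABLE);
		io_schedule_timeout(pause);
	}

Because the pause depends only on the configured rate, the throttling
stays smooth no matter how bumpily the flusher later cleans the pages.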

> But then the result is that processes in memcg A will simply accumulate
> more dirty pages because writeback is slower for them. So that's why you
> want to stop dirtying processes in that memcg when they reach their

The bandwidth control alone will be pretty smooth, not suffering from
the partition problem. And it doesn't need to alter the flusher's
behavior (such as making it focus on particular inodes), hence it won't
impact performance.

If memcg A's dirty rate is throttled, its dirty page count will naturally
shrink. The flusher will automatically work less on A's dirty pages.
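
Back-of-the-envelope, the steady state is roughly

    nr_dirty(A)  ~  dirty_rate(A) * flusher_revisit_interval

so halving A's dirty rate roughly halves its standing dirty page count,
and the flusher's attention shifts to the busier memcgs on its own.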

> dirty_limit. All in all, I believe bumpier writeback or lower throughput
> (you can choose between the two) is unavoidable for this usecase. But
> OTOH I'm not sure how big a problem this will be in practice because
> machines should be big enough so that even after partitioning you get a
> reasonably sized machine...

The end user may expect big machines to handle 100 or even 1000 memcgs,
so if each memcg corresponds to 1 dd, 1 dirty inode and a 50MB dirty
limit, each inode will wait 50 or 500 seconds between flushes.  The
stall time will then go up to dozens or hundreds of seconds...  The
partition scheme simply won't scale...
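
(Using the ~0.5s of flusher time per 50MB inode from the example above:

    round trip = nr_memcgs * 0.5s = 50s for 100 memcgs, 500s for 1000

and a dd task can stall for up to a full round trip once its memcg hits
its dirty limit.)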

Thanks,
Fengguang
