Re: [PATCH 2/5] writeback: dirty position control

Jan Kara <jack@xxxxxxx> · Thu, 18 Aug 2011 21:16:12 +0200



On Thu 18-08-11 12:18:01, Wu Fengguang wrote:
> > > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > > +	 */
> > > > > +	setpoint = (freerun + limit) / 2;
> > > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > > +		    limit - setpoint + 1);
> > > > > +	pos_ratio = x;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > > +
> > > > > +	/*
> > > > > +	 * bdi setpoint
> >   OK, so if I understand the code right, we now have basic pos_ratio based
> > on global situation. Now, in the following code, we might scale pos_ratio
> > further down, if bdi_dirty is too much over bdi's share, right?
> 
> Right.
> 
> > Do we also want to scale pos_ratio up, if we are under bdi's share?
> 
> Yes.
> 
> > If yes, do we really want to do it even if global pos_ratio < 1
> > (i.e. we are over global setpoint)?
> 
> Yes. It's safe because the bdi pos_ratio scale is linear and the
> global pos_ratio scale will quickly drop to 0 near @limit, thus
> counter-acting any > 1 bdi pos_ratio.
  OK. I just wanted to make sure I understand it right :-). I can see
arguments for all the different choices so let's see how it works in
practice...
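
For reference, the global control line discussed above can be modelled in user space. This is only a sketch under stated assumptions, not the kernel code: global_pos_ratio() is a hypothetical name, all operands are widened to signed 64-bit before subtracting, and plain division replaces the right shifts so that negative intermediates stay well defined in portable C (which rounds toward zero rather than down).

```c
#include <assert.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10	/* same value as in the patch */

/*
 * User-space model of the global control line
 *
 *	f(dirty) = 1.0 + ((setpoint - dirty) / (limit - setpoint))^3
 *
 * Returns pos_ratio in fixed point, where 1 << RATELIMIT_CALC_SHIFT
 * represents 1.0. The function name and signature are illustrative.
 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	const int64_t one = (int64_t)1 << RATELIMIT_CALC_SHIFT;
	int64_t setpoint, x, pos_ratio;

	assert(freerun >= 0 && freerun < limit);

	if (dirty >= limit)
		return 0;

	setpoint = (freerun + limit) / 2;

	/* x = (setpoint - dirty) / (limit - setpoint), scaled by 'one' */
	x = (setpoint - dirty) * one / (limit - setpoint + 1);

	/* cube x, rescaling after each multiplication */
	pos_ratio = x;
	pos_ratio = pos_ratio * x / one;
	pos_ratio = pos_ratio * x / one;

	return pos_ratio + one;
}
```

With freerun = 1000 and limit = 3000 pages, global_pos_ratio(1000, 3000, 2000) returns exactly 1024 (1.0 at the setpoint), the value at freerun is just below 2048, and the value one page under the limit is close to 0, which is the rapid drop near @limit that counteracts a bdi scale above 1.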

> > > > > +	 *
> > > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> >                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> > bdi_setpoint to distinguish clearly from the global value.
> 
> OK. I'll add a new variable bdi_setpoint, too, to make it consistent
> all over the places.
> 
> > > > > +	 *
> > > > > +	 * The main bdi control line is a linear function that is subject to
> > > > > +	 *
> > > > > +	 * (1) f(setpoint) = 1.0
> > > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > > +	 *
> > > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > > +	 * regularly within range
> > > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > > +	 * for various filesystems, where (2) yields a reasonable 12.5%
> > > > > +	 * fluctuation range for pos_ratio.
> > > > > +	 *
> > > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > > +	 * own size, so move the slope over accordingly.
> > > > > +	 */
> > > > > +	if (unlikely(bdi_thresh > thresh))
> > > > > +		bdi_thresh = thresh;
> > > > > +	/*
> > > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > > +	 */
> > > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > > +	/*
> > > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > > +	 * (thresh - bdi_thresh ~= 0) and transition to bdi_thresh in JBOD case.
> > > > > +	 */
> > > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > > +		       thresh + 1);
> > > >   I think you can slightly simplify this to:
> > > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > > 
> > > Good idea!
> > > 
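
As a quick sanity check that the two span expressions agree, both can be compared in user space. This is a sketch only: span_original() and span_simplified() are hypothetical names, the inputs are plain 64-bit integers in pages (and pages per second for write_bw), and bdi_thresh is assumed to be already clamped to thresh.

```c
#include <assert.h>
#include <stdint.h>

/* Form from the quoted patch: a single 64-bit division by thresh + 1. */
static uint64_t span_original(uint64_t thresh, uint64_t bdi_thresh,
			      uint64_t write_bw)
{
	assert(bdi_thresh <= thresh);

	return (bdi_thresh * (thresh - bdi_thresh) +
		4 * write_bw * bdi_thresh) / (thresh + 1);
}

/*
 * Simplified form: reuse x = bdi_thresh / thresh, kept in 16-bit
 * fixed point, which was already computed for scaling the setpoint.
 */
static uint64_t span_simplified(uint64_t thresh, uint64_t bdi_thresh,
				uint64_t write_bw)
{
	uint64_t x;

	assert(bdi_thresh <= thresh);

	x = (bdi_thresh << 16) / (thresh + 1);

	return ((thresh - bdi_thresh + 4 * write_bw) * x) >> 16;
}
```

For thresh = bdi_thresh = 100000 and write_bw = 2500 both forms give a span just under 4 * write_bw = 10000 (the single bdi case). For bdi_thresh = 10000 and write_bw = 250 both give a value just under 9100, i.e. close to bdi_thresh as in the JBOD case. The two can differ by a rounding step because the fixed-point x truncates.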
> > > > > +	x_intercept = setpoint + 2 * span;
> >    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> > ~3*bdi_thresh...
> 
> Right.
> 
> > So maybe you should use bdi_thresh/2 in the computation of span?
> 
> Given that at some configurations bdi_thresh can fluctuate to its own
> size, I guess the current slope of control line is sharp enough.
> 
> Given equations
> 
>         span = (x_intercept - bdi_setpoint) / 2
>         k = df/dx = -0.5 / span
> 
> and the values
> 
>         span = bdi_thresh
>         dx = bdi_thresh
> 
> we get
> 
>         df = - dx / (2 * span) = - 1/2
> 
> That means when bdi_dirty deviates from its setpoint by bdi_thresh,
> pos_ratio and hence the task ratelimit will change by -1/2. This is
> probably already more than users can tolerate?
  OK, let's try that.
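
The arithmetic in the quoted derivation can be checked with a small floating-point model of the main bdi control line. This is a sketch under simplifying assumptions: bdi_scale() is a hypothetical name, and it ignores the auxiliary control line and the "+ 1" in the kernel's divisor.

```c
#include <assert.h>

/*
 * Main bdi control line:
 *
 *	f(bdi_dirty) = (x_intercept - bdi_dirty) /
 *		       (x_intercept - bdi_setpoint)
 *
 * with x_intercept = bdi_setpoint + 2 * span, so the slope is
 * k = -0.5 / span.
 */
static double bdi_scale(double bdi_setpoint, double span, double bdi_dirty)
{
	double x_intercept = bdi_setpoint + 2.0 * span;

	assert(span > 0.0);

	return (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint);
}
```

Taking span = bdi_thresh = 4000 pages and bdi_setpoint = 1000, bdi_scale(1000, 4000, 1000) is 1.0 and bdi_scale(1000, 4000, 5000) is 0.5, so moving bdi_dirty by dx = bdi_thresh changes the scale by -1/2, matching df = -dx / (2 * span).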

> ---
> Subject: writeback: dirty position control
> Date: Wed Mar 02 16:04:18 CST 2011
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulting task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> The slope of the bdi control line should be
> 
> 1) large enough to pull the dirty pages to setpoint reasonably fast
> 
> 2) small enough to avoid big fluctuations in the resulting pos_ratio and
>    hence task ratelimit
> 
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within 1-second worth of data, the bdi control line's slope is
> selected to be a linear function of bdi write bandwidth, so that it can
> adapt to slow/fast storage devices well.
> 
> Assume the bdi control line
> 
> 	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
> 
> where k is the negative slope.
> 
> If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
> are fluctuating in range
> 
> 	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
> 
> we get slope
> 
> 	k = - 1 / (8 * write_bw)
> 
> Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
> 
> 	x_intercept = bdi_setpoint + 8 * write_bw
> 
> The global/bdi slopes are nicely complementing each other when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
> 
> 1) slope of global control line    => scaling to the control scope size
> 2) slope of main bdi control line  => scaling to the write bandwidth
> 
> so that
> 
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
>   pages inside the control scope
> 
> - in large memory systems where the "gravity" of (1) for pulling the
>   dirty pages to setpoint is too weak, (2) can back (1) up and drive
>   dirty pages to bdi_setpoint ~= setpoint reasonably fast.
> 
> Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> is related to memory size due to the interferences between disks.  In
> this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
> 
> peter: use 3rd order polynomial for the global control line
> 
> CC: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
  OK, I like this patch now. You can add
Acked-by: Jan Kara <jack@xxxxxxx>

								Honza

> ---
>  fs/fs-writeback.c         |    2 
>  include/linux/writeback.h |    1 
>  mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 209 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define RATELIMIT_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
>  	return x + 1;	/* Ensure that we never return 0 */
>  }
>  
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> +					   unsigned long bg_thresh)
> +{
> +	return (thresh + bg_thresh) / 2;
> +}
> +
>  static unsigned long hard_dirty_limit(unsigned long thresh)
>  {
>  	return max(thresh, global_dirty_limit);
> @@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages to be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |<-- span --->| .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* dirty pages' target balance point */
> +	unsigned long bdi_setpoint;
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                           setpoint - dirty 3
> +	 *        f(dirty) := 1.0 + (----------------)
> +	 *                           limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that is subject to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx      <= 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * We have computed basic pos_ratio above based on global situation. If
> +	 * the bdi is over/under its share of dirty pages, we want to scale
> +	 * pos_ratio further down/up. That is done by the following policies:
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +	 * for various filesystems, so choose a slope that yields a reasonable
> +	 * 12.5% fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly and choose a slope that
> +	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
> +	 */
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +	 *
> +	 *                        x_intercept - bdi_dirty
> +	 *                     := --------------------------
> +	 *                        x_intercept - bdi_setpoint
> +	 *
> +	 * The main bdi control line is a linear function that is subject to
> +	 *
> +	 * (1) f(bdi_setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:
> +	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +	bdi_setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transition to bdi_thresh in JBOD case.
> +	 *
> +	 *        bdi_thresh                  thresh - bdi_thresh
> +	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
> +	 *          thresh                          thresh
> +	 */
> +	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
> +								(u64)x >> 16;
> +	x_intercept = bdi_setpoint + 2 * span;
> +
> +	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			bdi_setpoint += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> @@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
> @@ -629,6 +828,7 @@ snapshot:
>  
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>  				 unsigned long thresh,
> +				 unsigned long bg_thresh,
>  				 unsigned long dirty,
>  				 unsigned long bdi_thresh,
>  				 unsigned long bdi_dirty,
> @@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
>  	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>  		return;
>  	spin_lock(&bdi->wb.list_lock);
> -	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -			       start_time);
> +	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +			       bdi_thresh, bdi_dirty, start_time);
>  	spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +						      background_thresh))
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> -		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -				     bdi_thresh, bdi_dirty, start_time);
> +		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +				     nr_dirty, bdi_thresh, bdi_dirty,
> +				     start_time);
>  
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
> @@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>  				unsigned long start_time)
>  {
> -	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>  
>  /*
> --- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
> @@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR