On Sat, Aug 06, 2011 at 04:44:49PM +0800, Wu Fengguang wrote:
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
>
> New scheme is,
>
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
>
> For simplicity, only the global/bdi setpoint control lines are
> implemented here, so the [*] curve is more straight than the ideal one
> shown in the above figure.
>
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulting task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
>

IMHO, "position_ratio" is not very intuitive. Can there be a better name?
Based on your slides, it is a scaling factor applied to the task rate
limit, depending on how well we are doing at meeting our dirty limit
goal.

Would "dirty_rate_scale_factor" or something like that make sense and be
a little more intuitive?

Thanks
Vivek

> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> ---
>  mm/page-writeback.c |  143 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 143 insertions(+)
>
> --- linux-next.orig/mm/page-writeback.c	2011-08-06 10:31:32.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-06 11:17:07.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>
> +#define BANDWIDTH_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -495,6 +497,147 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * When the number of dirty pages go higher/lower than the setpoint, the dirty
> + * position ratio (and hence dirty rate limit) will be decreased/increased to
> + * bring the dirty pages back to the setpoint.
> + *
> + *                             setpoint
> + *                                 v
> + * |-------------------------------*-------------------------------|-----------|
> + * ^                               ^                               ^           ^
> + * (thresh + background_thresh)/2  thresh - thresh/DIRTY_SCOPE     thresh  limit
> + *
> + *                           bdi setpoint
> + *                                 v
> + * |-------------------------------*-------------------------------------------|
> + * ^                               ^                                           ^
> + * 0                               bdi_thresh - bdi_thresh/DIRTY_SCOPE     limit
> + *
> + * (o) pseudo code
> + *
> + *     pos_ratio = 1 << BANDWIDTH_CALC_SHIFT
> + *
> + *     if (dirty < thresh) scale up   pos_ratio
> + *     if (dirty > thresh) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_thresh) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_thresh) scale down pos_ratio
> + *
> + * (o) global/bdi control lines
> + *
> + * Based on the number of dirty pages (the X), pos_ratio (the Y) is scaled by
> + * several control lines in turn.
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * If any control line drops below Y=0 before reaching @limit, an auxiliary
> + * line will be setup to connect them. The below figure illustrates the main
> + * bdi control line with an auxiliary line extending it to @limit.
> + *
> + * This allows smoothly throttling bdi_dirty down to normal if it starts high
> + * in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to 5 times higher than bdi setpoint.
> + * - the bdi dirty thresh goes down quickly due to change of JBOD workload
> + *
> + *  o
> + *    o
> + *      o                                        [o] main control line
> + *        o                                      [*] auxiliary control line
> + *          o
> + *            o
> + *              o
> + *                o
> + *                  o
> + *                    o
> + *                      o--------------------- balance point, bw scale = 1
> + *                      | o
> + *                      |   o
> + *                      |     o
> + *                      |       o
> + *                      |         o
> + *                      |           o
> + *                      |             o------- connect point, bw scale = 1/2
> + *                      |               .*
> + *                      |                 .   *
> + *                      |                   .      *
> + *                      |                     .         *
> + *                      |                       .            *
> + *                      |                         .               *
> + *                      |                           .                  *
> + * [--------------------+-----------------------------.--------------------*]
> + * 0              bdi setpoint                    bdi origin            limit
> + *
> + * The bdi control line: if (origin < limit), an auxiliary control line (*)
> + * will be setup to extend the main control line (o) to @limit.
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long origin;
> +	unsigned long goal;
> +	unsigned long long span;
> +	unsigned long long pos_ratio;	/* for scaling up/down the rate limit */
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 */
> +	goal = thresh - thresh / DIRTY_SCOPE;
> +	origin = 4 * thresh;
> +
> +	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> +		origin = limit;			/* auxiliary control line */
> +		goal = (goal + origin) / 2;
> +		pos_ratio >>= 1;
> +	}
> +	pos_ratio = origin - dirty;
> +	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> +	do_div(pos_ratio, origin - goal + 1);
> +
> +	/*
> +	 * bdi setpoint
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
> +	/*
> +	 * Use span=(4*bw) in single disk case and transit to bdi_thresh in
> +	 * JBOD case. For JBOD, bdi_thresh could fluctuate up to its own size.
> +	 * Otherwise the bdi write bandwidth is good for limiting the floating
> +	 * area, which makes the bdi control line a good backup when the global
> +	 * control line is too flat/weak in large memory systems.
> +	 */
> +	span = (u64) bdi_thresh * (thresh - bdi_thresh) +
> +		(4 * bdi->avg_write_bandwidth) * bdi_thresh;
> +	do_div(span, thresh + 1);
> +	origin = goal + 2 * span;
> +
> +	if (unlikely(bdi_dirty > goal + span)) {
> +		if (bdi_dirty > limit)
> +			return 0;
> +		if (origin < limit) {
> +			origin = limit;			/* auxiliary control line */
> +			goal += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= origin - bdi_dirty;
> +	do_div(pos_ratio, origin - goal + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)