Andrew, On Wed, Jun 22, 2011 at 08:20:43AM +0800, Andrew Morton wrote: > On Sun, 19 Jun 2011 23:01:12 +0800 > Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote: > > > The max-pause limit helps to keep the sleep time inside > > balance_dirty_pages() within 200ms. The 200ms max sleep means per task > > rate limit of 8pages/200ms=160KB/s, which normally is enough to stop > > dirtiers from continue pushing the dirty pages high, unless there are > > a sufficient large number of slow dirtiers (ie. 500 tasks doing 160KB/s > > will still sum up to 80MB/s, reaching the write bandwidth of a slow disk). > > > > The pass-good limit helps to let go of the good bdi's in the presence of > > a blocked bdi (ie. NFS server not responding) or slow USB disk which for > > some reason build up a large number of initial dirty pages that refuse > > to go away anytime soon. > > The hard-wired numbers and hard-wired assumptions about device speeds > shouldn't be here at all. They will be sub-optimal (and sometimes > extremely so) for all cases. They will become wronger over time. Or > less wrong, depending upon which way they were originally wrong. Yeah the changelog contains many specific numbers as examples.. The 200ms is mainly based on the maximum acceptable human perceptible delays. It may be too large in some cases, so the value will be decreased adaptively in a patch to be submitted later. The DIRTY_MAXPAUSE and DIRTY_PASSGOOD are to get proportional values to dirty_thresh, so is reasonably taken as adaptive. > > + dirty_thresh = hard_dirty_limit(dirty_thresh); > > + if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_MAXPAUSE && > > + jiffies - start_time > MAX_PAUSE) > > + break; > > + if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_PASSGOOD && > > + bdi_dirty < bdi_thresh) > > + break; > > It appears that despite their similarity, DIRTY_MAXPAUSE is a > dimensionless value whereas the units of MAX_PAUSE is jiffies. Perhaps > more care in naming choices would clarify things like this. OK. I'll add _AREA suffixes to the DIRTY_MAXPAUSE macros to tell the difference. > The first comparison might be clearer if it used time_after(). Yeah, changed to time_after(). > Both statements need comments explaining what they do and *why they Here is the patch with more comments :) Thanks, Fengguang --- Subject: writeback: introduce max-pause and pass-good dirty limits Date: Sun Jun 19 22:18:42 CST 2011 The max-pause limit helps to keep the sleep time inside balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means per task rate limit of 8pages/200ms=160KB/s when dirty exceeded, which normally is enough to stop dirtiers from continue pushing the dirty pages high, unless there are a sufficient large number of slow dirtiers (eg. 500 tasks doing 160KB/s will still sum up to 80MB/s, exceeding the write bandwidth of a slow disk and hence accumulating more and more dirty pages). The pass-good limit helps to let go of the good bdi's in the presence of a blocked bdi (ie. NFS server not responding) or slow USB disk which for some reason build up a large number of initial dirty pages that refuse to go away anytime soon. For example, given two bdi's A and B and the initial state bdi_thresh_A = dirty_thresh / 2 bdi_thresh_B = dirty_thresh / 2 bdi_dirty_A = dirty_thresh / 2 bdi_dirty_B = dirty_thresh / 2 Then A get blocked, after a dozen seconds bdi_thresh_A = 0 bdi_thresh_B = dirty_thresh bdi_dirty_A = dirty_thresh / 2 bdi_dirty_B = dirty_thresh / 2 The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages will be effectively throttled by condition (nr_dirty < dirty_thresh). This has two problems: (1) we lose the protections for light dirtiers (2) balance_dirty_pages() effectively becomes IO-less because the (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good for IO, but balance_dirty_pages() loses an important way to break out of the loop which leads to more spread out throttle delays. DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is, DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the above example while this patch uses the more conservative value 8 so as not to surprise people with too many dirty pages than expected. Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx> --- include/linux/writeback.h | 21 +++++++++++++++++++++ mm/page-writeback.c | 28 ++++++++++++++++++++++++++++ 2 files changed, 49 insertions(+) --- linux-next.orig/include/linux/writeback.h 2011-06-23 10:31:40.000000000 +0800 +++ linux-next/include/linux/writeback.h 2011-06-23 10:31:40.000000000 +0800 @@ -7,6 +7,27 @@ #include <linux/sched.h> #include <linux/fs.h> +/* + * The 1/16 region above the global dirty limit will be put to maximum pauses: + * + * (limit, limit + limit/DIRTY_MAXPAUSE_AREA) + * + * The 1/16 region above the max-pause region, dirty exceeded bdi's will be put + * to loops: + * + * (limit + limit/DIRTY_MAXPAUSE_AREA, limit + limit/DIRTY_PASSGOOD_AREA) + * + * Further beyond, all dirtier tasks will enter a loop waiting (possibly long + * time) for the dirty pages to drop, unless written enough pages. + * + * The global dirty threshold is normally equal to the global dirty limit, + * except when the system suddenly allocates a lot of anonymous memory and + * knocks down the global dirty threshold quickly, in which case the global + * dirty limit will follow down slowly to prevent livelocking all dirtier tasks. + */ +#define DIRTY_MAXPAUSE_AREA 16 +#define DIRTY_PASSGOOD_AREA 8 + struct backing_dev_info; /* --- linux-next.orig/mm/page-writeback.c 2011-06-23 10:31:40.000000000 +0800 +++ linux-next/mm/page-writeback.c 2011-06-23 10:59:47.000000000 +0800 @@ -399,6 +399,11 @@ unsigned long determine_dirtyable_memory return x + 1; /* Ensure that we never return 0 */ } +static unsigned long hard_dirty_limit(unsigned long thresh) +{ + return max(thresh, global_dirty_limit); +} + /* * global_dirty_limits - background-writeback and dirty-throttling thresholds * @@ -716,6 +721,29 @@ static void balance_dirty_pages(struct a io_schedule_timeout(pause); trace_balance_dirty_wait(bdi); + dirty_thresh = hard_dirty_limit(dirty_thresh); + /* + * max-pause area. If dirty exceeded but still within this + * area, no need to sleep for more than 200ms: (a) 8 pages per + * 200ms is typically more than enough to curb heavy dirtiers; + * (b) the pause time limit makes the dirtiers more responsive. + */ + if (nr_dirty < dirty_thresh + + dirty_thresh / DIRTY_MAXPAUSE_AREA && + time_after(jiffies, start_time + MAX_PAUSE)) + break; + /* + * pass-good area. When some bdi gets blocked (eg. NFS server + * not responding), or write bandwidth dropped dramatically due + * to concurrent reads, or dirty threshold suddenly dropped and + * the dirty pages cannot be brought down anytime soon (eg. on + * slow USB stick), at least let go of the good bdi's. + */ + if (nr_dirty < dirty_thresh + + dirty_thresh / DIRTY_PASSGOOD_AREA && + bdi_dirty < bdi_thresh) + break; + /* * Increase the delay for each loop, up to our previous * default of taking a 100ms nap. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html