On Wed, Sep 28, 2011 at 10:50:35PM +0800, Peter Zijlstra wrote:
> On Wed, 2011-09-28 at 22:02 +0800, Wu Fengguang wrote:
>
> /me attempts to swap back neurons related to writeback
>
> > After lots of experiments, I end up with this bdi reserve point
> >
> > +	x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;
> >
> > together with this chunk to avoid a bdi stuck in bdi_thresh=0 state:
> >
> > @@ -590,6 +590,7 @@ static unsigned long bdi_position_ratio(
> >  	 */
> >  	if (unlikely(bdi_thresh > thresh))
> >  		bdi_thresh = thresh;
> > +	bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
> >  	/*
> >  	 * scale global setpoint to bdi's:
> >  	 *	bdi_setpoint = setpoint * bdi_thresh / thresh
>
> So you cap bdi_thresh at a minimum of (limit-dirty)/8 which can be
> pretty close to 0 if we have a spike in dirty or a negative spike in
> writeout bandwidth (sudden seeks or whatnot).

That's right. However, to bring bdi_thresh out of the close-to-zero
state, it's only required that (limit-dirty)/8 is reasonably large for
the _majority_ of the time, which is not a problem for the servers
unless something goes wrong.

> > The above changes are good enough to keep reasonable amount of bdi
> > dirty pages, so the bdi underrun flag ("[PATCH 11/18] block: add bdi
> > flag to indicate risk of io queue underrun") is dropped.
>
> That sounds like goodness ;-)

Yeah!

> > I also tried various bdi freerun patches, however the results are not
> > satisfactory. Basically the bdi reserve area approach (this patch)
> > yields noticeably more smooth/resilient behavior than the
> > freerun/underrun approaches. I noticed that the bdi underrun flag
> > could lead to sudden surge of dirty pages (especially if not
> > safeguarded by the dirty_exceeded condition) in the very small
> > window..
>
> OK, so let me try and parse this magic:
>
> +	x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;
> +	if (bdi_dirty < x_intercept) {
> +		if (bdi_dirty > x_intercept / 8) {
> +			pos_ratio *= x_intercept;
> +			do_div(pos_ratio, bdi_dirty);
> +		} else
> +			pos_ratio *= 8;
> +	}
>
> So we set our target some place north of MIN_WRITEBACK_PAGES: if we're
> short we add a factor of: x_intercept/bdi_dirty.
>
> Now, since bdi_dirty < x_intercept, this is > 1 and thus we promote more
> dirties.

That's right.

> Additionally we don't let the factor get larger than 8 to avoid silly
> large fluctuations (8 already seems quite generous to me).

I actually increased 8 to 128 and still think it is safe: for the
promotion ratio to be 128, bdi_dirty should be around bdi_thresh/2/128
(or 0.4% of bdi_thresh). However large the promotion ratio is, it won't
be more radical than some bdi freerun threshold.

In the tests, what the bdi reserve area protects is mainly small memory
systems (small dirty threshold compared to writeout bandwidth), where
an IO completion could bring down bdi_dirty considerably (relatively)
and we really need to ramp it up fast at that point to feed the disk.

> Now I guess the only problem is when nr_bdi * MIN_WRITEBACK_PAGES ~
> limit, at which point things go pear shaped.

Yes. In that case the global @dirty will always be driven up to @limit.
Once @dirty drops reasonably below it, whichever bdi task wakes up
first will take the chance to fill the gap, which is not fair for
bdi's of different speeds.

Let me retry the thresh=1M,10M test cases without MIN_WRITEBACK_PAGES.
Hopefully the removal of it won't impact performance a lot.

Thanks,
Fengguang
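
For anyone who wants to play with the numbers, here is a rough
userspace sketch of the reserve-area boost discussed above. It is not
the kernel code: the function name bdi_reserve_boost(), the max_boost
parameter and the MIN_WRITEBACK_PAGES value are assumptions made for
illustration, and plain 64-bit division stands in for do_div().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for illustration only; the real constant lives in the kernel. */
#define MIN_WRITEBACK_PAGES 1024UL

/*
 * Hypothetical userspace model of the bdi reserve area.
 *
 * Below x_intercept, pos_ratio is scaled by x_intercept / bdi_dirty,
 * and the scale factor is capped at max_boost (8 in the quoted patch,
 * 128 in the follow-up experiment).
 */
static uint64_t bdi_reserve_boost(uint64_t pos_ratio,
				  unsigned long bdi_thresh,
				  unsigned long bdi_dirty,
				  unsigned long max_boost)
{
	unsigned long x_intercept = bdi_thresh / 2 + MIN_WRITEBACK_PAGES;

	assert(max_boost > 0);

	if (bdi_dirty < x_intercept) {
		if (bdi_dirty > x_intercept / max_boost) {
			/* bdi_dirty >= 1 here, so the division is safe */
			pos_ratio = pos_ratio * x_intercept / bdi_dirty;
		} else {
			/* too few dirty pages: apply the capped boost */
			pos_ratio *= max_boost;
		}
	}
	return pos_ratio;
}
```

With bdi_thresh = 2048 pages and the assumed MIN_WRITEBACK_PAGES,
x_intercept is 2048. A bdi holding 512 dirty pages then gets a factor
of 4, and with max_boost = 128 the cap is only reached once bdi_dirty
falls to x_intercept/128 = 16 pages or fewer.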