Peter,

Sorry for the delay.

On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:

To start with,

                                                write_bw
        ref_bw = task_ratelimit_in_past_200ms * --------
                                                dirty_bw

where

        task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio

> > Now all of the above would seem to suggest:
> >
> >   dirty_ratelimit := ref_bw

Right, ideally ref_bw is the balanced dirty ratelimit. I actually
started with exactly the above equation when I got choked by pure
pos_bw based feedback control (as mentioned in the reply to Jan's
email) and introduced the ref_bw estimation as the way out.

But there are some imperfections in ref_bw, too, which make it
unsuitable for direct use:

1) large fluctuations

The dirty_bw used for computing ref_bw is merely averaged over the past
200ms (very short compared to the 3s estimation period of write_bw),
which gives ref_bw a rather dispersed distribution.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/8G/ext4-10dd-4k-32p-6802M-20:10-3.0.0-next-20110802+-2011-08-06.16:48/balance_dirty_pages-pages.png

Take a look at the blue [*] points in the above graph. I find it pretty
hard to average out the singular points by increasing the estimation
period. Considering that the averaging technique would introduce very
undesirable time lags, I gave up on it entirely. (btw, the write_bw
averaging time lag is much more acceptable because its impact is
one-way and therefore won't lead to oscillations.)

The one practical way is filtering -- the largest singular ref_bw
points can be filtered out effectively by remembering some prev_ref_bw
and prev_prev_ref_bw. However it cannot do away with all of them, and
the remaining majority of ref_bw points are still randomly dancing
around the ideal balanced rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_bw) pair
becomes mismatched, which leads to large systematic errors in ref_bw.
The truncates, due to their possibly bumpy nature, can hardly be
compensated for smoothly. So let's face it. When some over-estimated
ref_bw brings ->dirty_ratelimit too high, the dirty pages rise above
the setpoint and pos_bw will in turn become lower than
->dirty_ratelimit. So if we consider both ref_bw and pos_bw and update
->dirty_ratelimit only when they are on the same side of
->dirty_ratelimit, the systematic errors in ref_bw won't be able to
bring ->dirty_ratelimit too far away.

The ref_bw estimation is also inaccurate near the max pause and free
run areas.

3) since we ultimately want to

- keep the dirty pages around the setpoint for as long as possible
- keep the fluctuations of task ratelimit as small as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
point in bringing up dirty_ratelimit in a hurry and hurting both of the
above goals.

> > However for that you use:
> >
> >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> >           dirty_ratelimit = max(ref_bw, pos_bw);
> >
> >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> >           dirty_ratelimit = min(ref_bw, pos_bw);

The above are merely constraints on the dirty_ratelimit update. They
serve to

1) stop adjusting the rate when the adjustment goes against the
   position control target (the adjusted rate would slow down the
   progress of dirty pages going back to the setpoint).

2) limit the step size. pos_bw changes its value step by step, leaving
   a consistent trace compared to the randomly jumping ref_bw. pos_bw
   also has smaller errors in the stable state and normally has larger
   errors when there are big errors in the rate. So it's a pretty good
   limiting factor for the step size of dirty_ratelimit.

> > You have:
> >
> >   pos_bw = dirty_ratelimit * pos_ratio
> >
> > Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> > why are you ignoring the shift in output vs input rate there?

Again, you need to understand pos_bw the other way. Only
(pos_bw - dirty_ratelimit) matters here, which is exactly the position
error.

> Could you elaborate on this primary feedback loop? Its the one part I
> don't feel I actually understand well.

Hope the above elaboration helps :)

Thanks,
Fengguang
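
For reference, the ref_bw estimate and the update constraints discussed
above can be sketched in plain user-space C. This is only an
illustration, not the kernel code: the function names
estimate_ref_bw() and update_dirty_ratelimit(), the use of bare
unsigned long values for the rates, and the dirty_bw == 0 guard are all
assumptions made for the example. The prev_ref_bw filtering is not
modelled.

```c
/* Hypothetical helpers; the kernel has its own min()/max() macros. */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * ref_bw = task_ratelimit * write_bw / dirty_bw
 *
 * task_ratelimit is the rate the tasks were throttled at during the
 * last estimation period (~= dirty_ratelimit * pos_ratio).
 * Assumption: when no dirtying was observed (dirty_bw == 0) the
 * estimate is meaningless, so the input rate is returned unchanged.
 */
static unsigned long estimate_ref_bw(unsigned long task_ratelimit,
				     unsigned long write_bw,
				     unsigned long dirty_bw)
{
	if (dirty_bw == 0)
		return task_ratelimit;

	/* Widen before multiplying to avoid overflow on 32-bit longs. */
	return (unsigned long)((unsigned long long)task_ratelimit *
			       write_bw / dirty_bw);
}

/*
 * Move dirty_ratelimit only when pos_bw and ref_bw are on the same
 * side of it, and never further than the nearer of the two. When they
 * disagree (or either equals the current rate), keep the old value.
 */
static unsigned long update_dirty_ratelimit(unsigned long dirty_ratelimit,
					    unsigned long pos_bw,
					    unsigned long ref_bw)
{
	if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
		return max_ul(ref_bw, pos_bw);

	if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
		return min_ul(ref_bw, pos_bw);

	return dirty_ratelimit;
}
```

With dirty_ratelimit = 100, the pair (pos_bw = 90, ref_bw = 80) moves
the rate down to 90 only, while (pos_bw = 90, ref_bw = 150) leaves it
at 100: an over-estimated ref_bw alone cannot drag the rate up while
the position error says the dirty pages are already high.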