On Wed, Feb 23, 2011 at 11:13:22PM +0800, Wu Fengguang wrote:
>
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6
>
> As you can see from the graphs, the write bandwidth, the dirty
> throttle bandwidths and the number of dirty pages are all fluctuating.
> Fluctuations are regular for as simple as dd workloads.
>
> The current threshold based balance_dirty_pages() has the effect of
> keeping the number of dirty pages close to the dirty threshold at most
> time, at the cost of directly passing the underneath fluctuations to
> the application. As a result, the dirtier tasks are swinging from
> "dirty as fast as possible" and "full stop" states. The pause time
> in current balance_dirty_pages() are measured to be random numbers
> between 0 and hundreds of milliseconds for local ext4 filesystem and
> more for NFS.
>
> Obviously end users are much more sensitive to the fluctuating
> latencies than the fluctuation of dirty pages. It makes much sense to
> expand the current on/off dirty threshold to some kind of dirty range
> control, absorbing the fluctuation of dirty throttle latencies by
> allowing the dirty pages to raise or drop within an acceptable range
> as the underlying IO completion rate fluctuates up or down.
>
> The proposed scheme is to allow the dirty pages to float within range
> (thresh - thresh/4, thresh), targeting the average pages at near
> (thresh - thresh/8).
>
> I observed that if keeping the dirty rate fixed at the theoretic
> average bdi write bandwidth, the fluctuation of dirty pages are
> bounded by (bdi write bandwidth * 1 second) for all major local
> filesystems and simple dd workloads. So if the machine has adequately
> large memory, it's in theory able to achieve flat write() progress.
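
To make the proposed range concrete, here is a minimal sketch in C of how
the bounds could be derived from thresh. The struct and function names
are made up for illustration and are not taken from the patch set. The
"is the range wide enough" check additionally assumes that the
(bdi write bandwidth * 1 second) bound is a peak-to-peak figure counted
in pages, which is only one possible reading of the observation above.

```c
#include <stdbool.h>

/* Hypothetical helper type for illustration; not from the patches. */
struct dirty_range {
	unsigned long low;	/* thresh - thresh/4: bottom of the float range */
	unsigned long target;	/* thresh - thresh/8: where the average should sit */
	unsigned long high;	/* thresh: top of the float range */
};

/* Derive the float range and its target from the dirty threshold. */
struct dirty_range dirty_range_from_thresh(unsigned long thresh)
{
	struct dirty_range r = {
		.low	= thresh - thresh / 4,
		.target	= thresh - thresh / 8,
		.high	= thresh,
	};
	return r;
}

/*
 * Assumption: dirty pages fluctuate by at most one second worth of bdi
 * write bandwidth (in pages), peak to peak. Under that reading the
 * range can absorb the fluctuation when its width, thresh/4, is at
 * least that large.
 */
bool range_absorbs_fluctuation(unsigned long thresh,
			       unsigned long bdi_write_bw_pages_per_sec)
{
	return thresh / 4 >= bdi_write_bw_pages_per_sec;
}
```

For example, with thresh = 800 pages the dirty pages would float within
(600, 800) around a target of 700.
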
>
> I'm not able to get the perfect smoothness, however in some cases it's
> close:
>
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/btrfs-4dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-14-35/balance_dirty_pages-bandwidth.png
>
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/xfs-4dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-11-17/balance_dirty_pages-bandwidth.png
>
> In the bandwidth graph:
>
>         write bandwidth - disk write bandwidth
>           avg bandwidth - smoothed "write bandwidth"
>          task bandwidth - task throttle bandwidth, the rate a dd task is allowed to dirty pages
>          base bandwidth - base throttle bandwidth, a per-bdi base value for computing task throttle bandwidth
>
> The "task throttle bandwidth" is what will directly impact individual dirtier
> tasks. It's calculated from
>
> (1) the base throttle bandwidth
>
> (2) the level of dirty pages
>     - if the number of dirty pages is equal to the control target
>       (thresh - thresh / 8), then just use the base bandwidth
>     - otherwise use higher/lower bandwidth to drive the dirty pages
>       towards the target
>     - ...omitting more rules in dirty_throttle_bandwidth()...
>
> (3) the task's dirty weight
>     a light dirtier has smaller weight and will be honored quadratic

Sorry it's not "quadratic", but sqrt().

>     larger throttle bandwidth
>
> The base throttle bandwidth should be equal to average bdi write
> bandwidth when there is one dd, and scaled down by 1/(N*sqrt(N)) when
> there are N dd writing to 1 bdi in the system. In a realistic file
> server, there will be N tasks at _different_ dirty rates, in which
> case it's virtually impossible to track and calculate the right value.
>
> So the base throttle bandwidth is by far the most important and
> hardest part to control.
> It's required to
>
> - quickly adapt to the right value, otherwise the dirty pages will be
>   hitting the top or bottom boundaries;
>
> - and stay rock stable there for a stable workload, as its fluctuation
>   will directly impact all tasks writing to that bdi
>
> Looking at the graphs, I'm pleased to say the above requirements are
> met in not only the memory bounty cases, but also the much harder low
> memory and JBOD cases. It's achieved by the rigid update policies in
> bdi_update_throttle_bandwidth().  [to be continued tomorrow]

The bdi base throttle bandwidth is updated based on three classes of
parameters.

(1) level of dirty pages

We try to avoid updating the base bandwidth whenever possible. The
main update criteria are based on the level of dirty pages. When

- the dirty pages are near the upper or lower bound of the control
  scope, or
- the dirty pages are departing from the global/bdi dirty goals

it's time to update the base bandwidth.

Because the dirty pages fluctuate steadily, we try to avoid
disturbing the base bandwidth when the smoothed number of dirty pages
is within (write bandwidth / 8) distance of the goal, based on the
fact that fluctuations are typically bounded by the write bandwidth.

(2) the position bandwidth

The position bandwidth is equal to the base bandwidth if the number of
dirty pages is equal to the dirty goal, and is scaled down/up when the
dirty pages rise above or drop below the goal.

When it's decided to update the base bandwidth, the delta between the
base bandwidth and the position bandwidth is calculated. The delta is
scaled down by at least 8 times, and the smaller the delta, the more
it is shrunk. It's then added to the base bandwidth. In this way, the
base bandwidth adapts to the position bandwidth fast when there is a
large gap, and remains stable when the gap is small enough.

The delta is scaled down considerably because the position bandwidth
is not very reliable.
It fluctuates sharply when the dirty pages hit the upper/lower
limits. And it takes time for the dirty pages to return to the goal
even when the base bandwidth has been adjusted to the right value. So
if the base bandwidth tracked the position bandwidth closely, it could
overshoot.

(3) the reference bandwidth

It's the theoretic base bandwidth! I take the time to calculate it as
a reference value for the base bandwidth, to eliminate the
fast-convergence vs. steady-state-stability dilemma in pure position
based control. It would be optimal control if used directly; however,
the reference bandwidth is not directly used as the base bandwidth
because the numbers for calculating it are all fluctuating, and it's
not acceptable for the base bandwidth to fluctuate in the plateau
state. So the roughly-accurate calculated value is now used as a very
useful double limit when updating the base bandwidth.

Now you should be able to understand the information rich
balance_dirty_pages-pages.png graph. Here are two nice ones:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/btrfs-16dd-1M-8p-3927M-60%-2.6.38-rc6-dt6+-2011-02-24-23-14/balance_dirty_pages-pages.png

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/10HDD-JBOD-6G-6%25/xfs-1dd-1M-16p-5904M-6%25-2.6.38-rc5-dt6+-2011-02-21-20-00/balance_dirty_pages-pages.png

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
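
For readers who prefer code to prose, the three update rules (1)-(3)
above can be sketched roughly as follows. This is an illustrative
sketch only, not the code in bdi_update_throttle_bandwidth(): the
struct, its field names and next_base_bw() are hypothetical. Two
details are pure assumptions: the "double limit" is read as "only move
the base bandwidth when the position and reference bandwidths are both
on the same side of it, and aim at the nearer of the two", and the
extra halving of small deltas stands in for the unspecified
"the smaller the delta, the more it is shrunk" rule.

```c
#include <stdlib.h>

/* Hypothetical snapshot of the inputs; bandwidths in pages per second. */
struct bw_state {
	long base_bw;	/* current base throttle bandwidth */
	long pos_bw;	/* position bandwidth */
	long ref_bw;	/* reference (theoretic) bandwidth */
	long write_bw;	/* bdi write bandwidth */
	long dirty;	/* smoothed number of dirty pages */
	long goal;	/* dirty goal in pages */
};

/* Return the base bandwidth to use after one update step. */
long next_base_bw(const struct bw_state *s)
{
	long delta, step;

	/* (1) Dirty pages within write_bw/8 of the goal: do not disturb. */
	if (labs(s->dirty - s->goal) <= s->write_bw / 8)
		return s->base_bw;

	/*
	 * (3) Assumed reading of the "double limit": move only if both
	 * the position and the reference bandwidth agree on the
	 * direction, and aim at whichever of them is nearer.
	 */
	if (s->pos_bw > s->base_bw && s->ref_bw > s->base_bw)
		delta = (s->pos_bw < s->ref_bw ? s->pos_bw : s->ref_bw)
			- s->base_bw;
	else if (s->pos_bw < s->base_bw && s->ref_bw < s->base_bw)
		delta = (s->pos_bw > s->ref_bw ? s->pos_bw : s->ref_bw)
			- s->base_bw;
	else
		return s->base_bw;

	/* (2) Scale the delta down by at least 8 times ... */
	step = delta / 8;
	/* ... and damp small deltas further (assumed: halve once). */
	if (labs(delta) < s->base_bw / 8)
		step /= 2;

	return s->base_bw + step;
}
```

With base_bw = 1000, pos_bw = 1800, ref_bw = 1400 and the dirty pages
well below the goal, the step is (1400 - 1000) / 8 = 50, so the base
bandwidth moves to 1050 rather than jumping to either estimate.
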