Andrew, I'm glad to release this extensively tested v3 IO-less dirty throttling patchset. It's based on 2.6.37-rc5 and Jan's sync livelock patches. Given its trickiness and the possibility of side effects, independent tests are highly welcome.

Here is the git tree for easy access:

	git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v3

Andrew, I followed your suggestion to add some trace points, and went further to write scripts to do automated tests and to visualize the collected trace, iostat and vmstat data. The help has been tremendous. The tests and data analyses paved the way for many fixes and algorithm improvements. It still took a long time. The most challenging tasks were the fluctuations with 100+ dd's and on NFS, and various imperfections in the control system and in many filesystems. I'd say I wouldn't have been able to go this far without the help of the pretty graphs. And I believe they'll continue to make future maintenance easy: to identify problems reported by end users, just ask for the traces; I'll then turn them into graphs and quickly get an overview of the problem.

The most up-to-date graphs and the corresponding scripts are uploaded to

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests

Here you may find and compare test results for this patchset (2.6.37-rc5+) and for the vanilla kernel (2.6.37-rc5). Filesystem developers may be interested in taking a look at the dynamics.

The control algorithms are generally doing well in the recent graphs. There are regular fluctuations in the number of dirty pages, however they mostly originate from underneath: the low level is reporting IO completion in units of 1MB, 32MB or even more, leading to sudden drops of the dirty pages.
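For illustration, here is a minimal sketch (not the actual scripts from the tarball) of how periodic /proc/vmstat samples can be turned into the (time, nr_dirty) series behind the vmstat-dirty.png graphs. The "timestamp:" sample header is an assumed log format produced by a hypothetical sampling loop, not something the kernel emits.

```python
def parse_vmstat_log(lines):
    """Yield (seconds, nr_dirty_pages) from interleaved vmstat samples.

    Assumed input: each sample starts with a "timestamp: <secs>" line
    (written by the sampling script), followed by raw /proc/vmstat
    "name value" lines.
    """
    t = None
    for line in lines:
        key, _, value = line.strip().partition(" ")
        if key == "timestamp:":              # assumed per-sample header
            t = float(value)
        elif key == "nr_dirty" and t is not None:
            yield (t, int(value))

sample = [
    "timestamp: 0.0", "nr_dirty 51200", "nr_writeback 12800",
    "timestamp: 1.0", "nr_dirty 43008", "nr_writeback 20992",
]
series = list(parse_vmstat_log(sample))
# series is now ready to feed into a plotting tool such as gnuplot.
```

The same loop can be duplicated for nr_writeback or nr_unstable to overlay multiple curves on one graph.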
The tests cover the common scenarios:

- ext2, ext3, ext4, xfs, btrfs, nfs
- 256M, 512M, 3G, 16G memory sizes
- single disk and 12-disk array
- 1, 2, 10, 100, 1000 concurrent dd's

They disclose lots of imperfections and bugs of

1) this patchset
2) filesystems not working well with the new paradigm
3) filesystem problems that also exist in the vanilla kernel

I managed to fix case (1) and most of (2), and to report (3). Below are some interesting graphs illustrating the problems.

BTRFS

case (3) problem, nr_dirty going all the way down to 0, fixed by
[PATCH 38/47] btrfs: wait on too many nr_async_bios

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1K-8p-2953M-2.6.37-rc3+-2010-11-30-17/vmstat-dirty.png
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-21-23/vmstat-dirty.png
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-30/vmstat-dirty-300.png

after fix

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-14/vmstat-dirty-300.png

case (3) problem, not good looking but otherwise harmless, not fixed yet

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1K-8p-2953M-2.6.37-rc3+-2010-11-30-14/vmstat-written.png

The root cause is that btrfs always clears the page dirty flag at the end of prepare_pages() and then sets it dirty again in dirty_and_release_pages(). This leads to duplicate dirty accounting on 1KB-size writes.
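The duplicate accounting can be shown with a toy model: dirtied counters are only bumped on a clean->dirty transition, so clearing the dirty bit between sub-page writes and re-setting it charges the same page once per write. Four 1KB writes to one 4KB page should be accounted as 1 dirtied page, but end up accounted 4 times. This is an illustrative sketch, not btrfs code.

```python
def account_writes(nr_writes, clear_between_writes):
    """Toy model of page-dirty accounting for sub-page writes to one page."""
    dirty = False
    dirtied = 0
    for _ in range(nr_writes):
        if clear_between_writes:
            dirty = False        # models prepare_pages() clearing the bit
        if not dirty:            # models set_page_dirty(): only a clean->dirty
            dirty = True         # transition bumps the dirtied counter
            dirtied += 1
    return dirtied

# four 1KB writes filling a 4KB page:
assert account_writes(4, clear_between_writes=False) == 1   # correct accounting
assert account_writes(4, clear_between_writes=True) == 4    # 4x over-accounting
```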
case (3) problem, bdi limit exceeded on 10+ concurrent dd's, fixed by
[PATCH 37/47] btrfs: lower the dirty balacing rate limit

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-02-20/vmstat-dirty.png
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-02-20/dirty-pages.png

case (2) problem, not root caused yet; in the vanilla kernel, the dirty/writeback pages are interesting

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5-2010-12-10-14-37/vmstat-dirty.png

but performance is still excellent

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5-2010-12-10-14-37/iostat-bw.png

with IO-less balance_dirty_pages(), it's much slower

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-03-54/iostat-bw.png

dirty pages go very low

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-03-54/vmstat-dirty.png

with only 20% disk utilization

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-03-54/iostat-util.png

EXT4

case (3) problem, maybe a memory leak, not root caused yet

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/ext4-100dd-1M-24p-15976M-2.6.37-rc5+-2010-12-09-23-40/dirty-pages.png

case (3) problem, burst of redirtying, a known issue with data=ordered that would be non-trivial to fix

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages-3000.png

the workaround for now is to mount with data=writeback
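Why the bdi limit gets exceeded on 10+ concurrent dd's can be seen with back-of-the-envelope arithmetic: each task only calls balance_dirty_pages() once every `ratelimit` dirtied pages, so with T dirtiers the threshold can be overshot by up to T * ratelimit pages between checks. The sketch below uses illustrative numbers, not the values from the patch.

```python
PAGE_SIZE = 4096  # bytes, typical x86 page size

def worst_case_overshoot_mb(nr_tasks, ratelimit_pages):
    """Upper bound on dirty-threshold overshoot from per-task rate limiting.

    Each of nr_tasks dirtiers may dirty up to ratelimit_pages pages
    before its next balance_dirty_pages() check notices the threshold.
    """
    return nr_tasks * ratelimit_pages * PAGE_SIZE / (1 << 20)

# 100 dd's checking only every 1024 pages can overshoot by 400MB...
high = worst_case_overshoot_mb(100, 1024)
# ...while a much lower per-call rate limit bounds the overshoot tightly,
# at the cost of more frequent checks.
low = worst_case_overshoot_mb(100, 8)
```

This is why lowering the dirty balancing rate limit keeps 10+ concurrent dirtiers inside the bdi limit.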
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4_wb-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-12-13-40/dirty-pages.png

NFS

There are some hard problems:

- large fluctuations of everything
- writeback/unstable pages squeezing dirty pages
- sometimes it may stall the dirtiers for 1-2 seconds because no COMMITs return during that time; hard to fix on the client side

before the patches

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-11-10-31/vmstat-dirty.png
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-12-40/vmstat-dirty.png
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-4K-8p-2953M-2.6.37-rc3+-2010-11-29-10/vmstat-dirty.png
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-4K-8p-2953M-2.6.37-rc3+-2010-11-29-10/dirty-bandwidth.png

after the patches

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/vmstat-dirty.png
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/dirty-bandwidth-3000.png
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/vmstat-dirty.png

burst of commit submits/returns

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/nfs-commit-1000.png

after fix

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/nfs-commit-300.png

The 1-second stalls happen at around 317s and 321s. Fortunately they only happen with 10+ concurrent dd's, which is not a typical NFS client workload.
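The "writeback pages wait queue" idea (cf. [PATCH 40/47]) can be sketched as a gate that caps the NFS writeback+unstable page count so it cannot squeeze out the dirty pages. A real implementation blocks the writer in the kernel until COMMITs return; this toy model merely reports whether a writer would have to wait, and the limit value is made up for illustration.

```python
class WritebackGate:
    """Toy cap on in-flight writeback/unstable pages for one NFS mount."""

    def __init__(self, max_writeback_pages):
        self.limit = max_writeback_pages
        self.in_flight = 0

    def try_submit(self, pages):
        """Return True if the write may proceed, False if it must wait."""
        if self.in_flight + pages > self.limit:
            return False             # writer would sleep until COMMITs return
        self.in_flight += pages
        return True

    def commit_returned(self, pages):
        """A COMMIT reply releases its pages, letting waiters proceed."""
        self.in_flight = max(0, self.in_flight - pages)

gate = WritebackGate(max_writeback_pages=1024)   # made-up limit
assert gate.try_submit(1000)
assert not gate.try_submit(100)    # would exceed the cap -> wait
gate.commit_returned(500)
assert gate.try_submit(100)        # proceeds once a COMMIT returns
```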
	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/nfs-commit-300.png

XFS

XFS performs almost ideally, except for some trivial imperfections: somewhere the lines are not straight.

dirty/writeback pages

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/xfs-1000dd-1M-24p-15976M-2.6.37-rc5-2010-12-10-18-18/vmstat-dirty.png

avg queue size and wait time

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/xfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-02-53/iostat-misc.png

bandwidth

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/xfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-02-53/dirty-bandwidth.png

Changes from v2 <http://lkml.org/lkml/2010/11/16/728>

- lock protected bdi bandwidth estimation
- user space think time compensation
- raise max pause time to 200ms for lower CPU overheads on concurrent dirtiers
- control system enhancements to handle large pause time and huge number of tasks
- concurrent dd test suite and a lot of tests
- adaptive scale-up of writeback chunk size
- make it right for small memory systems
- various bug fixes
- new trace points

Changes from initial RFC <http://thread.gmane.org/gmane.linux.kernel.mm/52966>

- adaptive rate limiting, to reduce overheads when under the throttle threshold
- prevent overrunning the dirty limit on lots of concurrent dirtiers
- add Documentation/filesystems/writeback-throttling-design.txt
- lower max pause time from 200ms to 100ms; min pause time from 10ms to 1 jiffy
- don't drop the laptop mode code
- update and comment the trace event
- benchmarks on concurrent dd and fs_mark covering both large and tiny files
- bdi->write_bandwidth updates should be rate limited on concurrent dirtiers, otherwise it will drift fast and fluctuate
- don't call balance_dirty_pages_ratelimited() when writing to already dirtied pages, otherwise the task will be throttled too much

bdi dirty limit fixes

[PATCH 01/47] writeback: enabling gate limit for light dirtied bdi
[PATCH 02/47] writeback: safety margin for bdi stat error

v2 patches rebased onto the above two fixes

[PATCH 03/47] writeback: IO-less balance_dirty_pages()
[PATCH 04/47] writeback: consolidate variable names in balance_dirty_pages()
[PATCH 05/47] writeback: per-task rate limit on balance_dirty_pages()
[PATCH 06/47] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls
[PATCH 07/47] writeback: account per-bdi accumulated written pages
[PATCH 08/47] writeback: bdi write bandwidth estimation
[PATCH 09/47] writeback: show bdi write bandwidth in debugfs
[PATCH 10/47] writeback: quit throttling when bdi dirty pages dropped low
[PATCH 11/47] writeback: reduce per-bdi dirty threshold ramp up time
[PATCH 12/47] writeback: make reasonable gap between the dirty/background thresholds
[PATCH 13/47] writeback: scale down max throttle bandwidth on concurrent dirtiers
[PATCH 14/47] writeback: add trace event for balance_dirty_pages()
[PATCH 15/47] writeback: make nr_to_write a per-file limit

trivial fixes for v2

[PATCH 16/47] writeback: make-nr_to_write-a-per-file-limit fix
[PATCH 17/47] writeback: do uninterruptible sleep in balance_dirty_pages()
[PATCH 18/47] writeback: move BDI_WRITTEN accounting into __bdi_writeout_inc()
[PATCH 19/47] writeback: fix increasement of nr_dirtied_pause
[PATCH 20/47] writeback: use do_div in bw calculation
[PATCH 21/47] writeback: prevent divide error on tiny HZ
[PATCH 22/47] writeback: prevent bandwidth calculation overflow

spinlock protected bandwidth estimation, as suggested by Peter

[PATCH 23/47] writeback: spinlock protected bdi bandwidth update

algorithm updates

[PATCH 24/47] writeback: increase pause time on concurrent dirtiers
[PATCH 25/47] writeback: make it easier to break from a dirty exceeded bdi
[PATCH 26/47] writeback: start background writeback earlier
[PATCH 27/47] writeback: user space think time compensation
[PATCH 28/47] writeback: bdi base throttle bandwidth
[PATCH 29/47] writeback: smoothed bdi dirty pages
[PATCH 30/47] writeback: adapt max balance pause time to memory size
[PATCH 31/47] writeback: increase min pause time on concurrent dirtiers

trace points

[PATCH 32/47] writeback: extend balance_dirty_pages() trace event
[PATCH 33/47] writeback: trace global dirty page states
[PATCH 34/47] writeback: trace writeback_single_inode()

larger writeback chunk size

[PATCH 35/47] writeback: scale IO chunk size up to device bandwidth

btrfs fixes

[PATCH 36/47] btrfs: dont call balance_dirty_pages_ratelimited() on already dirty pages
[PATCH 37/47] btrfs: lower the dirty balacing rate limit
[PATCH 38/47] btrfs: wait on too many nr_async_bios

nfs fixes

[PATCH 39/47] nfs: livelock prevention is now done in VFS
[PATCH 40/47] NFS: writeback pages wait queue
[PATCH 41/47] nfs: in-commit pages accounting and wait queue
[PATCH 42/47] nfs: heuristics to avoid commit
[PATCH 43/47] nfs: dont change wbc->nr_to_write in write_inode()
[PATCH 44/47] nfs: limit the range of commits
[PATCH 45/47] nfs: adapt congestion threshold to dirty threshold
[PATCH 46/47] nfs: trace nfs_commit_unstable_pages()
[PATCH 47/47] nfs: trace nfs_commit_release()

 Documentation/filesystems/writeback-throttling-design.txt |  210 ++++
 fs/btrfs/disk-io.c                                        |    7
 fs/btrfs/file.c                                           |   16
 fs/btrfs/ioctl.c                                          |    6
 fs/btrfs/relocation.c                                     |    6
 fs/fs-writeback.c                                         |   85 +
 fs/nfs/client.c                                           |    3
 fs/nfs/file.c                                             |    9
 fs/nfs/write.c                                            |  241 +++-
 include/linux/backing-dev.h                               |    9
 include/linux/nfs_fs.h                                    |    1
 include/linux/nfs_fs_sb.h                                 |    3
 include/linux/sched.h                                     |    8
 include/linux/writeback.h                                 |   26
 include/trace/events/nfs.h                                |   89 +
 include/trace/events/writeback.h                          |  195 +++
 mm/backing-dev.c                                          |   32
 mm/filemap.c                                              |    5
 mm/memory_hotplug.c                                       |    3
 mm/page-writeback.c                                       |  518 +++++++---
 20 files changed, 1212 insertions(+), 260 deletions(-)

Thanks,
Fengguang

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html