On 21/09/16 05:48PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
>
> Recently, I've been observing very high runtimes of tests that format a
> filesystem atop a loop device and write enough data to fill memory, such
> as generic/590 and generic/361. Logging into the test VMs, I noticed
> that the writes to the file on the upper filesystem started fast, but
> soon slowed down to about 500KB/s and stayed that way for nearly 20
> minutes. Looking through the D-state processes on the system revealed:
>
> /proc/4350/comm = xfs_io
> /proc/4350/stack :
> [<0>] balance_dirty_pages+0x332/0xda0
> [<0>] balance_dirty_pages_ratelimited+0x304/0x400
> [<0>] iomap_file_buffered_write+0x1ab/0x260
> [<0>] xfs_file_buffered_write+0xba/0x330 [xfs]
> [<0>] new_sync_write+0x119/0x1a0
> [<0>] vfs_write+0x274/0x310
> [<0>] __x64_sys_pwrite64+0x89/0xc0
> [<0>] do_syscall_64+0x35/0x80
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> Here's the xfs_io process performing a buffered write to the file on the
> upper filesystem, which at this point has dirtied enough pages to be
> ratelimited.
>
> /proc/28/comm = u10:0+flush-8:80
> /proc/28/stack :
> [<0>] blk_mq_get_tag+0x11c/0x280
> [<0>] __blk_mq_alloc_request+0xce/0xf0
> [<0>] blk_mq_submit_bio+0x139/0x5b0
> [<0>] submit_bio_noacct+0x3ba/0x430
> [<0>] iomap_submit_ioend+0x4b/0x70
> [<0>] xfs_vm_writepages+0x86/0x170 [xfs]
> [<0>] do_writepages+0xcc/0x200
> [<0>] __writeback_single_inode+0x3d/0x300
> [<0>] writeback_sb_inodes+0x207/0x4a0
> [<0>] __writeback_inodes_wb+0x4c/0xe0
> [<0>] wb_writeback+0x1da/0x2c0
> [<0>] wb_workfn+0x2ad/0x4f0
> [<0>] process_one_work+0x1e2/0x3d0
> [<0>] worker_thread+0x53/0x3c0
> [<0>] kthread+0x149/0x170
> [<0>] ret_from_fork+0x1f/0x30
>
> This is a flusher thread that has invoked writeback on the upper
> filesystem to try to clean memory pages.
>
> /proc/89/comm = u10:7+loop0
> /proc/89/stack :
> [<0>] balance_dirty_pages+0x332/0xda0
> [<0>] balance_dirty_pages_ratelimited+0x304/0x400
> [<0>] iomap_file_buffered_write+0x1ab/0x260
> [<0>] xfs_file_buffered_write+0xba/0x330 [xfs]
> [<0>] do_iter_readv_writev+0x14f/0x1a0
> [<0>] do_iter_write+0x7b/0x1c0
> [<0>] lo_write_bvec+0x62/0x1c0
> [<0>] loop_process_work+0x3a4/0xba0
> [<0>] process_one_work+0x1e2/0x3d0
> [<0>] worker_thread+0x53/0x3c0
> [<0>] kthread+0x149/0x170
> [<0>] ret_from_fork+0x1f/0x30
>
> Here's the loop device worker handling the writeback IO submitted by the
> flusher thread. Unfortunately, the loop device is using buffered write
> mode, which means that /writeback/ is dirtying pages and being throttled
> for that. This is stupid.
>
> Fix this by trying to enable "directio" mode on the loop device, which
> delivers two performance benefits: setting directio mode also enables
> async io mode, which will allow multiple IOs at once; and using directio
> nearly eliminates the chance that writeback will get throttled.
>
> On the author's system with fast storage, this reduces the runtime of
> g/590 from 20 minutes to 12 seconds, and g/361 from ~30s to ~3s.
>

These observations looked interesting to me, so I happened to test this on
my setup and observed the same behavior. However, I don't really see the
timing improvement: both before and after this patch, generic/590 took
90-95 seconds. I did verify that with the patch applied I was mostly
seeing the DIO path for writes to the ext4 FS. Note: ext4 was my root
filesystem, where the fs images for the test and scratch filesystems were
kept.
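For reference, one straightforward way to check whether a loop device has
actually switched to direct I/O (a rough sketch; /dev/loop0 here is just a
placeholder for whichever device the test got):

  # Ask losetup to enable direct I/O on the loop device, then confirm it
  # took effect; the loop driver also exports the dio state in sysfs.
  losetup --direct-io=on /dev/loop0
  losetup --list /dev/loop0          # the DIO column should show 1
  cat /sys/block/loop0/loop/dio      # 1 => I/O goes direct to the backing file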
Pasting the call stacks observed on my setup:

PID: 6010    TASK: c000000030833380  CPU: 18  COMMAND: "kworker/u64:10"
 #0 [c00000002a787480] __schedule at c00000000139ba50
 #1 [c00000002a787550] schedule at c00000000139bcc4
 #2 [c00000002a787580] schedule_timeout at c0000000013a42bc
 #3 [c00000002a787650] io_schedule_timeout at c00000000139abe8
 #4 [c00000002a787680] balance_dirty_pages_ratelimited at c0000000004b01c0
 #5 [c00000002a787810] generic_perform_write at c0000000004965bc
 #6 [c00000002a787900] ext4_buffered_write_iter at c00000000077f3e4
 #7 [c00000002a787950] do_iter_readv_writev at c0000000005ef3a0
 #8 [c00000002a7879c0] do_iter_write at c0000000005f2568
 #9 [c00000002a787a10] lo_write_bvec at c000000000e506e8
#10 [c00000002a787ad0] loop_process_work at c000000000e523c4
#11 [c00000002a787c40] process_one_work at c00000000020e8d0
#12 [c00000002a787d10] worker_thread at c00000000020ef7c
#13 [c00000002a787da0] kthread at c00000000021cb74
#14 [c00000002a787e10] ret_from_kernel_thread at c00000000000cfd4

PID: 20175   TASK: c000000031de9980  CPU: 23  COMMAND: "kworker/u64:8"
 #0 [c000000023967360] __schedule at c00000000139ba50
 #1 [c000000023967430] schedule at c00000000139bcc4
 #2 [c000000023967460] schedule_timeout at c0000000013a42bc
 #3 [c000000023967530] io_schedule_timeout at c00000000139abe8
 #4 [c000000023967560] balance_dirty_pages_ratelimited at c0000000004b01c0
 #5 [c0000000239676f0] iomap_file_buffered_write at c0000000006e3018
 #6 [c0000000239678a0] xfs_file_buffered_write at c0000000009d9d20
 #7 [c000000023967950] do_iter_readv_writev at c0000000005ef3a0
 #8 [c0000000239679c0] do_iter_write at c0000000005f2568
 #9 [c000000023967a10] lo_write_bvec at c000000000e506e8
#10 [c000000023967ad0] loop_process_work at c000000000e523c4
#11 [c000000023967c40] process_one_work at c00000000020e8d0
#12 [c000000023967d10] worker_thread at c00000000020ef7c
#13 [c000000023967da0] kthread at c00000000021cb74
#14 [c000000023967e10] ret_from_kernel_thread at c00000000000cfd4

PID: 23482   TASK: c000000008e3cc80  CPU: 1   COMMAND: "kworker/u64:3"
 #0 [c0000000297c7160] __schedule at c00000000139ba50
 #1 [c0000000297c7230] schedule at c00000000139bcc4
 #2 [c0000000297c7260] io_schedule at c00000000139bdd4
 #3 [c0000000297c7290] blk_mq_get_tag at c000000000c4483c
 #4 [c0000000297c7320] __blk_mq_alloc_request at c000000000c3b038
 #5 [c0000000297c7360] blk_mq_submit_bio at c000000000c403dc
 #6 [c0000000297c7410] submit_bio_noacct at c000000000c2cc38
 #7 [c0000000297c74c0] iomap_do_writepage at c0000000006e2308
 #8 [c0000000297c75a0] write_cache_pages at c0000000004b0fb4
 #9 [c0000000297c76e0] iomap_writepages at c0000000006df2bc
#10 [c0000000297c7710] xfs_vm_writepages at c0000000009c2f2c
#11 [c0000000297c77d0] do_writepages at c0000000004b28e0
#12 [c0000000297c7890] __writeback_single_inode at c00000000064ea4c
#13 [c0000000297c7900] writeback_sb_inodes at c00000000064f954
#14 [c0000000297c79f0] __writeback_inodes_wb at c00000000064fdc4
#15 [c0000000297c7a50] wb_writeback at c0000000006500fc
#16 [c0000000297c7b00] wb_workfn at c000000000651908
#17 [c0000000297c7c40] process_one_work at c00000000020e8d0
#18 [c0000000297c7d10] worker_thread at c00000000020ef7c
#19 [c0000000297c7da0] kthread at c00000000021cb74
#20 [c0000000297c7e10] ret_from_kernel_thread at c00000000000cfd4

PID: 26082   TASK: c000000029740080  CPU: 3   COMMAND: "xfs_io"
 #0 [c000000031937610] __schedule at c00000000139ba50
 #1 [c0000000319376e0] schedule at c00000000139bcc4
 #2 [c000000031937710] schedule_timeout at c0000000013a42bc
 #3 [c0000000319377e0] io_schedule_timeout at c00000000139abe8
 #4 [c000000031937810] balance_dirty_pages_ratelimited at c0000000004b01c0
 #5 [c0000000319379a0] iomap_file_buffered_write at c0000000006e3018
 #6 [c000000031937b50] xfs_file_buffered_write at c0000000009d9d20
 #7 [c000000031937c00] new_sync_write at c0000000005ef194
 #8 [c000000031937ca0] vfs_write at c0000000005f3bb4
 #9 [c000000031937cf0] sys_pwrite64 at c0000000005f3f3c
#10 [c000000031937d50] system_call_exception at c000000000035ba0
#11 [c000000031937e10] system_call_common at c00000000000c764

-ritesh

> Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> ---
>  common/rc |    8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/common/rc b/common/rc
> index 275b1f24..a174b695 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -3849,6 +3849,14 @@ _create_loop_device()
>  {
>  	local file=$1 dev
>  	dev=`losetup -f --show $file` || _fail "Cannot assign $file to a loop device"
> +
> +	# Try to enable asynchronous directio mode on the loopback device so
> +	# that writeback started by a filesystem mounted on the loop device
> +	# won't be throttled by buffered writes to the lower filesystem. This
> +	# is a performance optimization for tests that want to write a lot of
> +	# data, so it isn't required to work.
> +	test -b "$dev" && losetup --direct-io=on $dev 2> /dev/null
> +
>  	echo $dev
> }
>
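For anyone less familiar with the fstests helpers, here is a rough sketch of
how a test ends up going through this path (the image name and size are made
up for illustration; _mkfs_dev and _destroy_loop_device are the existing
helpers in common/rc):

  # Back a loop device with a sparse file on the test filesystem; with this
  # patch, _create_loop_device also tries to flip the device into dio mode.
  img=$TEST_DIR/loopimg
  $XFS_IO_PROG -f -c "truncate 10g" $img
  loop_dev=$(_create_loop_device $img)
  _mkfs_dev $loop_dev

  # ... mount $loop_dev and do the heavy buffered writes ...

  _destroy_loop_device $loop_dev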