Script for trivial demo in attachment

$ bash test_writebehind.sh
SIZE
3,2G    dummy
vm.dirty_write_behind = 0
COPY

real    0m3.629s
user    0m0.016s
sys     0m3.613s
Dirty:           3254552 kB
SYNC

real    0m31.953s
user    0m0.002s
sys     0m0.000s
vm.dirty_write_behind = 1
COPY

real    0m32.738s
user    0m0.008s
sys     0m4.047s
Dirty:              2900 kB
SYNC

real    0m0.427s
user    0m0.000s
sys     0m0.004s
vm.dirty_write_behind = 2
COPY

real    0m32.168s
user    0m0.000s
sys     0m4.066s
Dirty:              3088 kB
SYNC

real    0m0.421s
user    0m0.004s
sys     0m0.001s

With vm.dirty_write_behind set to 1 or 2 the files are written even faster
overall (copy plus sync takes about 33s instead of about 35.6s), and during
copying the amount of dirty memory always stays at around 16MiB.

On 20/09/2019 10.35, Konstantin Khlebnikov wrote:
Traditional writeback tries to accumulate as much dirty data as possible.
This is a worthwhile strategy for extremely short-lived files and for
batching writes to save battery power. But for workloads where disk latency
is important this policy generates periodic disk load spikes which increase
latency for concurrent operations.

Also dirty pages in the file cache cannot be reclaimed and reused
immediately. This way massive I/O like file copying affects memory
allocation latency.

The present writeback engine allows tuning only the dirty data size or the
expiration time. Such tuning cannot eliminate spikes - it just lowers and
multiplies them. The other option is switching to sync mode, which flushes
written data right after each write; obviously this has a significant
performance impact. Such tuning is system-wide and also affects memory-mapped
and randomly written files, which flusher threads handle much better.

This patch implements a write-behind policy which tracks sequential writes
and starts background writeback when a file has enough dirty pages.

Global switch in sysctl vm.dirty_write_behind:
 =0: disabled, default
 =1: enabled for strictly sequential writes (append, copying)
 =2: enabled for all sequential writes

The only parameter is the window size: the maximum amount of dirty pages
behind the current position and the maximum amount of pages in background
writeback.

Setup is per-disk in sysfs in file /sys/block/$DISK/bdi/write_behind_kb.
Default: 16MiB, '0' disables write-behind for this disk.

When the amount of unwritten pages exceeds the window size, write-behind
starts background writeback for max(excess, max_sectors_kb) and then waits
for the same amount of background writeback initiated previously.
 |<-wait-this->|           |<-send-this->|<---pending-write-behind--->|
 |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
              current head-^    new head-^              file position-^

Remaining tail pages are flushed when the file is closed if async
write-behind was started, or if this is a new file and it is at least
max_sectors_kb long.

Overall behavior depending on total data size:
< max_sectors_kb  - no writes
> max_sectors_kb  - write new files in background after close
> write_behind_kb - streaming write, write tail at close

Special cases:

* files with POSIX_FADV_RANDOM, O_DIRECT, O_[D]SYNC are ignored

* writing cursor for O_APPEND is aligned to cover previous small appends
  Append might happen via multiple files or via a new file each time.

* mode vm.dirty_write_behind=1 ignores non-append writes
  This reacts only to completely sequential writes like copying files,
  writing logs with O_APPEND or rewriting files after O_TRUNC.

Note: ext4 feature "auto_da_alloc" also writes cache when closing a file
after truncating it to 0 and after renaming one file over another.

Changes since v1 (2017-10-02):
* rework window management:
* change default window 1MiB -> 16MiB
* change default request 256KiB -> max_sectors_kb
* drop always-async behavior for O_NONBLOCK
* drop handling POSIX_FADV_NOREUSE (should be in separate patch)
* ignore writes with O_DIRECT, O_SYNC, O_DSYNC
* align head position for O_APPEND
* add strictly sequential mode
* write tail pages for new files
* make void, keep errors at mapping

Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
Link: https://lore.kernel.org/patchwork/patch/836149/ (v1)
---
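
The window rule from the patch description (once the unwritten range exceeds
the window, send max(excess, max_sectors_kb) to background writeback) can be
illustrated with a toy shell model. This is only a sketch of the arithmetic,
not the kernel code: the function name write_behind_step and its arguments
are made up for illustration, all sizes are in KiB, and the value 1280 used
for max_sectors_kb below is just an example, not something taken from the
patch.

```shell
#!/bin/sh
# Toy model of the write-behind window rule (illustration only, not the
# kernel implementation). All sizes are in KiB.
#
# Arguments (hypothetical names):
#   $1  amount of unwritten data behind the current file position
#   $2  window size (write_behind_kb)
#   $3  minimum request size (max_sectors_kb)
#
# Prints how much data would be sent to background writeback:
#   0                           if the window is not exceeded
#   max(excess, max_sectors_kb) otherwise
write_behind_step() {
    wb_unwritten_kb=$1
    wb_window_kb=$2
    wb_request_kb=$3

    wb_excess_kb=$((wb_unwritten_kb - wb_window_kb))

    if [ "$wb_excess_kb" -le 0 ]; then
        echo 0
    elif [ "$wb_excess_kb" -gt "$wb_request_kb" ]; then
        echo "$wb_excess_kb"
    else
        echo "$wb_request_kb"
    fi
}
```

For example, with the default 16MiB window, "write_behind_step 20480 16384 1280"
prints 4096 (the excess is larger than the request size), while
"write_behind_step 16400 16384 1280" prints 1280 (a tiny excess is rounded up
to one full request).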
Attachment:
test_writebehind.sh
Description: application/shellscript
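
For readers without the attachment, a measurement like the one at the top of
this mail could be repeated along the following lines. This is a guess at the
general shape of such a test, not the attached test_writebehind.sh: the
helper names dirty_kb and run_demo and the source/destination arguments are
made up for illustration. run_demo needs root, bash (for the time keyword)
and a kernel carrying this patch, since vm.dirty_write_behind does not exist
otherwise.

```shell
#!/bin/bash
# Sketch only: NOT the attached test_writebehind.sh.

# Print the "Dirty:" value (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
dirty_kb() {
    awk '$1 == "Dirty:" { print $2 }' "${1:-/proc/meminfo}"
}

# Copy SRC to DST under each write-behind mode and report the copy time,
# the dirty memory left after the copy, and the time of the final sync.
# Assumes root and a kernel that provides the vm.dirty_write_behind sysctl.
run_demo() {
    demo_src=$1
    demo_dst=$2

    for demo_mode in 0 1 2; do
        # sysctl -w echoes the new setting, e.g. "vm.dirty_write_behind = 1"
        sysctl -w vm.dirty_write_behind="$demo_mode" || return 1

        echo COPY
        time cp "$demo_src" "$demo_dst"

        echo "Dirty: $(dirty_kb) kB"

        echo SYNC
        time sync

        rm -f "$demo_dst"
    done
}
```

Nothing runs until run_demo is called explicitly, e.g.
"run_demo /path/to/big/file /path/on/tested/disk/copy" (both paths are
placeholders).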