Re: [PATCH v2] mm: implement write-behind policy for sequential file writes

Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> · Fri, 20 Sep 2019 10:39:33 +0300

Script for trivial demo in attachment

$ bash test_writebehind.sh
SIZE
3,2G	dummy
vm.dirty_write_behind = 0
COPY

real	0m3.629s
user	0m0.016s
sys	0m3.613s
Dirty:           3254552 kB
SYNC

real	0m31.953s
user	0m0.002s
sys	0m0.000s
vm.dirty_write_behind = 1
COPY

real	0m32.738s
user	0m0.008s
sys	0m4.047s
Dirty:              2900 kB
SYNC

real	0m0.427s
user	0m0.000s
sys	0m0.004s
vm.dirty_write_behind = 2
COPY

real	0m32.168s
user	0m0.000s
sys	0m4.066s
Dirty:              3088 kB
SYNC

real	0m0.421s
user	0m0.004s
sys	0m0.001s

With vm.dirty_write_behind 1 or 2 files are written even faster and
during copying amount of dirty memory always stays around at 16MiB.

On 20/09/2019 10.35, Konstantin Khlebnikov wrote:
Traditional writeback tries to accumulate as much dirty data as possible.
This is worth strategy for extremely short-living files and for batching
writes for saving battery power. But for workloads where disk latency is
important this policy generates periodic disk load spikes which increases
latency for concurrent operations.

Also dirty pages in file cache cannot be reclaimed and reused immediately.
This way massive I/O like file copying affects memory allocation latency.

Present writeback engine allows to tune only dirty data size or expiration
time. Such tuning cannot eliminate spikes - this just lowers and multiplies
them. Other option is switching into sync mode which flushes written data
right after each write, obviously this have significant performance impact.
Such tuning is system-wide and affects memory-mapped and randomly written
files, flusher threads handle them much better.

This patch implements write-behind policy which tracks sequential writes
and starts background writeback when file have enough dirty pages.

Global switch in sysctl vm.dirty_write_behind:
=0: disabled, default
=1: enabled for strictly sequential writes (append, copying)
=2: enabled for all sequential writes

The only parameter is window size: maximum amount of dirty pages behind
current position and maximum amount of pages in background writeback.

Setup is per-disk in sysfs in file /sys/block/$DISK/bdi/write_behind_kb.
Default: 16MiB, '0' disables write-behind for this disk.

When amount of unwritten pages exceeds window size write-behind starts
background writeback for max(excess, max_sectors_kb) and then waits for
the same amount of background writeback initiated at previously.

  |<-wait-this->|           |<-send-this->|<---pending-write-behind--->|
  |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
               current head-^    new head-^              file position-^

Remaining tail pages are flushed at closing file if async write-behind was
started or this is new file and it is at least max_sectors_kb long.

Overall behavior depending on total data size:
< max_sectors_kb - no writes
max_sectors_kb - write new files in background after close
write_behind_kb - streaming write, write tail at close

Special cases:

* files with POSIX_FADV_RANDOM, O_DIRECT, O_[D]SYNC are ignored

* writing cursor for O_APPEND is aligned to covers previous small appends
   Append might happen via multiple files or via new file each time.

* mode vm.dirty_write_behind=1 ignores non-append writes
   This reacts only to completely sequential writes like copying files,
   writing logs with O_APPEND or rewriting files after O_TRUNC.

Note: ext4 feature "auto_da_alloc" also writes cache at closing file
after truncating it to 0 and after renaming one file over other.

Changes since v1 (2017-10-02):
* rework window management:
* change default window 1MiB -> 16MiB
* change default request 256KiB -> max_sectors_kb
* drop always-async behavior for O_NONBLOCK
* drop handling POSIX_FADV_NOREUSE (should be in separate patch)
* ignore writes with O_DIRECT, O_SYNC, O_DSYNC
* align head position for O_APPEND
* add strictly sequential mode
* write tail pages for new files
* make void, keep errors at mapping

Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
Link: https://lore.kernel.org/patchwork/patch/836149/ (v1)
---
Attachment:
test_writebehind.sh

Description: application/shellscript