On 10/02/2017 03:54 AM, Konstantin Khlebnikov wrote:
> Traditional writeback tries to accumulate as much dirty data as possible.
> This is a worthwhile strategy for extremely short-lived files and for
> batching writes to save battery power. But for workloads where disk latency
> is important, this policy generates periodic disk load spikes, which
> increase latency for concurrent operations.
>
> The present writeback engine allows tuning only the dirty data size or
> expiration time. Such tuning cannot eliminate spikes - it just lowers and
> multiplies them. The other option is switching into sync mode, which flushes
> written data right after each write; obviously this has a significant
> performance impact. Such tuning is system-wide and affects memory-mapped and
> randomly written files, which flusher threads handle much better.
>
> This patch implements a write-behind policy which tracks sequential writes
> and starts background writeback when there are enough dirty pages in a row.

This is a great idea in general. My only concerns would be around cases
where we don't expect the writes to ever make it to media. It's not an
uncommon use case - app dirties some memory in a file, and expects to
truncate/unlink it before it makes it to disk. We don't want to trigger
writeback for those. Arguably that should be app hinted.

> Write-behind tracks the current writing position and looks into two windows
> behind it: the first represents unwritten pages, the second - async
> writeback.
>
> The next write starts background writeback when the first window exceeds
> the threshold and waits for pages falling behind the async writeback window.
> This allows combining small writes into bigger requests and maintaining
> optimal io-depth.
>
> This affects only writes via syscalls; memory-mapped writes are unchanged.
> Also, write-behind doesn't affect files with fadvise POSIX_FADV_RANDOM.
>
> If the async window is set to 0, then write-behind skips dirty pages for a
> congested disk and never waits for writeback.
> This is used for files with O_NONBLOCK.
>
> Also, for files with fadvise POSIX_FADV_NOREUSE, write-behind automatically
> evicts completely written pages from cache. This is perfect for writing
> verbose logs without pushing more important data out of cache.
>
> As a bonus, write-behind makes blkio throttling much smoother for most
> bulk file operations like copying or downloading, which write sequentially.
>
> Size of minimal write-behind request is set in:
> /sys/block/$DISK/bdi/min_write_behind_kb
> Default is 256Kb, 0 - disable write-behind for this disk.
>
> Size of async window set in:
> /sys/block/$DISK/bdi/async_write_behind_kb
> Default is 1024Kb, 0 - disables sync write-behind.

Should we expose these, or just make them a function of the IO limitations
exposed by the device? Something like 2x max request size, or similar.

Finally, do you have any test results?

--
Jens Axboe
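
For discussion purposes, the two-window policy described in the quoted text
can be modeled roughly as below. This is a toy Python sketch, not the patch's
kernel code: the class name, field names, event tuples, and the handling of
non-sequential writes are all assumptions made for illustration. Only the two
defaults (256 and 1024 KB) and the "0 disables" behaviour come from the patch
description. Congestion skipping, O_NONBLOCK, and fadvise handling are not
modeled.

```python
from dataclasses import dataclass, field


@dataclass
class WriteBehindSketch:
    """Hypothetical model of the two windows; all offsets and sizes in KB."""
    min_write_behind_kb: int = 256     # threshold for the unwritten window
    async_write_behind_kb: int = 1024  # size of the async writeback window
    stream_start: int = 0              # where the current sequential run began
    pos: int = 0                       # end of the last write in the run
    dirty_start: int = 0               # start of the not-yet-submitted pages
    events: list = field(default_factory=list)

    def write(self, offset: int, length: int) -> None:
        if self.min_write_behind_kb == 0:
            return  # write-behind disabled for this disk
        if offset != self.pos:
            # Assumption: a non-sequential write restarts tracking.
            self.stream_start = offset
            self.dirty_start = offset
        self.pos = offset + length
        if self.pos - self.dirty_start < self.min_write_behind_kb:
            return  # unwritten window still below the threshold
        # First window reached the threshold: start background writeback.
        self.events.append(("start_writeback", self.dirty_start, self.pos))
        self.dirty_start = self.pos
        # Pages that fell behind the async window must finish writeback.
        wait_end = self.pos - self.async_write_behind_kb
        if self.async_write_behind_kb > 0 and wait_end > self.stream_start:
            self.events.append(("wait_for_writeback_before", wait_end))


# Usage: twenty sequential 64 KB writes with the default settings.
wb = WriteBehindSketch()
for i in range(20):
    wb.write(i * 64, 64)
for event in wb.events:
    print(event)
```

In this model small writes are batched into requests of at least
min_write_behind_kb, and the writer only blocks once more than
async_write_behind_kb of submitted data is behind it. Setting the async
window to 0 leaves only the "start_writeback" events, i.e. the writer never
waits.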