On Mon, Oct 02, 2017 at 12:54:53PM -0700, Linus Torvalds wrote:
> On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov
> <khlebnikov@xxxxxxxxxxxxxx> wrote:
> >
> > This patch implements write-behind policy which tracks sequential writes
> > and starts background writeback when have enough dirty pages in a row.
>
> This looks lovely to me.

Yup, it's a good idea. Needs some tweaking, though.

> I do wonder if you also looked at finishing the background
> write-behind at close() time, because it strikes me that once you
> start doing that async writeout, it would probably be good to make
> sure you try to do the whole file.

Inserting arbitrary pipeline bubbles is never good for performance.
Think untar: create file, write data, create file, write data,
create .... With async write-behind, it pipelines like this:

    create, write
            create, write
                    create, write
                            create, write

If we block on close, it becomes:

    create, write, wait
                        create, write, wait
                                            create, write, wait

Basically performance of things like cp, untar, etc will suck badly
if we wait for write behind on close() - it's essentially the same
thing as forcing these apps to fdatasync() after every file is
written....
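As a rough userspace illustration of that difference (the two helpers
below are made up for the example, and sync_file_range() is only an
analogy for what in-kernel write-behind would do - the patch itself
does not work through this interface):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Async write-behind analogue: kick writeback off for the file's
 * dirty range and move straight on to the next file. The next
 * create/write overlaps with this file's IO, so the pipeline
 * stays full.
 */
void writebehind_and_close(int fd, off_t len)
{
	sync_file_range(fd, 0, len, SYNC_FILE_RANGE_WRITE);
	close(fd);
}

/*
 * Waiting on close: block until the data has hit the device
 * before moving on. Every file now inserts a pipeline bubble -
 * effectively fdatasync() per file.
 */
void wait_and_close(int fd)
{
	fdatasync(fd);
	close(fd);
}

The first variant lets a copy loop keep issuing creates and writes
while the device drains; the second serialises every file behind its
own IO completion.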
> I'm thinking of filesystems that do delayed allocation etc - I'd
> expect that you'd want the whole file to get allocated on disk
> together, rather than have the "first 256kB aligned chunks" allocated
> thanks to write-behind, and then the final part allocated much later
> (after other files may have triggered their own write-behind). Think
> loads like copying lots of pictures around, for example.

Yeah, this is going to completely screw up delayed allocation
because it doesn't allow time to aggregate large extents in memory
before writeback and allocation occurs.

Compared to above, the untar behaviour for the existing writeback
mode is:

    create, create, create, create ......
                                          write, write, write, write

i.e. the data writeback is completely decoupled from the creation of
files. With delalloc, this means all the directory and inodes are
created in the syscall context, all closely packed together on disk,
and once that is done the data writeback starts allocating file
extents, which then get allocated as single extents and get packed
tightly together to give cross-file sequential write behaviour and
minimal seeks. And when the metadata gets written back, the internal
XFS algorithms will sort that all into sequentially issued IO that
gets merged into large sequential IO, too, further minimising seeks.

IOWs, what ends up on disk is:

    <metadata><lots of file data in large contiguous extents>

With writebehind, we'll end up with "alloc metadata, alloc data" for
each file being written, which will result in this sort of thing for
an untar:

    <m><ddd><m><d><m><ddddddd><m><ddd> .....

It also means we can no longer do cross-file sequentialisation to
minimise seeks on writeback, and metadata writeback will turn into a
massive seek fest instead of being nicely aggregated into large
writes.

If we consider large concurrent sequential writes, we have heuristics
in the delalloc code to get them built into larger-than-necessary
delalloc extents so we typically end up on disk with very large data
extents in each file (hundreds of MB to 8GB each) such that on disk
we end up with:

    <m><FILE 1 EXT 1><FILE 2 EXT 1><FILE 1 EXT 2> ....

With writebehind, we'll be doing allocation every MB, so end up with
lots of small interleaved extents on disk like this:

    <m><f1><f2><f3><f1><f3><f2><f1><f2><f3><f1> ....

(FYI, this is what happened with these workloads prior to the
addition of the speculative delalloc heuristics we have now.)

Now this won't affect overall write speed, but when we go to read the
file back, we have to seek every 1MB IO. IOWs, we might get the same
write performance because we're still doing sequential writes, but
the performance degradation is seen on the read side when we have to
access that data again.

Further, what happens when we free just file 1? Instead of getting
back a large contiguous free space extent, we get back a heap of
tiny, fragmented free spaces. IOWs, the interleaving causes the rapid
onset of filesystem aging symptoms which will further degrade
allocation behaviour as we'll quickly run out of large contiguous
free spaces to allocate from.

IOWs, rapid write-behind behaviour might not significantly affect
initial write performance on an empty filesystem. It will, in
general, increase file fragmentation, increase interleaving of
metadata and data, reduce metadata writeback and read performance,
increase free space fragmentation, reduce data read performance and
speed up the onset of aging related performance degradation.

Don't get me wrong - writebehind can be very effective for some
workloads. However, I think that a write-behind default of 1MB is
bordering on insane because it pushes most filesystems straight into
the above problems. At minimum, per-file writebehind needs to have a
much higher default threshold and writeback chunk size to allow
filesystems to avoid the above problems.

Perhaps we need to think about a small per-backing dev threshold
where the behaviour is the current writeback behaviour, but once it's
exceeded we then switch to write-behind so that the amount of dirty
data doesn't exceed that threshold. Make the threshold 2-3x the bdi's
current writeback throughput and we've got something that should
mostly self-tune to 2-3s of outstanding dirty data per backing dev
whilst mostly avoiding the issues with small, strict write-behind
thresholds.

Cheers,

Dave
-- 
Dave Chinner
david@xxxxxxxxxxxxx
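To illustrate the arithmetic of that self-tuning threshold (a made-up
userspace model only - the struct, the field names and the 3x
multiplier are all illustrative, not writeback code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Toy model of the per-backing-dev heuristic described above: stay
 * with normal writeback until the dirty data on this bdi exceeds a
 * few seconds worth of its measured writeback throughput, then
 * switch to write-behind so dirty data can't grow past that point.
 */
struct bdi_model {
	uint64_t dirty_bytes;		/* dirty data queued on this bdi */
	uint64_t write_bw_bytes_per_s;	/* measured writeback throughput */
};

static uint64_t writebehind_threshold(const struct bdi_model *bdi)
{
	/* ~3 seconds of dirty data at the current writeback rate */
	return 3 * bdi->write_bw_bytes_per_s;
}

static bool should_writebehind(const struct bdi_model *bdi)
{
	return bdi->dirty_bytes > writebehind_threshold(bdi);
}

int main(void)
{
	/* e.g. a disk doing 100MB/s with 500MB of dirty data queued */
	struct bdi_model bdi = {
		.dirty_bytes = 500ull << 20,
		.write_bw_bytes_per_s = 100ull << 20,
	};

	printf("threshold: %llu MB, write-behind: %s\n",
	       (unsigned long long)(writebehind_threshold(&bdi) >> 20),
	       should_writebehind(&bdi) ? "yes" : "no");
	return 0;
}

With a device measured at 100MB/s, the switch point sits at roughly
300MB of dirty data, i.e. about three seconds of backlog per backing
device.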