On Oct 2, 2017, at 10:58 PM, Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> wrote:
>
> On 02.10.2017 22:54, Linus Torvalds wrote:
>> On Mon, Oct 2, 2017 at 2:54 AM, Konstantin Khlebnikov
>> <khlebnikov@xxxxxxxxxxxxxx> wrote:
>>>
>>> This patch implements a write-behind policy which tracks sequential writes
>>> and starts background writeback once there are enough dirty pages in a row.
>>
>> This looks lovely to me.
>>
>> I do wonder if you also looked at finishing the background
>> write-behind at close() time, because it strikes me that once you
>> start doing that async writeout, it would probably be good to make
>> sure you try to do the whole file.
>
> Smaller files or tails are less of a problem, and forcing writeback there
> might add bigger overhead due to small requests or overly random IO.
> Also, an open+append+close pattern could generate too much IO.
>
>> I'm thinking of filesystems that do delayed allocation etc - I'd
>> expect that you'd want the whole file to get allocated on disk
>> together, rather than have the "first 256kB aligned chunks" allocated
>> thanks to write-behind, and then the final part allocated much later
>> (after other files may have triggered their own write-behind). Think
>> loads like copying lots of pictures around, for example.
>
> As far as I know, ext4 preallocates space beyond the file end for writing
> patterns like append + fsync, so the allocated extents should be bigger
> than 256k. I haven't looked into this yet.
>
>> I don't have any particularly strong feelings about this, but I do
>> suspect that once you have started that IO, you do want to finish it
>> all up as the file write is done. No?
>
> I'm aiming at continuous file operations like downloading a huge file
> or writing a verbose log. The original motivation came from low-latency
> server workloads which suffer from parallel bulk operations that generate
> tons of dirty pages. Probably for general-purpose usage the thresholds
> should be increased significantly to cover only really bulky patterns.
>
>> It would also be really nice to see some numbers. Perhaps a comparison
>> of "vmstat 1" or similar when writing a big file to some slow medium
>> like a USB stick (which is something we've done very very badly at,
>> and this should help smooth out)?
>
> I'll try to find some real cases with numbers.
>
> For now I see that a massive write + fdatasync (dd conv=fdatasync, fio)
> always finishes earlier because writeback now starts earlier too.
> Without fdatasync it's obviously slower.
>
> cp to a USB stick + umount should show the same result, plus cp could be
> interrupted at any point without contaminating the cache with dirty pages.
>
> Kernel compilation takes almost the same time because most files are
> smaller than 256k.

For what it's worth, Lustre clients have been doing "early writes" forever,
once at least a full/contiguous RPC worth (1MB) of dirty data is available,
because network bandwidth is a terrible thing to waste.

The oft-cited case of "app writes to a file that only lives a few seconds
on disk before it is deleted" is IMHO fairly rare in real life, mostly
dbench and back in the days of disk-based /tmp.

Delaying data writes for large files means that 30s * bandwidth of data
could have been written out before VM page aging kicks in, unless memory
pressure causes writeout first. With fast devices/networks, this might be
many GB of data filling up memory that could already have been written out.

Cheers, Andreas
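
For reference, the write-behind policy discussed above can be approximated
from userspace with sync_file_range(2): write sequentially and queue each
completed, aligned chunk for asynchronous writeback instead of letting dirty
pages accumulate until periodic writeback or dirty limits kick in. Below is
a minimal sketch of that idea, not the kernel patch itself; the 256 KiB
chunk size mirrors the threshold mentioned in the thread, while the output
file name and buffer size are arbitrary choices for illustration.

/* write-behind sketch: copy stdin to a file, flushing completed chunks */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (256 * 1024)      /* write-behind granularity (assumption) */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "out.dat";
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[64 * 1024];
    off_t written = 0, flushed = 0;
    ssize_t n;

    while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(fd, buf + off, n - off);
            if (w < 0) {
                perror("write");
                return 1;
            }
            off += w;
            written += w;
        }
        /* Once a full aligned chunk is dirty, start async writeback for it
         * without waiting for completion -- the "write-behind" part. */
        while (written - flushed >= CHUNK) {
            sync_file_range(fd, flushed, CHUNK, SYNC_FILE_RANGE_WRITE);
            flushed += CHUNK;
        }
    }

    /* Tail of the file: wait for everything to reach storage, roughly what
     * finishing write-behind at close() time would do. */
    fdatasync(fd);
    close(fd);
    return 0;
}

Run as e.g. "./writebehind out.dat < bigfile"; compared with a plain cp to a
slow device, dirty pages are drained continuously rather than in one large
burst at the end.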