On 12/18/24 10:16 AM, Mike Snitzer wrote:
> On Fri, Dec 13, 2024 at 08:55:14AM -0700, Jens Axboe wrote:
>> Hi,
>>
>> 5 years ago I posted patches adding support for RWF_UNCACHED, as a way
>> to do buffered IO that isn't page cache persistent. The approach back
>> then was to have private pages for IO, and then get rid of them once IO
>> was done. But that then runs into all the issues that O_DIRECT has, in
>> terms of synchronizing with the page cache.
>>
>> So here's a new approach to the same concept, but using the page cache
>> as synchronization. Due to excessive bike shedding on the naming, this
>> is now named RWF_DONTCACHE, and is less special in that it's just page
>> cache IO, except it prunes the ranges once IO is completed.
>>
>> Why do this, you may ask? The tldr is that device speeds are only
>> getting faster, while reclaim is not. Doing normal buffered IO can be
>> very unpredictable, and suck up a lot of resources on the reclaim side.
>> This leads people to use O_DIRECT as a work-around, which has its own
>> set of restrictions in terms of size, offset, and length of IO. It's
>> also inherently synchronous, and now you need async IO as well. While
>> the latter isn't necessarily a big problem as we have good options
>> available there, it also should not be a requirement when all you want
>> to do is read or write some data without caching.
>>
>> Even on desktop type systems, a normal NVMe device can fill the entire
>> page cache in seconds. On the big system I used for testing, there's a
>> lot more RAM, but also a lot more devices. As can be seen in some of the
>> results in the following patches, you can still fill RAM in seconds even
>> when there's 1TB of it. Hence this problem isn't solely a "big
>> hyperscaler system" issue, it's common across the board.
>>
>> Common for both reads and writes with RWF_DONTCACHE is that they use the
>> page cache for IO.
>> Reads work just like a normal buffered read would,
>> with the only exception being that the touched ranges will get pruned
>> after data has been copied. For writes, the ranges will get writeback
>> kicked off before the syscall returns, and then writeback completion
>> will prune the range. Hence writes aren't synchronous, and it's easy to
>> pipeline writes using RWF_DONTCACHE. Folios that aren't instantiated by
>> RWF_DONTCACHE IO are left untouched. This means that uncached IO
>> will take advantage of the page cache for uptodate data, but not leave
>> anything it instantiated/created in cache.
>>
>> File systems need to support this. This patchset adds support for the
>> generic read path, which covers file systems like ext4. Patches exist to
>> add support for iomap/XFS and btrfs as well, which sit on top of this
>> series. If RWF_DONTCACHE IO is attempted on a file system that doesn't
>> support it, -EOPNOTSUPP is returned. Hence the user can rely on it
>> either working as designed, or flagging an error if that's not the
>> case. The intent here is to give the application a sensible fallback
>> path - eg, it may fall back to O_DIRECT if appropriate, or just live
>> with the fact that uncached IO isn't available and do normal buffered
>> IO.
>>
>> Adding "support" to other file systems should be trivial, most of the
>> time just a one-liner adding FOP_DONTCACHE to the fop_flags in the
>> file_operations struct.
>>
>> Performance results are in patch 8 for reads, and you can find the write
>> side results in the XFS patch adding support for DONTCACHE writes for
>> XFS:
>>
>> https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached.9&id=edd7b1c910c5251941c6ba179f44b4c81a089019
>>
>> with the tldr being that I see about a 65% improvement in performance
>> for both, with fully predictable IO times. CPU reduction is substantial
>> as well, with no kswapd activity at all for reclaim when using
>> uncached IO.
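To illustrate the one-liner described above, a sketch of what opting in looks like for a hypothetical file system "myfs" (the struct name and methods are illustrative; FOP_DONTCACHE is the fop_flags bit this series adds):

```c
/* Sketch: opting a file system into RWF_DONTCACHE support.
 * "myfs" and the generic_* methods shown are illustrative;
 * only the FOP_DONTCACHE flag is from this series. */
static const struct file_operations myfs_file_operations = {
	.llseek		= generic_file_llseek,
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	.mmap		= generic_file_mmap,
	.fop_flags	= FOP_DONTCACHE,
};
```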
>>
>> Using it from applications is trivial - just set RWF_DONTCACHE for the
>> read or write, using pwritev2(2) or preadv2(2). For io_uring, same
>> thing, just set RWF_DONTCACHE in sqe->rw_flags for a buffered read/write
>> operation. And that's it.
>>
>> Patches 1..7 are just prep patches, and should have no functional
>> changes at all. Patch 8 adds support for the filemap path for
>> RWF_DONTCACHE reads, and patches 9..11 are just prep patches for
>> supporting the write side of uncached writes. In the below mentioned
>> branch, there are then patches to adopt uncached reads and writes for
>> xfs, btrfs, and ext4. The latter currently relies on a bit of a hack for
>> passing whether this is an uncached write or not through
>> ->write_begin(), which can hopefully go away once ext4 adopts iomap for
>> buffered writes. I say this is a hack as it's not the prettiest way to
>> do it, however it is fully solid and will work just fine.
>>
>> Passes full xfstests and fsx overnight runs, no issues observed. That
>> includes the vm running the testing also using RWF_DONTCACHE on the
>> host. I'll post fsstress and fsx patches for RWF_DONTCACHE separately.
>> As far as I'm concerned, no further work needs doing here.
>>
>> And git tree for the patches is here:
>>
>> https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.9
>>
>>  include/linux/fs.h             | 21 +++++++-
>>  include/linux/page-flags.h     |  5 ++
>>  include/linux/pagemap.h        |  1 +
>>  include/trace/events/mmflags.h |  3 +-
>>  include/uapi/linux/fs.h        |  6 ++-
>>  mm/filemap.c                   | 97 +++++++++++++++++++++++++++++-----
>>  mm/internal.h                  |  2 +
>>  mm/readahead.c                 | 22 ++++++--
>>  mm/swap.c                      |  2 +
>>  mm/truncate.c                  | 54 ++++++++++---------
>>  10 files changed, 166 insertions(+), 47 deletions(-)
>>
>> Since v6
>> - Rename the PG_uncached flag to PG_dropbehind
>> - Shuffle patches around a bit, most notably so the foliop_uncached
>>   patch goes with the ext4 support
>> - Get rid of foliop_uncached hack for btrfs (Christoph)
>> - Get rid of passing in struct address_space to filemap_create_folio()
>> - Inline invalidate_complete_folio2() in folio_unmap_invalidate() rather
>>   than keep it as a separate helper
>> - Rebase on top of current master
>>
>> --
>> Jens Axboe
>
> Hi Jens,
>
> You may recall I tested NFS to work with UNCACHED (now DONTCACHE).
> I've rebased the required small changes, feel free to append this to
> your series if you like.
>
> More work is needed to inform knfsd to selectively use DONTCACHE, but
> that will require more effort and coordination amongst the NFS kernel
> team.

Thanks Mike, I'll add it to the part 2 mix.

--
Jens Axboe