On Tue, Dec 03, 2024 at 08:31:36AM -0700, Jens Axboe wrote:
> Hi,
> 
> 5 years ago I posted patches adding support for RWF_UNCACHED, as a way
> to do buffered IO that isn't page cache persistent. The approach back
> then was to have private pages for IO, and then get rid of them once IO
> was done. But that then runs into all the issues that O_DIRECT has, in
> terms of synchronizing with the page cache.
> 
> So here's a new approach to the same concept, but using the page cache
> as synchronization. That makes RWF_UNCACHED less special, in that it's
> just page cache IO, except it prunes the ranges once IO is completed.
> 
> Why do this, you may ask? The tldr is that device speeds are only
> getting faster, while reclaim is not. Doing normal buffered IO can be
> very unpredictable, and suck up a lot of resources on the reclaim side.
> This leads people to use O_DIRECT as a work-around, which has its own
> set of restrictions in terms of size, offset, and length of IO. It's
> also inherently synchronous, and now you need async IO as well. While
> the latter isn't necessarily a big problem as we have good options
> available there, it also should not be a requirement when all you want
> to do is read or write some data without caching.
> 
> Even on desktop type systems, a normal NVMe device can fill the entire
> page cache in seconds. On the big system I used for testing, there's a
> lot more RAM, but also a lot more devices. As can be seen in some of the
> results in the following patches, you can still fill RAM in seconds even
> when there's 1TB of it. Hence this problem isn't solely a "big
> hyperscaler system" issue, it's common across the board.
> 
> Common for both reads and writes with RWF_UNCACHED is that they use the
> page cache for IO. Reads work just like a normal buffered read would,
> with the only exception being that the touched ranges will get pruned
> after data has been copied. For writes, the ranges will get writeback
> kicked off before the syscall returns, and then writeback completion
> will prune the range. Hence writes aren't synchronous, and it's easy to
> pipeline writes using RWF_UNCACHED. Folios that aren't instantiated by
> RWF_UNCACHED IO are left untouched. This means that uncached IO
> will take advantage of the page cache for uptodate data, but not leave
> anything it instantiated/created in cache.
> 
> File systems need to support this. The patches add support for the
> generic filemap helpers, and for iomap. Then ext4 and XFS are marked as
> supporting it. The last patch adds support for btrfs as well, lightly
> tested. The read side is already done by filemap, only the write side
> needs a bit of help. The amount of code here is really trivial, and the
> only reason the fs opt-in is necessary is to have an RWF_UNCACHED IO
> return -EOPNOTSUPP just in case the fs doesn't use either the generic
> paths or iomap. Adding "support" to other file systems should be
> trivial, most of the time just a one-liner adding FOP_UNCACHED to the
> fop_flags in the file_operations struct.
> 
> Performance results are in patch 8 for reads and patch 10 for writes,
> with the tldr being that I see about a 65% improvement in performance
> for both, with fully predictable IO times. CPU reduction is substantial
> as well, with no kswapd activity at all for reclaim when using uncached
> IO.
> 
> Using it from applications is trivial - just set RWF_UNCACHED for the
> read or write, using pwritev2(2) or preadv2(2). For io_uring, same
> thing, just set RWF_UNCACHED in sqe->rw_flags for a buffered read/write
> operation. And that's it.
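
For anyone wanting to try this from userspace, my (untested) reading of
the interface described above is roughly the sketch below -- RWF_UNCACHED
is the new uapi flag from this series (pick it up from the updated
<linux/fs.h>), and the helper names are just made up for illustration:

    #define _GNU_SOURCE
    #include <sys/uio.h>      /* preadv2(), pwritev2(), struct iovec */

    /* Uncached buffered write: data still goes through the page cache,
     * but the written range is pruned once writeback completes. */
    static ssize_t uncached_write(int fd, const void *buf, size_t len,
                                  off_t off)
    {
            struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

            return pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
    }

    /* Uncached buffered read: behaves like a normal buffered read,
     * except that folios instantiated by this call are dropped again
     * after the data has been copied out. */
    static ssize_t uncached_read(int fd, void *buf, size_t len, off_t off)
    {
            struct iovec iov = { .iov_base = buf, .iov_len = len };

            return preadv2(fd, &iov, 1, off, RWF_UNCACHED);
    }

For io_uring I assume it's the usual io_uring_prep_read()/
io_uring_prep_write() setup, with sqe->rw_flags = RWF_UNCACHED set
before submitting, per the above.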
> 
> Patches 1..7 are just prep patches, and should have no functional
> changes at all. Patch 8 adds support for the filemap path for
> RWF_UNCACHED reads, patch 11 adds support for filemap RWF_UNCACHED
> writes. In the below mentioned branch, there are then patches to
> adopt uncached reads and writes for ext4, xfs, and btrfs.
> 
> Passes full xfstests and fsx overnight runs, no issues observed. That
> includes the vm running the testing also using RWF_UNCACHED on the host.
> I'll post fsstress and fsx patches for RWF_UNCACHED separately. As far
> as I'm concerned, no further work needs doing here.
> 
> And git tree for the patches is here:
> 
> https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.8

Oh good, I much prefer browsing git branches these days. :)

* mm/filemap: change filemap_create_folio() to take a struct kiocb
* mm/readahead: add folio allocation helper
* mm: add PG_uncached page flag
* mm/readahead: add readahead_control->uncached member
* mm/filemap: use page_cache_sync_ra() to kick off read-ahead
* mm/truncate: add folio_unmap_invalidate() helper

The mm patches look ok to me, but I think you ought to get at least an
ack from willy since they're largely pagecache changes.

* fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag

See more detailed reply in the thread.

* mm/filemap: add read support for RWF_UNCACHED

Looks cleaner now that we don't even unmap the page if it's dirty.

* mm/filemap: drop uncached pages when writeback completes
* mm/filemap: add filemap_fdatawrite_range_kick() helper
* mm/filemap: make buffered writes work with RWF_UNCACHED

See more detailed reply in the thread.

* mm: add FGP_UNCACHED folio creation flag

I appreciate that !UNCACHED callers of __filemap_get_folio now clear the
uncached bit if it's set.

Now I proceed into the rest of your branch, because I felt like it:

* ext4: add RWF_UNCACHED write support

(Dunno about the WARN_ON removals in this patch, but this is really
Ted's call anyway.)

* iomap: make buffered writes work with RWF_UNCACHED

The commit message references an "iocb_uncached_write", but I don't find
any such function in the extended patchset? Otherwise this looks ready
to me. Thanks for changing it to only set uncached if we're actually
creating a folio, and not just returning one that was already in the
pagecache.

* xfs: punt uncached write completions to the completion wq

Dumb nit: spaces between "IOMAP_F_SHARED|IOMAP_F_UNCACHED" in this
patch.

* xfs: flag as supporting FOP_UNCACHED

Otherwise the xfs changes look ready too.

* btrfs: add support for uncached writes
* block: support uncached IO

Not sure why the definition of bio_dirty_lock gets moved around, but in
principle this looks ok to me too.

For the whole pile of mm changes (aka patches 1-6,8-10,12),
Acked-by: "Darrick J. Wong" <djwong@xxxxxxxxxx>

--D

> 
>  include/linux/fs.h             |  21 +++++-
>  include/linux/page-flags.h     |   5 ++
>  include/linux/pagemap.h        |  14 ++++
>  include/trace/events/mmflags.h |   3 +-
>  include/uapi/linux/fs.h        |   6 +-
>  mm/filemap.c                   | 114 +++++++++++++++++++++++++++++----
>  mm/readahead.c                 |  22 +++++-
>  mm/swap.c                      |   2 +
>  mm/truncate.c                  |  35 ++++++----
>  9 files changed, 187 insertions(+), 35 deletions(-)
> 
> Since v5
> - Skip invalidation in filemap_uncached_read() if the folio is dirty
>   as well, retaining the uncached setting for later cleaning to do
>   the actual invalidation.
> - Use the same trylock approach in read invalidation as the writeback
>   invalidation does.
> - Swap order of patches 10 and 11 to fix a bisection issue.
> - Split core mm changes and fs series patches. Once the generic side
>   has been approved, I'll send out the fs series separately.
> - Rebase on 6.13-rc1
> 
> -- 
> Jens Axboe