On Tue, Nov 12, 2024 at 11:50:46AM +0200, Kirill A. Shutemov wrote:
> On Tue, Nov 12, 2024 at 07:02:33PM +1100, Dave Chinner wrote:
> > I think the post-IO invalidation that these IOs do is largely
> > irrelevant to how the page cache processes the write. Indeed,
> > from userspace, the functionality in this patchset would be
> > implemented like this:
> >
> > oneshot_data_write(fd, buf, len, off)
> > {
> > 	/* write into page cache */
> > 	pwrite(fd, buf, len, off);
> >
> > 	/* force the write through the page cache */
> > 	sync_file_range(fd, off, len,
> > 			SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> >
> > 	/* Invalidate the single use data in the cache now it is on disk */
> > 	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> > }
> >
> > Allowing the application to control writeback and invalidation
> > granularity is a much more flexible solution to the problem here;
> > when IO is sequential, delayed allocation will be allowed to ensure
> > large contiguous extents are created and that will greatly reduce
> > file fragmentation on XFS, btrfs, bcachefs and ext4. For random
> > writes, it'll submit async IOs in batches...
> >
> > Given that io_uring already supports sync_file_range() and
> > posix_fadvise(), I'm wondering why we need a new IO API to perform
> > this specific write-through behaviour in a way that is less flexible
> > than what applications can already implement through existing
> > APIs....
>
> Attaching the hint to the IO operation allows the kernel to keep the
> data in the page cache if it is there for some other reason. You cannot
> do that with a separate syscall.

Sure we can. FADV_NOREUSE is attached to the struct file - that's
available to every IO that is done on that file. Hence we know, before
we start every IO on that file, whether we need to preserve only the
existing page cache or all the data we access.

Having a file marked like this doesn't affect any other application
that is accessing the same inode.
It just means that the specific fd opened by a specific process will
not perturb the long term residency of the page cache on that inode.

> Consider a scenario of a nightly backup of the data. The same data is
> in the cache because the actual workload needs it. You don't want the
> backup task to invalidate the data from the cache. Your snippet would
> do that.

The code I presented was essentially just a demonstration of what
"uncached IO" was doing: that it is actually cached IO, and that it
can be done from userspace right now. Yes, it's not exactly the same
cache invalidation semantics, but that's not the point. The point was
that the existing APIs are *much more flexible* than this proposal,
and we don't actually need new kernel functionality for applications
to see the same benchmark results as Jens has presented. All they need
is to be modified to use existing APIs.

The additional point to that end is that FADV_NOREUSE should be hooked
up to the conditional cache invalidation mechanism Jens added to the
page cache IO paths. Then we have all the functionality of this patch
set individually selectable by userspace applications without needing
a new IO API to be rolled out. i.e. the snippet then becomes:

	/* don't cache after IO */
	posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	....
	pwrite(fd, buf, len, off);

	/* write through */
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

Note how this doesn't need to block in sync_file_range() before doing
the invalidation anymore? We've separated the cache control behaviour
from the writeback behaviour. We can now do both write back and write
through buffered writes that clean up the page cache after IO
completion has occurred - write-through is not restricted to uncached
writes, nor is the cache purge after writeback completion restricted
to write-through IO. IOWs, we can do:

	/* don't cache after IO */
	posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	....
	off = pos;
	count = 4096;
	while (off < pos + len) {
		ret = pwrite(fd, buf, count, off);
		/* get more data and put it in buf */
		off += ret;
	}

	/* write through */
	sync_file_range(fd, pos, len, SYNC_FILE_RANGE_WRITE);

And now we only do one set of writeback on the file range, instead of
one per IO, and we still get the page cache being released on
writeback IO completion.

This is a *much* better API for IO and page cache control. It is not
constrained to individual IOs, so applications can allow the page
cache to write-combine data from multiple syscalls into a single
physical extent allocation and writeback IO. This is much more
efficient for modern filesystems - the "writeback per IO" model forces
filesystems like XFS and ext4 to work like ext3 did, and defeats
buffered write IO optimisations like delayed allocation. If we are
going to do small "allocation and write IO" patterns, we may as well
be using direct IO as it is optimised for that sort of behaviour.

So let's consider the backup application example. IMO, backup
applications really don't want to use this new uncached IO mechanism
for either reading or writing data.

Backup programs do sequential data read IO as they walk the backup
set - if they are doing buffered IO then we -really- want readahead to
be active. However, uncached IO turns off readahead, which is the
equivalent of the backup application doing:

	posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	while (len > 0) {
		ret = pread(fd, buf, len, off);
		posix_fadvise(fd, off, ret, POSIX_FADV_DONTNEED);
		/* do stuff with buf */
		off += ret;
		len -= ret;
	}

Sequential buffered read IO after setting FADV_RANDOM absolutely
*sucks* from a performance perspective.

This is when FADV_NOREUSE is useful. We can leave readahead turned on,
and when we do the first read from the page cache after readahead
completes, we can then apply the NOREUSE policy. i.e. if the data we
are reading has not been accessed, then turf it after reading if
NOREUSE is set.
If the data was already resident in cache, then leave it there as per
a normal read. IOWs, if we separate the cache control from the read IO
itself, there is no need to turn off readahead to implement "drop
cache on read" semantics. We just need to know if the folio has been
accessed or not to determine what to do with it.

Let's also consider the backup data file - that is written
sequentially. It's going to be large and we don't know its size ahead
of time. If we are using buffered writes we want delayed allocation to
optimise the file layout and hence writeback IO throughput. We also
want to drop the page cache when writeback eventually happens, but we
really don't want writeback to happen on every write.

IOWs, backup programs can take advantage of "drop cache when clean"
semantics, but can't really take any significant advantage from per-IO
write-through semantics. IOWs, backup applications really want
per-file NOREUSE write semantics that are separately controlled w.r.t.
cache write-through behaviour.

One of the points I tried to make was that the uncached IO proposal
smashes multiple disparate semantics into a single per-IO control bit.
The backup application example above shows exactly how that API isn't
actually very good for the applications that could benefit from the
functionality this patchset adds to the page cache to support that
single control bit...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx