Part of the testing we have done with LBS was to run some performance tests on XFS to ensure things are not regressing. Building Linux is a decent test, and we ran it on some random cloud instances and presented the results at Plumbers, but it doesn't really cut it if we want to push things to the limit. What are the limits of buffered IO and how do we test that? Who keeps track of it?

The obvious recurring tension is that for really high performance folks just recommend using direct IO. But if you are stress testing changes to a filesystem and want to push buffered IO to its limits it makes sense to stick to buffered IO, otherwise how else do we test it? It is also good to know the limits of buffered IO because some workloads cannot use direct IO. For instance PostgreSQL doesn't have direct IO support, and even as late as the end of last year we learned that adding direct IO to PostgreSQL would be difficult. Chris Mason has also noted that direct IO can force writes during reads (?)... Anyway, testing the limits of buffered IO to ensure you are not creating regressions when doing page cache surgery seems like a useful and sensible thing to do.

The good news is we have not found regressions with LBS, but all this testing begs the question: what are the limits of buffered IO anyway, and how does it scale? Do we know, do we care? Do we keep track of it? How does it compare to direct IO for some workloads? How big is the delta? How do we best test that? How do we automate all that? Do we want to test this automatically to avoid regressions?

The obvious issue with buffered IO for some workloads is a possible penalty when you are not really re-using the folios added to the page cache. Jens Axboe reported a while ago issues with workloads doing random reads over a data set 10x the size of RAM, and proposed RWF_UNCACHED as a way to help [0] [1]. As Chinner put it, this seemed more like direct IO with kernel pages and a memcpy(), and it requires implementing further serialization that we already do for direct IO writes. There at least seems to be agreement that if we're going to provide an enhancement or alternative, we should strive not to repeat the mistakes we made with direct IO. The rationale for some workloads to use buffered IO is that it helps reduce some tail latencies, so that's something to live up to. On that same thread Christoph also mentioned the possibility of a direct IO variant which can leverage the cache. Is that something we want to move forward with? Chris Mason also listed a few other desirables if we do:

- Allowing concurrent writes (xfs DIO does this now)
- Optionally doing zero copy if alignment is good (btrfs DIO does this now)
- Optionally tossing pages at the end (requires a separate syscall now)
- Supporting aio via io_uring

I recently ran a different type of simple test, focused on sequential writes to fill capacity, with the amount of data written essentially matching the amount of RAM. Technically, for the largest size I tested, the writes went just *slightly* over the RAM, but that's a minor technicality given other tests with similar sizes showed similar results... This test should be easy to reproduce if you have more than enough RAM to spare. In this case the system has 1 TiB of RAM, and I used pmem to avoid drive variance, GC, and other drive shenanigans.

So the pmem grub setup was:

  memmap=500G!4G memmap=3G!504G
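For completeness, the mkfs / mount steps on top of those carve-outs were along these lines. Treat this as a sketch rather than a verbatim copy of what I ran: the memmap regions typically show up as /dev/pmem0 and /dev/pmem1, but the device names and exact flags may differ on your system, and a 64k block size filesystem only mounts on a 4k page size x86_64 system with the LBS patches applied:

  # 500 GiB pmem region as the data device, 3 GiB region as an external log
  # (XFS caps the log at 2 GiB, hence the explicit size)
  mkfs.xfs -f -b size=64k -d agcount=1024 -l logdev=/dev/pmem1,size=2000m /dev/pmem0
  mount -o logdev=/dev/pmem1 /dev/pmem0 /mnt

Keeping the journal on its own pmem region keeps log IO off the data device, so it doesn't get mixed into the throughput numbers.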
As noted earlier, surely DIO / DAX is best for pmem (and I do measure a difference between plain DIO and DAX, but that's a digression), but when one wants to test buffered IO on purpose this setup makes sense. Yes, we can test tmpfs too... but I believe that topic will be brought up at LSFMM separately.

The delta between DIO and buffered IO on XFS (64k block size, agcount=1024, x86_64) is astronomical:

  ~86 GiB/s with direct IO on pmem
  vs
  ~7,000 MiB/s with buffered IO

To recap: the system had 1 TiB of RAM, about half of which was dedicated to pmem, split into two pmem partitions, one of 500 GiB for XFS and another of 3 GiB for an XFS external journal. The system has DDR5 with a cap of 35.76 GiB/s. The proprietary Intel tool to test RAM crashes on me... and leaves a kernel splat that seems to indicate the tool is doing something we don't allow anymore, so I couldn't get an aggregate baseline on just RAM. The kernel used was 6.6.0-rc5 + our LBS changes on top.

The buffered IO fio incantation:

  fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring -direct=0 \
      --group_reporting=1 --alloc-size=1048576 --filesize=1GiB --readwrite=write \
      --fallocate=none --numjobs=$(nproc) --create_on_open=1 --directory=/mnt

This scales up to nproc (128) jobs, each writing 10 files of up to 1 GiB each, so the target is 128 x 10 x 1 GiB = 1.25 TiB, which easily fills the ~500 GiB filesystem. In practice each thread ends up writing only about ~300 - 400 MiB to each of its 10 files, as we run out of space fast. Use -direct=1 for direct IO and -direct=0 for buffered IO.

Perhaps this is a silly topic to bring up, I don't know, but the max throughput wall on writes seemed strikingly low to me given the amount of free memory on the system. Reducing the data written to just 128 GiB makes the throughput climb to about ~13 GiB/s for buffered IO... but doubling that to 256 GiB gets us only 9,276 MiB/s. So even when the available RAM is twice the amount of data written, we hit a throughput wall which is *still* astronomically far from direct IO on this setup.

Should we talk about these things at LSFMM? Are folks interested?

[0] https://lwn.net/Articles/806980/
[1] https://patchwork.kernel.org/project/linux-fsdevel/cover/20191211152943.2933-1-axboe@xxxxxxxxx/

  Luis