On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> I recently ran a different type of simple test, focused on sequential writes
> to fill capacity, with write workload essentially matching your RAM, so
> having parity with your RAM. Technically in the case of max size that I
> tested the writes were just *slightly* over the RAM, that's a minor
> technicality given I did other tests with similar sizes which showed similar
> results... This test should be possible to reproduce then if you have more
> than enough RAM to spare. In this case the system uses 1 TiB RAM, using
> pmem to avoid drive variance / GC / and other drive shenanigans.
>
> So pmem grub setup:
>
> memmap=500G!4G memmap=3G!504G
>
> As noted earlier, surely, DIO / DAX is best for pmem (and I actually get
> a difference between using just DIO and DAX, but that digresses), but
> when one is wishing to test buffered IO on purpose it makes sense to do
> this. Yes, we can test tmpfs too... but I believe that topic will be
> brought up at LSFMM separately. The delta with DIO and buffered IO on
> XFS is astronomical:
>
> ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> Vs
> ~ 7,000 MiB/s with buffered IO

You're not testing apples to apples. Buffered writes to the same
superblock serialise on IO submission, not on write() calls, so it
doesn't matter how much concurrency you have in write() syscalls. That
is, streaming buffered write throughput is entirely limited by the
number of IOs that the bdi flusher thread can submit.

For ext4, XFS and btrfs, delayed allocation means that this writeback
thread is also doing extent allocation for all IO, and hence the single
writeback thread for buffered writes is the performance limiting factor
for them. It doesn't matter how fast you can copy data into the kernel;
the page cache can only drain as fast as that thread can submit IO. As
soon as the writeback thread is CPU bound, incoming buffered write()s
will be throttled back to the rate at which memory can be cleaned by
the writeback thread.

Direct IO doesn't have this limitation - it's an orange in comparison
because IO is always submitted by the task that does the write()
syscall. Hence it inherently scales out to the limit of the underlying
hardware and is not limited by the throughput of a single CPU like page
cache writeback is.

If you wonder why people are saying "issue sync_file_range()
periodically" to improve buffered write throughput, it's because it
moves the async writeback submission for that inode out of the single
background writeback thread and into task context, where IO submission
can be trivially parallelised. Just like direct IO....

IOWs, the issue you are demonstrating is the inherent limitation of
single threaded write-behind cache flushing, and the solution to that
specific bottleneck is to enable concurrent writeback submission from
the same file and/or superblock via the various manual mechanisms that
are available.

An automatic way of doing this for large streaming writes is to switch
from write-behind to near-write-through, such that the majority of
write IO is submitted asynchronously from the write() syscall. Think of
how readahead from read() context pulls in data that is likely to be
needed soon - sequential writes should trigger similar behaviour, where
we do async write-behind of the previous write()s in the context of the
current write. Track a sequential write window like we do for
readahead, and trigger async writeback for such streaming writes from
the write() context...
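To make the manual version of that concrete, here's a minimal userspace
sketch of the sync_file_range() pattern described above: a streaming
buffered writer that kicks off async writeback of each completed window
from its own write() context instead of leaving all submission to the
single bdi flusher thread. The file path, 64MB window and 4GB total are
arbitrary choices for illustration, not anything from the test above.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

#define WINDOW	(64UL << 20)	/* flush window: 64MB, tunable */
#define CHUNK	(1UL << 20)	/* write() size: 1MB */

int main(void)
{
	off_t off = 0, flushed = 0;
	char *buf;
	int fd;

	fd = open("/mnt/test/streamfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		err(1, "open");

	buf = malloc(CHUNK);
	if (!buf)
		err(1, "malloc");
	memset(buf, 0xa5, CHUNK);

	/* write ~4GB sequentially through the page cache */
	while (off < (off_t)(4UL << 30)) {
		if (write(fd, buf, CHUNK) != CHUNK)
			err(1, "write");
		off += CHUNK;

		/*
		 * Once a full window of dirty data is behind us, start
		 * async writeback for it from this task. Submission now
		 * happens in write() context and scales with the number
		 * of writer tasks, rather than being serialised behind
		 * the bdi flusher thread.
		 */
		if (off - flushed >= (off_t)WINDOW) {
			if (sync_file_range(fd, flushed, off - flushed,
					    SYNC_FILE_RANGE_WRITE) < 0)
				err(1, "sync_file_range");
			flushed = off;
		}
	}

	free(buf);
	close(fd);
	return 0;
}

A common extension of this pattern is to also wait on and drop the
window before the one just submitted (SYNC_FILE_RANGE_WAIT_BEFORE plus
posix_fadvise(POSIX_FADV_DONTNEED)) so a streaming write never builds
up a huge dirty cache footprint, but the SYNC_FILE_RANGE_WRITE call
alone is what moves IO submission out of the flusher thread.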
That doesn't solve the huge tarball problem, where we create millions
of small files in a couple of seconds and then have to wait for single
threaded writeback to drain them to the storage at 50,000 files/s. We
can create files and get the data into the cache far faster and with
way more concurrency than the page cache can push the data back to the
storage itself.

IOWs, the problems with page cache write throughput really have
nothing to do with write() scalability, folios or filesystem block
sizes. The fundamental problem is single-threaded writeback IO
submission, and that it throttles incoming writes to whatever speed it
runs at when it is CPU bound....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx