On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> I recently ran a different type of simple test, focused on sequential writes
> to fill capacity, with write workload essentially matching your RAM, so
> having parity with your RAM. Technically in the case of max size that I
> tested the writes were just *slightly* over the RAM, that's a minor
> technicality given I did other tests with similar sizes which showed similar
> results... This test should be possible to reproduce then if you have more
> than enough RAM to spare. In this case the system uses 1 TiB RAM, using
> pmem to avoid drive variance / GC / and other drive shenanigans.
>
> So pmem grub setup:
>
> memmap=500G!4G memmap=3G!504G
>
> As noted earlier, surely, DIO / DAX is best for pmem (and I actually get
> a difference between using just DIO and DAX, but that digresses), but
> when one is wishing to test buffered IO on purpose it makes sense to do
> this. Yes, we can test tmpfs too... but I believe that topic will be
> brought up at LSFMM separately. The delta with DIO and buffered IO on
> XFS is astronomical:
>
> ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> Vs
> ~ 7,000 MiB/s with buffered IO

You're not testing apples to apples. Buffered writes to the same
superblock serialise on IO submission, not on write() calls, so it
doesn't matter how much concurrency you have in write() syscalls. That
is, streaming buffered write throughput is entirely limited by the
number of IOs that the bdi flusher thread can submit.

For ext4, XFS and btrfs, delayed allocation means that this writeback
thread is also doing extent allocation for all IO, and hence the single
writeback thread for buffered writes is the performance limiting factor
for them. It doesn't matter how fast you can copy data into the kernel;
the page cache can only drain as fast as that thread can submit IO. As
soon as the writeback thread is CPU bound, incoming buffered write()s
will be throttled back to the rate at which memory can be cleaned by
the writeback thread.

Direct IO doesn't have this limitation - it's an orange in comparison
because IO is always submitted by the task that does the write()
syscall. Hence it inherently scales out to the limit of the underlying
hardware and is not limited by the throughput of a single CPU like page
cache writeback is.

If you wonder why people are saying "issue sync_file_range()
periodically" to improve buffered write throughput, it's because it
moves the async writeback submission for that inode out of the single
background writeback thread and into task context, where IO submission
can be trivially parallelised. Just like direct IO....

IOWs, the issue you are demonstrating is the inherent limitation of
single threaded write-behind cache flushing, and the solution to that
specific bottleneck is to enable concurrent writeback submission from
the same file and/or superblock via the various manual mechanisms that
are available.

An automatic way of doing this for large streaming writes is to switch
from write-behind to near-write-through, such that the majority of
write IO is submitted asynchronously from the write() syscall. Think of
how readahead from read() context pulls in data that is likely to be
needed soon - sequential writes should trigger similar behaviour, where
we do async write-behind of the previous write()s in the context of the
current write. Track a sequential write window like we do for
readahead, and trigger async writeback for such streaming writes from
the write() context...
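To make the manual version of that concrete, here's a minimal userspace
sketch of the sync_file_range() pattern described above: a streaming
buffered writer that kicks off async writeback of each completed window
from its own write() context instead of leaving all submission to the
single bdi flusher thread. The file path, 64MB window and 4GB total are
arbitrary choices for illustration, not anything from the test above.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

#define WINDOW	(64UL << 20)	/* flush window: 64MB, tunable */
#define CHUNK	(1UL << 20)	/* write() size: 1MB */

int main(void)
{
	off_t off = 0, flushed = 0;
	char *buf;
	int fd;

	fd = open("/mnt/test/streamfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		err(1, "open");

	buf = malloc(CHUNK);
	if (!buf)
		err(1, "malloc");
	memset(buf, 0xa5, CHUNK);

	/* write ~4GB sequentially through the page cache */
	while (off < (off_t)(4UL << 30)) {
		if (write(fd, buf, CHUNK) != CHUNK)
			err(1, "write");
		off += CHUNK;

		/*
		 * Once a full window of dirty data is behind us, start
		 * async writeback for it from this task. Submission now
		 * happens in write() context and scales with the number
		 * of writer tasks, rather than being serialised behind
		 * the bdi flusher thread.
		 */
		if (off - flushed >= (off_t)WINDOW) {
			if (sync_file_range(fd, flushed, off - flushed,
					    SYNC_FILE_RANGE_WRITE) < 0)
				err(1, "sync_file_range");
			flushed = off;
		}
	}

	free(buf);
	close(fd);
	return 0;
}

A common extension of this pattern is to also wait on and drop the
window before the one just submitted (SYNC_FILE_RANGE_WAIT_BEFORE plus
posix_fadvise(POSIX_FADV_DONTNEED)) so a streaming write never builds
up a huge dirty cache footprint, but the SYNC_FILE_RANGE_WRITE call
alone is what moves IO submission out of the flusher thread.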
That doesn't solve the huge tarball problem, where we create millions
of small files in a couple of seconds and then have to wait for single
threaded writeback to drain them to the storage at 50,000 files/s. We
can create files and get the data into the cache far faster and with
way more concurrency than the page cache can push the data back to the
storage itself.

IOWs, the problems with page cache write throughput really have
nothing to do with write() scalability, folios or filesystem block
sizes. The fundamental problem is single-threaded writeback IO
submission, and that it throttles incoming writes to whatever speed it
runs at when it is CPU bound....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx