On Thu, 30 May 2024, Trond Myklebust wrote:
> On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
> > Hello,
> >
> > so I was investigating why random writes to a large file over NFS got
> > noticeably slower. The workload we use to test this is this fio
> > command:
> >
> > fio --direct=0 --ioengine=sync --thread --directory=/mnt --invalidate=1 \
> >     --group_reporting=1 --runtime=300 --fallocate=posix --ramp_time=10 \
> >     --name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite \
> >     --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
> >     --filename_format='FioWorkloads.$jobnum'
> >
> > Eventually I've tracked down the regression to be caused by commit
> > 6df25e58532b ("nfs: remove reliance on bdi congestion"), which changed
> > the congestion mechanism from the generic bdi congestion handling to an
> > NFS-private one. Before this commit the fio achieved a throughput of
> > 180 MB/s, after this commit only 120 MB/s. Now part of the regression
> > was actually caused by inefficient fsync(2) and the fact that more
> > dirty data was cached at the time of the last fsync after commit
> > 6df25e58532b. After fixing fsync [1], the throughput got to 150 MB/s,
> > so better, but still not quite the throughput before 6df25e58532b.
> >
> > The reason for the remaining regression is that the bdi congestion
> > handling was broken and the client happily had ~8GB of outstanding IO
> > against the server despite the congestion limit being 256 MB. The new
> > congestion handling actually works, but as a result the server does not
> > have enough dirty data to operate on efficiently and the server disk
> > often gets idle before the client can send more.
> >
> > I wanted to discuss possible solutions here.
> >
> > Generally 256MB is not enough even for consumer-grade contemporary
> > disks to max out throughput. There is a tunable,
> > /proc/sys/fs/nfs/nfs_congestion_kb. If I tweak it to, say, 1GB, that is
> > enough to give the server enough data to saturate the disk (most of the
> > time) and fio reaches 180MB/s as before commit 6df25e58532b. So one
> > solution to the problem would be to change the default of
> > nfs_congestion_kb to 1GB.
> >
> > The general problem with this tuning is that faster disks may need an
> > even larger nfs_congestion_kb, and the NFS client has no way of knowing
> > what the right value of nfs_congestion_kb is. I personally find the
> > concept of the client throttling writes to the server flawed. The
> > *server* should push back (or throttle) if the client is pushing out
> > data too aggressively, and then the client can react to this
> > backpressure. Because only the server knows how much it can handle
> > (also given the load from other clients). And I believe this is
> > actually what happens in practice (e.g. when I tune nfs_congestion_kb
> > to a really high number). So I think an even better solution may be to
> > just remove the write congestion handling from the client completely.
> > The history before commit 6df25e58532b, when congestion was effectively
> > ignored, shows that this is unlikely to cause any practical problems.
> > What do people think?
>
> I think we do still need a mechanism to prevent the client from pushing
> more writebacks into the RPC layer when the server throttling is
> causing RPC transmission queues to build up.
> Otherwise we end up increasing the latency when the application is
> trying to do more I/O to pages that are queued up for writeback in the
> RPC layer (since the latter will be write locked).

If latency is the issue, could we make the throttling time-based? So we
throttle if the oldest queued request is more than X seconds old?

How easy is it to find the oldest queued request?

NeilBrown
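
To make the "oldest queued request" question concrete, here is a minimal
userspace sketch of the time-based idea. None of this is real NFS client
code; struct wb_queue, wb_enqueue() and nfs_should_throttle() are
made-up names used purely for illustration. The point is only that if
queued writeback requests sit on a FIFO list and are stamped when
queued, the oldest request is always at the head, so the throttle check
itself is O(1):

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct wb_request {
	struct wb_request *next;
	time_t queued_at;		/* when the request entered the queue */
};

struct wb_queue {
	struct wb_request *head;	/* oldest queued request */
	struct wb_request *tail;	/* newest queued request */
};

/* Stamp the request and append it to the tail of the FIFO. */
static void wb_enqueue(struct wb_queue *q, struct wb_request *req)
{
	req->queued_at = time(NULL);
	req->next = NULL;
	if (q->tail)
		q->tail->next = req;
	else
		q->head = req;
	q->tail = req;
}

/* Throttle if the request at the head (the oldest) is older than max_age. */
static bool nfs_should_throttle(const struct wb_queue *q, time_t max_age)
{
	if (!q->head)
		return false;
	return time(NULL) - q->head->queued_at > max_age;
}

int main(void)
{
	struct wb_queue q = { NULL, NULL };
	struct wb_request req;

	wb_enqueue(&q, &req);
	printf("throttle now? %s\n",
	       nfs_should_throttle(&q, 5) ? "yes" : "no");
	return 0;
}

Whether the real RPC transmission queues preserve this kind of FIFO
ordering cheaply enough is exactly the open question above.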