Hello,

I was investigating why random writes to a large file over NFS got
noticeably slower. The workload we use to test this is the following fio
command:

  fio --direct=0 --ioengine=sync --thread --directory=/mnt --invalidate=1 \
      --group_reporting=1 --runtime=300 --fallocate=posix --ramp_time=10 \
      --name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite \
      --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
      --filename_format='FioWorkloads.$jobnum'

Eventually I tracked the regression down to commit 6df25e58532b ("nfs:
remove reliance on bdi congestion"), which changed the congestion mechanism
from the generic bdi congestion handling to an NFS-private one. Before this
commit fio achieved a throughput of 180 MB/s, after it only 120 MB/s.

Part of the regression was actually caused by inefficient fsync(2) and by
the fact that more dirty data was cached at the time of the last fsync
after commit 6df25e58532b. After fixing fsync [1], the throughput got to
150 MB/s, so better, but still short of the throughput before
6df25e58532b. The reason for the remaining regression is that the bdi
congestion handling was broken: the client happily kept ~8 GB of
outstanding IO against the server even though the congestion limit was
256 MB. The new congestion handling actually works, but as a result the
server does not have enough dirty data to operate on efficiently and the
server disk often goes idle before the client can send more.

I wanted to discuss possible solutions here. Generally 256 MB is not
enough even for contemporary consumer-grade disks to max out their
throughput. There is the tunable /proc/sys/fs/nfs/nfs_congestion_kb. If I
raise it to, say, 1 GB, that gives the server enough data to saturate the
disk (most of the time) and fio reaches 180 MB/s as before commit
6df25e58532b. So one solution to the problem would be to change the
default of nfs_congestion_kb to 1 GB. The problem with this tuning is that
faster disks may need an even larger nfs_congestion_kb, and the NFS client
has no way of knowing what the right value is.

Personally, I find the concept of the client throttling writes to the
server flawed. The *server* should push back (or throttle) if the client
is pushing out data too aggressively, and the client can then react to
this backpressure, because only the server knows how much it can handle
(also given the load from other clients). And I believe this is actually
what happens in practice (e.g. when I tune nfs_congestion_kb to a really
high number). So I think an even better solution may be to just remove the
write congestion handling from the client completely. The history before
commit 6df25e58532b, when congestion was effectively ignored, suggests
this is unlikely to cause any practical problems.

What do people think?

								Honza

[1] https://lore.kernel.org/all/20240524161419.18448-1-jack@xxxxxxx

-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
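
PS: For completeness, the nfs_congestion_kb tweak described above boils
down to something like the following on the client (the tunable takes a
value in KiB, so 1 GiB is 1048576; the snippet is only an illustration of
the experiment, not a proposed default):

  # raise the NFS write congestion limit to ~1 GiB (value in KiB)
  echo 1048576 > /proc/sys/fs/nfs/nfs_congestion_kb
  # or equivalently via sysctl
  sysctl -w fs.nfs.nfs_congestion_kb=1048576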