Re: NFS write congestion size

On Thu, 30 May 2024, Trond Myklebust wrote:
> On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
> > Hello,
> > 
> > so I was investigating why random writes to a large file over NFS got
> > noticeably slower. The workload we use to test this is this fio command:
> > 
> > fio --direct=0 --ioengine=sync --thread --directory=/mnt --invalidate=1 \
> >     --group_reporting=1 --runtime=300 --fallocate=posix --ramp_time=10 \
> >     --name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite \
> >     --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
> >     --filename_format='FioWorkloads.$jobnum'
> > 
> > Eventually I tracked the regression down to 6df25e58532b
> > ("nfs: remove reliance on bdi congestion"), which changed the congestion
> > mechanism from the generic bdi congestion handling to an NFS-private one.
> > Before this commit fio achieved a throughput of 180 MB/s, after this
> > commit only 120 MB/s. Now, part of the regression was actually caused by
> > inefficient fsync(2) and the fact that more dirty data was cached at the
> > time of the last fsync after commit 6df25e58532b. After fixing fsync [1],
> > the throughput got to 150 MB/s, so better, but still not quite the
> > throughput seen before 6df25e58532b.
> > 
> > The reason for the remaining regression is that the bdi congestion
> > handling was broken and the client happily had ~8 GB of outstanding IO
> > against the server even though the congestion limit was 256 MB. The new
> > congestion handling actually works, but as a result the server does not
> > have enough dirty data to operate on efficiently and the server disk
> > often goes idle before the client can send more.
> > 
> > I wanted to discuss possible solutions here.
> > 
> > Generally, 256 MB is not enough even for contemporary consumer-grade
> > disks to max out throughput. There is a tunable,
> > /proc/sys/fs/nfs/nfs_congestion_kb. If I tweak it to, say, 1 GB, that is
> > enough to give the server enough data to saturate the disk (most of the
> > time) and fio reaches 180 MB/s as before commit 6df25e58532b. So one
> > solution to the problem would be to change the default of
> > nfs_congestion_kb to 1 GB.
> > 
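(Side note for anyone reproducing this: the tunable is in kilobytes, so
1 GB corresponds to 1048576. The "tuning" amounts to nothing more than
echoing that value into the proc file; a throwaway snippet like the one
below does the same thing and is purely illustrative, not part of any
patch.)

/* Illustrative only: raise nfs_congestion_kb to 1 GB (value is in KiB).
 * Equivalent to: echo 1048576 > /proc/sys/fs/nfs/nfs_congestion_kb
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/fs/nfs/nfs_congestion_kb", "w");

	if (!f) {
		perror("nfs_congestion_kb");
		return 1;
	}
	fprintf(f, "%d\n", 1024 * 1024);	/* 1 GB expressed in KiB */
	return fclose(f) ? 1 : 0;
}
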
> > Generally, the problem with this tuning is that faster disks may need an
> > even larger nfs_congestion_kb, and the NFS client has no way of knowing
> > what the right value of nfs_congestion_kb is. I personally find the
> > concept of the client throttling writes to the server flawed. The
> > *server* should push back (or throttle) if the client is pushing out
> > data too aggressively, and then the client can react to this
> > backpressure, because only the server knows how much it can handle
> > (also given the load from other clients). And I believe this is actually
> > what is happening in practice (e.g. when I tune nfs_congestion_kb to a
> > really high number). So I think an even better solution may be to just
> > remove the write congestion handling from the client completely. The
> > history before commit 6df25e58532b, when congestion was effectively
> > ignored, shows that this is unlikely to cause any practical problems.
> > What do people think?
> 
> I think we do still need a mechanism to prevent the client from pushing
> more writebacks into the RPC layer when the server throttling is
> causing RPC transmission queues to build up. Otherwise we end up
> increasing the latency when the application is trying to do more I/O to
> pages that are queued up for writeback in the RPC layer (since the
> latter will be write locked).

If latency is the issue, could we make the throttling time-based?
So we throttle if the oldest queued request is more than X seconds old?
How easy is it to find the oldest queued request?
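
To make that concrete, here is a rough sketch of the policy I have in
mind. It is purely illustrative userspace-style C, every name is made up,
and it assumes requests complete in the order they were queued so that
the oldest one is simply the head of a FIFO:

/*
 * Illustrative sketch of time-based write throttling -- not NFS client
 * code.  Outstanding writes sit on a FIFO, so the oldest request is the
 * head and the age check is O(1).  Assumes in-order completion.
 */
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define THROTTLE_AGE_SECS 2	/* the "X seconds" knob */

struct queued_write {
	time_t queued_at;		/* when the request was queued */
	struct queued_write *next;
};

static struct queued_write *oldest;		/* head of the FIFO */
static struct queued_write **newest = &oldest;	/* tail link */

/* Throttle new writeback only if the oldest request has waited too long. */
static bool should_throttle(void)
{
	return oldest && time(NULL) - oldest->queued_at > THROTTLE_AGE_SECS;
}

static void write_queued(struct queued_write *req)
{
	req->queued_at = time(NULL);
	req->next = NULL;
	*newest = req;
	newest = &req->next;
}

static void write_completed(void)
{
	oldest = oldest->next;		/* in-order completion assumed */
	if (!oldest)
		newest = &oldest;
}

Completions obviously don't arrive in order in practice, so the real
thing would need a timestamp-ordered structure or a scan to find the
oldest outstanding request, which is exactly what my last question
above is about.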

NeilBrown



