On 30/05/2024 14:13, Trond Myklebust wrote:
On Thu, 2024-05-30 at 13:01 +0300, Sagi Grimberg wrote:
On 30/05/2024 11:31, Jan Kara wrote:
On Thu 30-05-24 10:44:01, Sagi Grimberg wrote:
On 29/05/2024 20:05, Trond Myklebust wrote:
On Wed, 2024-05-29 at 18:11 +0200, Jan Kara wrote:
Hello,
so I was investigating why random writes to a large file over NFS got
noticeably slower. The workload we use to test this is this fio command:
fio --direct=0 --ioengine=sync --thread --directory=/mnt --invalidate=1 \
    --group_reporting=1 --runtime=300 --fallocate=posix --ramp_time=10 \
    --name=RandomWrites-async-257024-4k-4 --new_group --rw=randwrite \
    --size=32000m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 \
    --filename_format='FioWorkloads.$jobnum'
Eventually I tracked the regression down to commit 6df25e58532b
("nfs: remove reliance on bdi congestion"), which changed the congestion
mechanism from generic bdi congestion handling to an NFS-private one.
Before this commit fio achieved a throughput of 180 MB/s, after it only
120 MB/s. Now, part of the regression was actually caused by inefficient
fsync(2) and by the fact that more dirty data was cached at the time of
the last fsync after commit 6df25e58532b. After fixing fsync [1], the
throughput got to 150 MB/s, so better, but still not quite the throughput
from before 6df25e58532b.
The reason for the remaining regression is that the bdi congestion
handling was broken: the client happily had ~8GB of outstanding IO
against the server despite the congestion limit being 256 MB. The new
congestion handling actually works, but as a result the server does not
have enough dirty data to operate on efficiently, and the server disk
often goes idle before the client can send more.
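For context, the NFS-private mechanism amounts to per-server accounting of
outstanding writeback bytes against a fixed threshold. A simplified sketch of
the idea (hypothetical names; the real code in fs/nfs/write.c uses atomic
counters and per-bdi congestion flags, not a Python condition variable):

```python
import threading

class WritebackThrottle:
    """Toy model of per-server write congestion: writers block once
    outstanding writeback bytes exceed the limit, and are woken as
    writes complete.  Hypothetical interface, not the kernel's."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.outstanding = 0
        self.cond = threading.Condition()

    def begin_write(self, nbytes):
        with self.cond:
            while self.outstanding >= self.limit:
                self.cond.wait()          # throttled: too much in flight
            self.outstanding += nbytes

    def end_write(self, nbytes):
        with self.cond:
            self.outstanding -= nbytes
            self.cond.notify_all()        # unblock throttled writers
```

With a 256 MB limit this model shows the behaviour described above: once the
client has that much writeback in flight, further writers stall until the
server completes some of it.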
I wanted to discuss possible solutions here.
Generally, 256MB is not enough even for contemporary consumer-grade
disks to max out their throughput. There is a tunable,
/proc/sys/fs/nfs/nfs_congestion_kb. If I tweak it to, say, 1GB, that is
enough to give the server enough data to saturate the disk (most of the
time) and fio reaches 180MB/s as before commit 6df25e58532b. So one
solution to the problem would be to change the default of
nfs_congestion_kb to 1GB.
The general problem with this tuning is that faster disks may need an
even larger nfs_congestion_kb, and the NFS client has no way of knowing
what the right value of nfs_congestion_kb is. I personally find the
concept of the client throttling writes to the server flawed. The
*server* should push back (or throttle) if the client is pushing out
data too aggressively, and then the client can react to this
backpressure, because only the server knows how much it can handle
(also given the load from other clients). And I believe this is
actually what happens in practice (e.g. when I tune nfs_congestion_kb
to a really high number). So I think an even better solution may be to
just remove the write congestion handling from the client completely.
The history before commit 6df25e58532b, when congestion was effectively
ignored, shows that this is unlikely to cause any practical problems.
What do people think?
I think we do still need a mechanism to prevent the client from pushing
more writebacks into the RPC layer when the server throttling is
causing RPC transmission queues to build up. Otherwise we end up
increasing the latency when the application is trying to do more I/O to
pages that are queued up for writeback in the RPC layer (since the
latter will be write locked).
Plus the server is likely serving multiple clients, so removing any
type of congestion handling from the client may overwhelm the server.
I understand this concern, but before commit 6df25e58532b we
effectively didn't do any throttling for years and nobody complained.
Commit 6df25e58532b doesn't add throttling. It just converts from using
the bdi based mechanism to an NFS-specific one. That bdi based
throttling mechanism dates at least back to 2007, although there was
code before that dating back to pre-git history.
I don't know about that history, nor what people could have attributed
problems to.
So servers apparently know how to cope with clients sending too much IO
to them.
I'm not sure how an NFS server would cope with this. NFSv4 can reduce
slots, but I'm not sure what an NFSv3 server would do...
btw, I think you meant that *slower* devices may need a larger queue to
saturate, because if the device is fast, 256MB inflight is probably
enough... So you are solving for the "consumer grade contemporary
disks".
It is hard to do server side congestion control with UDP, since it does
not have a native congestion mechanism to leverage. However, connection
based transports are essentially a queuing mechanism as far as the
server is concerned. It can trivially push back on the client by
slowing down the rate at which it pulls RPC calls from the transport
(or stopping them altogether). That's a mechanism that works just fine
for both TCP and RDMA.
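The push-back described above can be seen with a toy socket pair: if the
receiver simply stops reading, the sender's buffers fill and further sends
block (or fail with EWOULDBLOCK in non-blocking mode). A minimal sketch,
using a local socketpair to stand in for the NFS transport:

```python
import socket

# A connected pair: the "client" sends RPC calls, the "server" decides
# when (and whether) to pull them off the transport.
server, client = socket.socketpair()
client.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
server.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
client.setblocking(False)

# The server throttles by simply not reading.  The client can only send
# until the socket buffers fill up.
sent = 0
try:
    while True:
        sent += client.send(b"x" * 4096)
except BlockingIOError:
    pass  # buffers full: the client is now throttled by the transport

# Once the server resumes pulling requests, the client makes progress again.
drained = server.recv(65536)
resumed = client.send(b"y" * 1024)
```

The same dynamic applies to a TCP connection between real hosts, where the
kernel's flow control (the advertised receive window) propagates the stalled
reader back to the sender.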
Yes, I agree, TCP throttling works. It just has some downsides. I was
referring to higher-level credit based flow control.
Additionally, NFSv4 has the session slot mechanism, and while that can
be used as a throttling mechanism, it is more about providing safe
only-once replay semantics.
OK, that is good to know. I had assumed it was designed to be used for
throttling.
A prudent server implementation would not rely on it to replace transport level throttling.
Oh, definitely not replace. But I get your point.
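For illustration, the session slot mechanism behaves like credit based flow
control: the server grants a fixed number of slots and the client may not
have more than that many RPCs in flight. A toy sketch (a plain semaphore;
real NFSv4 slot tables also carry per-slot sequence IDs, which is what
provides the only-once replay semantics mentioned above):

```python
import threading
import time

class SlotTable:
    """Toy model of session slots as credits: at most nr_slots RPCs
    in flight at once.  Illustrative only, not the kernel's code."""
    def __init__(self, nr_slots):
        self._slots = threading.BoundedSemaphore(nr_slots)

    def call(self, rpc):
        with self._slots:          # blocks until a slot is free
            return rpc()

# Demonstrate that concurrency never exceeds the slot count.
table = SlotTable(2)
active, peak = 0, 0
lock = threading.Lock()

def rpc():
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)               # pretend the server is working
    with lock:
        active -= 1

threads = [threading.Thread(target=table.call, args=(rpc,))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak is now at most 2, no matter how many callers there were
```

Shrinking or growing the slot count is then a natural lever for the server
to throttle an individual client, on top of the transport-level push-back.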