This picks up a discussion we had at bakeathon, so I'll try to summarize quickly. There have been new reports of TEST_STATEID storms where clients spend all of their CPU and network resources sending TEST_STATEID. In the network captures we see both SEQ4_STATUS_RECALLABLE_STATE_REVOKED and SEQ4_STATUS_CB_PATH_DOWN. This time we can see that the NFS server really is seeing the callback channel drop, and we see -ERESTARTSYS from nfsd_cb_done and -EINVAL from nfsd_cb_setup_err. I think the server may be spuriously shutting down the callback rpc_client, which does rpc_killall_tasks for any pending callbacks.

I started playing with the upstream client and noticed that if the client is idle with nconnect > 1, the XS_IDLE_DISC_TO idle timeout (5 minutes) can take down the connection that carries the callback channel for v4.1. We recently prioritized this first connection; perhaps we can also disable the idle timeout for it (a rough, untested sketch is below my sign-off).

There's also some weird behavior with nconnect=16: we only get 12 connections at first, then my client usually primes only 5 of them with a SEQUENCE within the next 5 minutes, the callback connection gets torn down, and then the client re-connects all 16 again. This whole situation makes delegations a huge net loss in this setup.

Can anyone remember why we wanted XS_IDLE_DISC_TO back in the single-connection TCP days?

Ben
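
Here's roughly what I had in mind for disabling the idle timeout on the first connection. Completely untested sketch: it leans on the fact that the autoclose timer is only armed while xprt->idle_timeout is non-zero (see xprt_has_timer()), and it assumes xprt->main is a good enough test for "the transport the v4.1 backchannel rides on"; the helper name is made up and the call site is hand-waved. If I'm remembering right, we could instead just pass XPRT_CREATE_NO_IDLE_TIMEOUT when that transport gets created.

	#include <linux/sunrpc/xprt.h>

	/*
	 * Pin the first nconnect transport open by disabling the 5 minute
	 * idle autoclose.  xprt_has_timer() treats idle_timeout == 0 as
	 * "no autodisconnect timer", so this keeps the connection carrying
	 * the v4.1 backchannel from being torn down while idle.
	 */
	static void xs_pin_backchannel_xprt(struct rpc_xprt *xprt)
	{
		/* assumption: xprt->main marks the first transport */
		if (xprt->main)
			xprt->idle_timeout = 0;	/* never arm the autoclose timer */
	}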