This picks up a discussion we had at bakeathon, so I'll try to summarize quickly. There have been new reports of TEST_STATEID storms where clients spend all of their CPU and network resources sending TEST_STATEID. In the network captures we see both SEQ4_STATUS_RECALLABLE_STATE_REVOKED and SEQ4_STATUS_CB_PATH_DOWN. This time we can see that the NFS server really is seeing the callback channel drop, and we see -ERESTARTSYS from nfsd_cb_done and -EINVAL from nfsd_cb_setup_err. I think the server may be spuriously shutting down the callback rpc_client, which does rpc_killall_tasks for any pending callbacks.

I started playing with the upstream client and noticed that if the client is idle with nconnect > 1, the XS_IDLE_DISC_TO idle timeout (5 minutes) can take down the connection that carries the callback channel for v4.1. We recently prioritized this first connection; perhaps we can also disable the idle timeout for it (a rough, untested sketch is below my sign-off).

There's also some weird behavior with nconnect=16: we only get 12 connections at first, then my client usually primes only 5 of them with a SEQUENCE within the next 5 minutes, the callback connection gets torn down, and then the client re-connects all 16 again. This whole situation makes delegations a huge net loss in this setup.

Can anyone remember why we wanted XS_IDLE_DISC_TO back in the single-connection TCP days?

Ben
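
Here's roughly what I had in mind for disabling the idle timeout on the first connection. Completely untested sketch: it leans on the fact that the autoclose timer is only armed while xprt->idle_timeout is non-zero (see xprt_has_timer()), and it assumes xprt->main is a good enough test for "the transport the v4.1 backchannel rides on"; the helper name is made up and the call site is hand-waved. If I'm remembering right, we could instead just pass XPRT_CREATE_NO_IDLE_TIMEOUT when that transport gets created.

	#include <linux/sunrpc/xprt.h>

	/*
	 * Pin the first nconnect transport open by disabling the 5 minute
	 * idle autoclose.  xprt_has_timer() treats idle_timeout == 0 as
	 * "no autodisconnect timer", so this keeps the connection carrying
	 * the v4.1 backchannel from being torn down while idle.
	 */
	static void xs_pin_backchannel_xprt(struct rpc_xprt *xprt)
	{
		/* assumption: xprt->main marks the first transport */
		if (xprt->main)
			xprt->idle_timeout = 0;	/* never arm the autoclose timer */
	}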