Hi Timo-

> On Oct 29, 2021, at 7:47 AM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>
> On 20/08/2021 17:12, Chuck Lever III wrote:
>> OK, I think the issue with this reproducer was resolved
>> completely with 6820bf77864d.
>> I went back and reviewed the traces from when the client got
>> stuck after a long uptime. This looks very different from
>> what we're seeing with 6820bf77864d. It involves CB_PATH_DOWN
>> and BIND_CONN_TO_SESSION, which is a different scenario. Long
>> story short, I don't think we're getting any more value by
>> leaving 6820bf77864d reverted.
>> Can you re-apply that commit on your server, and then when
>> the client hangs again, please capture with:
>> # trace-cmd record -e nfsd -e sunrpc -e rpcrdma
>> I'd like to see why the client's BIND_CONN_TO_SESSION fails
>> to repair the backchannel session.
>
> Happened again today, after a long time of no issues.
> Still on 5.12.19, since the system did not have a chance for a bigger maintenance window yet.
>
> Attached are traces from both client and server, while the client is trying to do the usual xfs_io copy_range.
> The system also has a bunch of other users and nodes working on it at this time, so there's a good chance for unrelated noise in the traces.
>
> The affected client is 10.110.10.251.
> Other clients are working just fine, it's only this one client that's affected.
>
> There was also quite a bit of heavy IO work going on on the Cluster, which I think coincides with the last couple times this happened as well.
> <nfstrace.tar.xz>

Thanks for the report. We believe this issue has been addressed in v5.15-rc:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=02579b2ff8b0becfb51d85a975908ac4ab15fba8

--
Chuck Lever