On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote: > > On 18.02.2021 19:30, Olga Kornievskaia wrote: > > Thank you for getting tracepoints from a busy server but can you get > > more? As suspected, the server is having issues sending the callback. > > I'm not sure why. Any chance to turn on the server's sunrpc > > tracespoints, probably both sunrpc and rdmas tracepoints, I wonder if > > we can any more info about why it's failing? > > I isolated out two of the machines on that cluster now, one acting as > NFS server from an ext4 mount, the other is the same client as before. > That way I managed to capture a trace and ibdump of an entire cycle: > mount + successful copy + 5 minutes later a copy that got stuck > > Next to no noise happened during those traces, you can find them attached. > > Another observation made due to this: unmount and re-mounting the NFS > share also gets it back into working condition for a while, no reboot > necessary. > During this trace, I got "lucky", and after just 5 minutes of waiting, > it got stuck. > > Before that, I had a run of mount + trying to copy every 5 minutes where > it ran for 45 minutes without getting stuck. At which point I decided to > remount once more. Timo, thank you for gathering the debug info. Chuck, I need your help. Why would the server lose a callback channel? <...>-1461944 [001] 2076465.200151: rpc_request: task:57752@6 nfs4_cbv1 CB_OFFLOAD (async) <...>-1461944 [001] 2076465.200151: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_reserve <...>-1461944 [001] 2076465.200154: xprt_reserve: task:57752@6 xid=0x00a0aaf9 <...>-1461944 [001] 2076465.200155: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_reserveresult <...>-1461944 [001] 2076465.200156: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_refresh <...>-1461944 [001] 2076465.200163: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_refreshresult <...>-1461944 [001] 2076465.200163: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_allocate <...>-1461944 [001] 2076465.200168: rpc_buf_alloc: task:57752@6 callsize=548 recvsize=104 status=0 <...>-1461944 [001] 2076465.200168: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_encode <...>-1461944 [001] 2076465.200173: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect <...>-1461944 [001] 2076465.200174: rpc_call_rpcerror: task:57752@6 tk_status=-107 rpc_status=-107 <...>-1461944 [001] 2076465.200174: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107 action=rpc_exit_task It's reporting ENOTCON. I'm not really sure if this is related to copy offload but more perhaps doing callbacks over RDMA/IB.