> On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> 
> On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>> 
>> On 18.02.2021 19:30, Olga Kornievskaia wrote:
>>> Thank you for getting tracepoints from a busy server, but can you get
>>> more? As suspected, the server is having issues sending the callback.
>>> I'm not sure why. Any chance to turn on the server's tracepoints,
>>> probably both the sunrpc and rdma tracepoints? I wonder if we can get
>>> any more info about why it's failing.
>> 
>> I isolated out two of the machines on that cluster, one acting as NFS
>> server exporting an ext4 mount, the other being the same client as before.
>> That way I managed to capture a trace and ibdump of an entire cycle:
>> mount + successful copy + 5 minutes later a copy that got stuck.
>> 
>> Next to no noise happened during those traces; you can find them attached.
>> 
>> Another observation made in the process: unmounting and re-mounting the
>> NFS share also gets it back into working condition for a while, no
>> reboot necessary.
>> During this trace, I got "lucky", and after just 5 minutes of waiting,
>> it got stuck.
>> 
>> Before that, I had a run of mount + trying to copy every 5 minutes where
>> it ran for 45 minutes without getting stuck, at which point I decided to
>> remount once more.
> 
> Timo, thank you for gathering the debug info.
> 
> Chuck, I need your help. Why would the server lose a callback channel?
> 
>   <...>-1461944 [001] 2076465.200151: rpc_request:
>         task:57752@6 nfs4_cbv1 CB_OFFLOAD (async)
>   <...>-1461944 [001] 2076465.200151: rpc_task_run_action:
>         task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>         runstate=RUNNING|ACTIVE status=0 action=call_reserve
>   <...>-1461944 [001] 2076465.200154: xprt_reserve:
>         task:57752@6 xid=0x00a0aaf9
>   <...>-1461944 [001] 2076465.200155: rpc_task_run_action:
>         task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>         runstate=RUNNING|ACTIVE status=0 action=call_reserveresult
>   <...>-1461944 [001] 2076465.200156: rpc_task_run_action:
>         task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>         runstate=RUNNING|ACTIVE status=0 action=call_refresh
>   <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
>         task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>         runstate=RUNNING|ACTIVE status=0 action=call_refreshresult
>   <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
>         task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>         runstate=RUNNING|ACTIVE status=0 action=call_allocate
>   <...>-1461944 [001] 2076465.200168: rpc_buf_alloc:
>         task:57752@6 callsize=548 recvsize=104 status=0
>   <...>-1461944 [001] 2076465.200168: rpc_task_run_action:
>         task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>         runstate=RUNNING|ACTIVE status=0 action=call_encode
>   <...>-1461944 [001] 2076465.200173: rpc_task_run_action:
>         task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>         runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0
>         action=call_connect
>   <...>-1461944 [001] 2076465.200174: rpc_call_rpcerror:
>         task:57752@6 tk_status=-107 rpc_status=-107
>   <...>-1461944 [001] 2076465.200174: rpc_task_run_action:
>         task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
>         runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107
>         action=rpc_exit_task
> 
> It's reporting ENOTCONN. I'm not really sure whether this is related to
> copy offload itself; perhaps it's more about doing callbacks over RDMA/IB.

If the client is awaiting a COPY notification callback, I can see why a
lost CB channel would cause the client to wait indefinitely. The copy
mechanism has to deal with this contingency...
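For what it's worth, the server-side tracepoints Olga asked about can all
be captured in one go while reproducing the hang. A rough sketch, assuming
tracefs is mounted at /sys/kernel/tracing (older kernels use
/sys/kernel/debug/tracing) and that trace-cmd is installed:

    # on the NFS server, before starting a copy that is expected to hang
    trace-cmd record -e sunrpc -e rpcrdma

    # or, without trace-cmd, stream the events by hand:
    echo 1 > /sys/kernel/tracing/events/sunrpc/enable
    echo 1 > /sys/kernel/tracing/events/rpcrdma/enable
    cat /sys/kernel/tracing/trace_pipe > server-side.trace

Stop the capture once the copy has hung; with trace-cmd, "trace-cmd report"
turns the recorded data into readable text.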
Perhaps the server could force a disconnect so that the client and server
have an opportunity to re-establish the callback channel. <shrug>

In any event, the trace log above shows nothing more than "hey, it's not
working." Are there any rpcrdma trace events we can look at to determine
why the backchannel is getting lost?

--
Chuck Lever
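As for which rpcrdma trace events there are to look at: the set varies
between kernel versions, so the simplest thing is to list what the server's
running kernel actually exposes. A sketch, again assuming tracefs is
mounted at /sys/kernel/tracing:

    # list the rpcrdma trace events built into the running kernel
    ls /sys/kernel/tracing/events/rpcrdma/

    # or, with trace-cmd:
    trace-cmd list -e rpcrdma

Individual events from that list can then be enabled instead of the whole
rpcrdma group if a full capture turns out to be too noisy.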