> On Feb 20, 2021, at 4:03 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
> 
> On 19.02.2021 19:48, Chuck Lever wrote:
>>> On Feb 19, 2021, at 1:01 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>>> 
>>> On 19.02.2021 18:48, Chuck Lever wrote:
>>>>> On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>>>> 
>>>>> On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On 18.02.2021 19:30, Olga Kornievskaia wrote:
>>>>>>> Thank you for getting tracepoints from a busy server, but can you get
>>>>>>> more? As suspected, the server is having issues sending the callback,
>>>>>>> and I'm not sure why. Any chance to turn on the server's tracepoints,
>>>>>>> probably both sunrpc and rdma? I wonder if we can get any more info
>>>>>>> about why it's failing.
>>>>>> 
>>>>>> I isolated two of the machines on that cluster now, one acting as the
>>>>>> NFS server from an ext4 mount, the other being the same client as before.
>>>>>> That way I managed to capture a trace and ibdump of an entire cycle:
>>>>>> mount + successful copy + 5 minutes later a copy that got stuck.
>>>>>> 
>>>>>> Next to no noise happened during those traces; you can find them attached.
>>>>>> 
>>>>>> Another observation made along the way: unmounting and re-mounting the
>>>>>> NFS share also gets it back into working condition for a while, no
>>>>>> reboot necessary.
>>>>>> During this trace, I got "lucky", and after just 5 minutes of waiting,
>>>>>> it got stuck.
>>>>>> 
>>>>>> Before that, I had a run of mount + trying to copy every 5 minutes where
>>>>>> it ran for 45 minutes without getting stuck, at which point I decided to
>>>>>> remount once more.
>>>>> 
>>>>> Timo, thank you for gathering the debug info.
>>>>> 
>>>>> Chuck, I need your help. Why would the server lose a callback channel?
>>>>> 
>>>>> <...>-1461944 [001] 2076465.200151: rpc_request: task:57752@6 nfs4_cbv1 CB_OFFLOAD (async)
>>>>> <...>-1461944 [001] 2076465.200151: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_reserve
>>>>> <...>-1461944 [001] 2076465.200154: xprt_reserve: task:57752@6 xid=0x00a0aaf9
>>>>> <...>-1461944 [001] 2076465.200155: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_reserveresult
>>>>> <...>-1461944 [001] 2076465.200156: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_refresh
>>>>> <...>-1461944 [001] 2076465.200163: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_refreshresult
>>>>> <...>-1461944 [001] 2076465.200163: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_allocate
>>>>> <...>-1461944 [001] 2076465.200168: rpc_buf_alloc: task:57752@6 callsize=548 recvsize=104 status=0
>>>>> <...>-1461944 [001] 2076465.200168: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_encode
>>>>> <...>-1461944 [001] 2076465.200173: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect
>>>>> <...>-1461944 [001] 2076465.200174: rpc_call_rpcerror: task:57752@6 tk_status=-107 rpc_status=-107
>>>>> <...>-1461944 [001] 2076465.200174: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107 action=rpc_exit_task
>>>>> 
>>>>> It's reporting ENOTCONN. I'm not really sure if this is related to copy
>>>>> offload; perhaps it is more about doing callbacks over RDMA/IB.
>>>> 
>>>> If the client is awaiting a COPY notification callback, I can see why
>>>> a lost CB channel would cause the client to wait indefinitely.
>>>> The copy mechanism has to deal with this contingency... Perhaps the
>>>> server could force a disconnect so that the client and server have an
>>>> opportunity to re-establish the callback channel. <shrug>
>>>> 
>>>> In any event, the trace log above shows nothing more than "hey, it's
>>>> not working." Are there any rpcrdma trace events we can look at to
>>>> determine why the backchannel is getting lost?
>>> 
>>> The trace log attached to my previous mail has it enabled, along with
>>> nfsd and sunrpc.
>>> I'm attaching the two files again here.
>> 
>> Thanks, Timo.
>> 
>> The first CB_OFFLOAD callback succeeds:
>> 
>> 2076155.216687: nfsd_cb_work: addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
>> 2076155.216704: rpc_request: task:57746@6 nfs4_cbv1 CB_OFFLOAD (async)
>> 2076155.216975: rpc_stats_latency: task:57746@6 xid=0xff9faaf9 nfs4_cbv1 CB_OFFLOAD backlog=33 rtt=195 execute=282
>> 2076155.216977: nfsd_cb_done: addr=10.110.10.252:0 client 602eb645:daa037ae status=0
>> 
>> About 305 seconds later, the autodisconnect timer fires.
>> I'm not sure if this is the backchannel transport, but it looks suspicious:
>> 
>> 2076460.314954: xprt_disconnect_auto: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND
>> 2076460.314957: xprt_disconnect_done: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND
>> 
>> The next CB_OFFLOAD callback fails because the xprt has been marked
>> "disconnected" and the request's NOCONNECT flag is set.
>> 
>> 2076465.200136: nfsd_cb_work: addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
>> 2076465.200151: rpc_request: task:57752@6 nfs4_cbv1 CB_OFFLOAD (async)
>> 2076465.200168: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_encode
>> 2076465.200173: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect
>> 2076465.200174: rpc_call_rpcerror: task:57752@6 tk_status=-107 rpc_status=-107
>> 2076465.200174: rpc_task_run_action: task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107 action=rpc_exit_task
>> 2076465.200179: nfsd_cb_done: addr=10.110.10.252:0 client 602eb645:daa037ae status=-107
>> 2076465.200180: nfsd_cb_state: addr=10.110.10.252:0 client 602eb645:daa037ae state=FAULT
>> 
>> Perhaps RPC clients for NFSv4.1+ callbacks should be created with
>> the RPC_CLNT_CREATE_NO_IDLE_TIMEOUT flag.
> 
> I added that flag to the callback client creation flags, and so far,
> after a few hours of uptime, copying still works.

It seems like a good thing to do in any event (barring any other
observations). Can you send a patch for review?

> I can't say whether that is the appropriate fix for the issue at hand,
> as I lack the necessary insight into the NFS code, but I'll leave it
> running like that for now and observe.
> 
> I can also gladly test any other patches.
> 
> 
> Thanks,
> Timo

--
Chuck Lever
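
For illustration, here is a minimal sketch of the change Timo describes. It
assumes the backchannel client is created by setup_callback_client() in
fs/nfsd/nfs4callback.c and that its rpc_create_args currently sets .flags to
RPC_CLNT_CREATE_NOPING | RPC_CLNT_CREATE_QUIET; the exact flag set and
surrounding context may differ in the tree being tested, so this is not the
actual patch under review:

--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ setup_callback_client(): rpc_create_args initializer (context elided)
-		.flags		= (RPC_CLNT_CREATE_NOPING | RPC_CLNT_CREATE_QUIET),
+		/* Sketch: keep the idle (autodisconnect) timer from tearing
+		 * down the backchannel transport between callbacks.
+		 */
+		.flags		= (RPC_CLNT_CREATE_NOPING | RPC_CLNT_CREATE_QUIET |
+				   RPC_CLNT_CREATE_NO_IDLE_TIMEOUT),

With the idle timeout disabled, the backchannel xprt should no longer hit the
xprt_disconnect_auto path roughly five minutes after the last callback, so a
later CB_OFFLOAD is not failed with ENOTCONN by the NOCONNECT check in
call_connect, as seen in the trace above.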