Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

Olga Kornievskaia <aglo@xxxxxxxxx> · Fri, 19 Feb 2021 15:43:05 -0500

On Fri, Feb 19, 2021 at 1:48 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>
>
> > On Feb 19, 2021, at 1:01 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
> >
> > On 19.02.2021 18:48, Chuck Lever wrote:
> >>> On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> >>>
> >>> On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 18.02.2021 19:30, Olga Kornievskaia wrote:
> >>>>> Thank you for getting tracepoints from a busy server but can you get
> >>>>> more? As suspected, the server is having issues sending the callback.
> >>>>> I'm not sure why. Any chance to turn on the server's sunrpc
> >>>>> tracespoints, probably both sunrpc and rdmas tracepoints, I wonder if
> >>>>> we can any more info about why it's failing?
> >>>>
> >>>> I isolated out two of the machines on that cluster now, one acting as
> >>>> NFS server from an ext4 mount, the other is the same client as before.
> >>>> That way I managed to capture a trace and ibdump of an entire cycle:
> >>>> mount + successful copy + 5 minutes later a copy that got stuck
> >>>>
> >>>> Next to no noise happened during those traces, you can find them attached.
> >>>>
> >>>> Another observation made due to this: unmount and re-mounting the NFS
> >>>> share also gets it back into working condition for a while, no reboot
> >>>> necessary.
> >>>> During this trace, I got "lucky", and after just 5 minutes of waiting,
> >>>> it got stuck.
> >>>>
> >>>> Before that, I had a run of mount + trying to copy every 5 minutes where
> >>>> it ran for 45 minutes without getting stuck. At which point I decided to
> >>>> remount once more.
> >>>
> >>> Timo, thank you for gathering the debug info.
> >>>
> >>> Chuck, I need your help. Why would the server lose a callback channel?
> >>>
> >>>           <...>-1461944 [001] 2076465.200151: rpc_request:
> >>> task:57752@6 nfs4_cbv1 CB_OFFLOAD (async)
> >>>           <...>-1461944 [001] 2076465.200151: rpc_task_run_action:
> >>> task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> >>> runstate=RUNNING|ACTIVE status=0 action=call_reserve
> >>>           <...>-1461944 [001] 2076465.200154: xprt_reserve:
> >>> task:57752@6 xid=0x00a0aaf9
> >>>           <...>-1461944 [001] 2076465.200155: rpc_task_run_action:
> >>> task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> >>> runstate=RUNNING|ACTIVE status=0 action=call_reserveresult
> >>>           <...>-1461944 [001] 2076465.200156: rpc_task_run_action:
> >>> task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> >>> runstate=RUNNING|ACTIVE status=0 action=call_refresh
> >>>           <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
> >>> task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> >>> runstate=RUNNING|ACTIVE status=0 action=call_refreshresult
> >>>           <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
> >>> task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> >>> runstate=RUNNING|ACTIVE status=0 action=call_allocate
> >>>           <...>-1461944 [001] 2076465.200168: rpc_buf_alloc:
> >>> task:57752@6 callsize=548 recvsize=104 status=0
> >>>           <...>-1461944 [001] 2076465.200168: rpc_task_run_action:
> >>> task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> >>> runstate=RUNNING|ACTIVE status=0 action=call_encode
> >>>           <...>-1461944 [001] 2076465.200173: rpc_task_run_action:
> >>> task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> >>> runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0
> >>> action=call_connect
> >>>           <...>-1461944 [001] 2076465.200174: rpc_call_rpcerror:
> >>> task:57752@6 tk_status=-107 rpc_status=-107
> >>>           <...>-1461944 [001] 2076465.200174: rpc_task_run_action:
> >>> task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
> >>> runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107
> >>> action=rpc_exit_task
> >>>
> >>> It's reporting ENOTCON. I'm not really sure if this is related to copy
> >>> offload but more perhaps doing callbacks over RDMA/IB.
> >> If the client is awaiting a COPY notification callback, I can see why
> >> a lost CB channel would cause the client to wait indefinitely.
> >> The copy mechanism has to deal with this contingency... Perhaps the
> >> server could force a disconnect so that the client and server have an
> >> opportunity to re-establish the callback channel. <shrug>
> >> In any event, the trace log above shows nothing more than "hey, it's
> >> not working." Are there any rpcrdma trace events we can look at to
> >> determine why the backchannel is getting lost?
> >
> > The trace log attached to my previous mail has it enabled, along with nfsd and sunrpc.
> > I'm attaching the two files again here.
>
> Thanks, Timo.
>
> The first CB_OFFLOAD callback succeeds:
>
> 2076155.216687: nfsd_cb_work:         addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
> 2076155.216704: rpc_request:          task:57746@6 nfs4_cbv1 CB_OFFLOAD (async)
>
> 2076155.216975: rpc_stats_latency:    task:57746@6 xid=0xff9faaf9 nfs4_cbv1 CB_OFFLOAD backlog=33 rtt=195 execute=282
> 2076155.216977: nfsd_cb_done:         addr=10.110.10.252:0 client 602eb645:daa037ae status=0
>
>
> About 305 seconds later, the autodisconnect timer fires. I'm not sure if this is the backchannel transport, but it looks suspicious:
>
> 2076460.314954: xprt_disconnect_auto: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND
> 2076460.314957: xprt_disconnect_done: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND
>
>
> The next CB_OFFLOAD callback fails because the xprt has been marked "disconnected" and the request's NOCONNECT flag is set.
>
> 2076465.200136: nfsd_cb_work:         addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
> 2076465.200151: rpc_request:          task:57752@6 nfs4_cbv1 CB_OFFLOAD (async)
>
> 2076465.200168: rpc_task_run_action:  task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_encode
> 2076465.200173: rpc_task_run_action:  task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect
> 2076465.200174: rpc_call_rpcerror:    task:57752@6 tk_status=-107 rpc_status=-107
> 2076465.200174: rpc_task_run_action:  task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107 action=rpc_exit_task
> 2076465.200179: nfsd_cb_done:         addr=10.110.10.252:0 client 602eb645:daa037ae status=-107
> 2076465.200180: nfsd_cb_state:        addr=10.110.10.252:0 client 602eb645:daa037ae state=FAULT
>
>
> Perhaps RPC clients for NFSv4.1+ callback should be created with
> the RPC_CLNT_CREATE_NO_IDLE_TIMEOUT flag.

Do you know if this callback behavior is new? The same problem would
exist with delegation recalls.

>
>
> --
> Chuck Lever
>
>
>