Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> · Sat, 20 Feb 2021 22:03:29 +0100

On 19.02.2021 19:48, Chuck Lever wrote:

On Feb 19, 2021, at 1:01 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:

On 19.02.2021 18:48, Chuck Lever wrote:
On Feb 19, 2021, at 12:38 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:

On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:

On 18.02.2021 19:30, Olga Kornievskaia wrote:
Thank you for getting tracepoints from a busy server but can you get
more? As suspected, the server is having issues sending the callback.
I'm not sure why. Any chance to turn on the server's sunrpc
tracespoints, probably both sunrpc and rdmas tracepoints, I wonder if
we can any more info about why it's failing?

I isolated out two of the machines on that cluster now, one acting as
NFS server from an ext4 mount, the other is the same client as before.
That way I managed to capture a trace and ibdump of an entire cycle:
mount + successful copy + 5 minutes later a copy that got stuck

Next to no noise happened during those traces, you can find them attached.

Another observation made due to this: unmount and re-mounting the NFS
share also gets it back into working condition for a while, no reboot
necessary.
During this trace, I got "lucky", and after just 5 minutes of waiting,
it got stuck.

Before that, I had a run of mount + trying to copy every 5 minutes where
it ran for 45 minutes without getting stuck. At which point I decided to
remount once more.

Timo, thank you for gathering the debug info.

Chuck, I need your help. Why would the server lose a callback channel?

           <...>-1461944 [001] 2076465.200151: rpc_request:
task:57752@6 nfs4_cbv1 CB_OFFLOAD (async)
           <...>-1461944 [001] 2076465.200151: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_reserve
           <...>-1461944 [001] 2076465.200154: xprt_reserve:
task:57752@6 xid=0x00a0aaf9
           <...>-1461944 [001] 2076465.200155: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_reserveresult
           <...>-1461944 [001] 2076465.200156: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_refresh
           <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_refreshresult
           <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_allocate
           <...>-1461944 [001] 2076465.200168: rpc_buf_alloc:
task:57752@6 callsize=548 recvsize=104 status=0
           <...>-1461944 [001] 2076465.200168: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_encode
           <...>-1461944 [001] 2076465.200173: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0
action=call_connect
           <...>-1461944 [001] 2076465.200174: rpc_call_rpcerror:
task:57752@6 tk_status=-107 rpc_status=-107
           <...>-1461944 [001] 2076465.200174: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107
action=rpc_exit_task

It's reporting ENOTCON. I'm not really sure if this is related to copy
offload but more perhaps doing callbacks over RDMA/IB.
If the client is awaiting a COPY notification callback, I can see why
a lost CB channel would cause the client to wait indefinitely.
The copy mechanism has to deal with this contingency... Perhaps the
server could force a disconnect so that the client and server have an
opportunity to re-establish the callback channel. <shrug>
In any event, the trace log above shows nothing more than "hey, it's
not working." Are there any rpcrdma trace events we can look at to
determine why the backchannel is getting lost?

The trace log attached to my previous mail has it enabled, along with nfsd and sunrpc.
I'm attaching the two files again here.

Thanks, Timo.

The first CB_OFFLOAD callback succeeds:

2076155.216687: nfsd_cb_work:         addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
2076155.216704: rpc_request:          task:57746@6 nfs4_cbv1 CB_OFFLOAD (async)

2076155.216975: rpc_stats_latency:    task:57746@6 xid=0xff9faaf9 nfs4_cbv1 CB_OFFLOAD backlog=33 rtt=195 execute=282
2076155.216977: nfsd_cb_done:         addr=10.110.10.252:0 client 602eb645:daa037ae status=0

About 305 seconds later, the autodisconnect timer fires. I'm not sure if this is the backchannel transport, but it looks suspicious:

2076460.314954: xprt_disconnect_auto: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND
2076460.314957: xprt_disconnect_done: peer=[10.110.10.252]:0 state=LOCKED|CONNECTED|BOUND

The next CB_OFFLOAD callback fails because the xprt has been marked "disconnected" and the request's NOCONNECT flag is set.

2076465.200136: nfsd_cb_work:         addr=10.110.10.252:0 client 602eb645:daa037ae procedure=CB_OFFLOAD
2076465.200151: rpc_request:          task:57752@6 nfs4_cbv1 CB_OFFLOAD (async)

2076465.200168: rpc_task_run_action:  task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE status=0 action=call_encode
2076465.200173: rpc_task_run_action:  task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect
2076465.200174: rpc_call_rpcerror:    task:57752@6 tk_status=-107 rpc_status=-107
2076465.200174: rpc_task_run_action:  task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107 action=rpc_exit_task
2076465.200179: nfsd_cb_done:         addr=10.110.10.252:0 client 602eb645:daa037ae status=-107
2076465.200180: nfsd_cb_state:        addr=10.110.10.252:0 client 602eb645:daa037ae state=FAULT

Perhaps RPC clients for NFSv4.1+ callback should be created with
the RPC_CLNT_CREATE_NO_IDLE_TIMEOUT flag.

I added that flag to the callback client creation flags, and so far 
after a few hours of uptime, copying still works.
Can't say anything about that being the appropriate fix for the issue at 
hand, as I lack the necessary insight into the NFS code, but I'll leave 
it running like that for now and observe.

Can also gladly test any other patches.

Thanks,
Timo

Attachment:
smime.p7s

Description: S/MIME Cryptographic Signature