Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Feb 18, 2021 at 3:22 PM Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>
> On 18.02.2021 19:30, Olga Kornievskaia wrote:
> > Thank you for getting tracepoints from a busy server but can you get
> > more? As suspected, the server is having issues sending the callback.
> > I'm not sure why. Any chance to turn on the server's sunrpc
> > tracespoints, probably both sunrpc and rdmas tracepoints, I wonder if
> > we can any more info about why it's failing?
>
> I isolated out two of the machines on that cluster now, one acting as
> NFS server from an ext4 mount, the other is the same client as before.
> That way I managed to capture a trace and ibdump of an entire cycle:
> mount + successful copy + 5 minutes later a copy that got stuck
>
> Next to no noise happened during those traces, you can find them attached.
>
> Another observation made due to this: unmount and re-mounting the NFS
> share also gets it back into working condition for a while, no reboot
> necessary.
> During this trace, I got "lucky", and after just 5 minutes of waiting,
> it got stuck.
>
> Before that, I had a run of mount + trying to copy every 5 minutes where
> it ran for 45 minutes without getting stuck. At which point I decided to
> remount once more.

Timo, thank you for gathering the debug info.

Chuck, I need your help. Why would the server lose a callback channel?

           <...>-1461944 [001] 2076465.200151: rpc_request:
task:57752@6 nfs4_cbv1 CB_OFFLOAD (async)
           <...>-1461944 [001] 2076465.200151: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_reserve
           <...>-1461944 [001] 2076465.200154: xprt_reserve:
task:57752@6 xid=0x00a0aaf9
           <...>-1461944 [001] 2076465.200155: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_reserveresult
           <...>-1461944 [001] 2076465.200156: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_refresh
           <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_refreshresult
           <...>-1461944 [001] 2076465.200163: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_allocate
           <...>-1461944 [001] 2076465.200168: rpc_buf_alloc:
task:57752@6 callsize=548 recvsize=104 status=0
           <...>-1461944 [001] 2076465.200168: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE status=0 action=call_encode
           <...>-1461944 [001] 2076465.200173: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0
action=call_connect
           <...>-1461944 [001] 2076465.200174: rpc_call_rpcerror:
task:57752@6 tk_status=-107 rpc_status=-107
           <...>-1461944 [001] 2076465.200174: rpc_task_run_action:
task:57752@6 flags=ASYNC|DYNAMIC|SOFT|NOCONNECT
runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=-107
action=rpc_exit_task

It's reporting ENOTCON. I'm not really sure if this is related to copy
offload but more perhaps doing callbacks over RDMA/IB.



[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux