Hi Timo,

Can you get a network trace?

Also, you say that copy_file_range() never returns (after what looks
like a successful copy) and the application hangs. Can you get sysrq
output of the process's stack? Run "echo t > /proc/sysrq-trigger", see
what gets dumped into /var/log/messages, locate your application there,
and report what its stack says.

On Sat, Feb 13, 2021 at 10:41 PM Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>
> On our Fileserver, running a few weeks old 5.10, we are running into a
> weird issue with NFS 4.2 Server-Side Copy and RDMA (and ZFS, though I'm
> not sure how relevant that is to the issue).
> The servers are connected via InfiniBand, on a Mellanox ConnectX-4 card,
> using the mlx5 driver.
>
> Anything using the copy_file_range() syscall to copy stuff just hangs.
> In strace, the syscall never returns.
>
> Simple way to reproduce on the client:
> > xfs_io -fc "pwrite 0 1M" testfile
> > xfs_io -fc "copy_range testfile" testfile.copy
>
> The second call just never exits. It sits in S+ state, with no CPU
> usage, and can easily be killed via Ctrl+C.
> I let it sit for a couple hours as well, it does not seem to ever complete.
>
> Some more observations about it:
>
> If I do a fresh reboot of the client, the operation works fine for a
> short while (like, 10~15 minutes). No load is on the system during that
> time, it's effectively idle.
>
> The operation actually does successfully copy all data. The size and
> checksum of the target file is as expected. It just never returns.
>
> This only happens when mounting via RDMA. Mounting the same NFS share
> via plain TCP has the operation work reliably.
>
> Had this issue with Kernel 5.4 already, and had hoped that 5.10 might
> have fixed it, but unfortunately it didn't.
>
> I tried two server and 30 different client machines, they all exhibit
> the exact same behaviour. So I'd carefully rule out a hardware issue.
>
>
> Any pointers on how to debug or maybe even fix this?
>
>
>
> Thanks,
> Timo
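
The sysrq-t dump covers every task on the system, so finding one
application in it can be tedious. Below is a rough sketch of a helper
that pulls out only the entries for a given command name. It assumes
the 5.10-style per-task header ("task:<comm> state:<c> ..."); the
function name, the regular expression, and the log path in the usage
comment are illustrative assumptions, not something taken from the
kernel or from this thread, and the pattern may need adjusting for
other kernel versions.

```python
import re

# Assumed header of a per-task entry in the sysrq-t dump (5.10-style), e.g.
#   "... task:xfs_io          state:S stack:    0 pid: 1234 ppid: ..."
# Task names containing spaces are not handled by this simple pattern.
TASK_HEADER = re.compile(r"task:(\S+)\s+state:(\S)")


def extract_task_stacks(log_lines, comm):
    """Return one list of log lines per dumped task whose name equals comm."""
    blocks = []
    current = None  # block being collected, or None while skipping other tasks
    for line in log_lines:
        line = line.rstrip("\n")
        header = TASK_HEADER.search(line)
        if header:
            # A new task entry starts here; keep it only if the name matches.
            if header.group(1) == comm:
                current = [line]
                blocks.append(current)
            else:
                current = None
        elif current is not None:
            # Call-trace lines belonging to the task currently being collected.
            current.append(line)
    return blocks


# Example use (log path and command name are placeholders):
#   with open("/var/log/messages", errors="replace") as f:
#       for block in extract_task_stacks(f, "xfs_io"):
#           print("\n".join(block))
```

If the matching task happens to be the last one in the dump, whatever
the kernel prints after the task list ends up appended to its block,
so the tail of the output may need trimming by hand.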