copy_file_range() infinitely hangs on NFSv4.2 over RDMA


On our file server, which runs a few-weeks-old 5.10 kernel, we are running into a weird issue with NFS 4.2 server-side copy over RDMA (ZFS is also involved, though I'm not sure how relevant that is to the issue). The servers are connected via InfiniBand, using Mellanox ConnectX-4 cards and the mlx5 driver.

Anything using the copy_file_range() syscall to copy stuff just hangs.
In strace, the syscall never returns.

Simple way to reproduce on the client:
> xfs_io -fc "pwrite 0 1M" testfile
> xfs_io -fc "copy_range testfile" testfile.copy

The second call just never exits. It sits in S+ state, with no CPU usage, and can easily be killed via Ctrl+C.
I also let it sit for a couple of hours; it never seems to complete.
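
For anyone who wants to check this without xfs_io, here is a minimal sketch of a reproducer with a watchdog. It assumes Python 3.8+ on Linux (for os.copy_file_range()); the name copy_with_timeout, the 60-second default and the "ok"/"hang"/"error" labels are made up for illustration and are not part of any tool mentioned above. The copy runs in a child process so that a call which never returns can be killed and reported instead of blocking the shell.

```python
import subprocess
import sys

# Child script: copy argv[1] to argv[2] using only copy_file_range().
CHILD = r"""
import os, sys
with open(sys.argv[1], "rb") as fin, open(sys.argv[2], "wb") as fout:
    remaining = os.fstat(fin.fileno()).st_size
    while remaining > 0:
        n = os.copy_file_range(fin.fileno(), fout.fileno(), remaining)
        if n == 0:  # source ended earlier than expected
            break
        remaining -= n
"""


def copy_with_timeout(src, dst, timeout=60.0):
    """Return "ok", "hang" or "error" (labels are illustrative only)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD, src, dst],
            timeout=timeout,            # child is killed when this expires
            stderr=subprocess.DEVNULL,  # hide the child's traceback, if any
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if result.returncode == 0 else "error"
```

Calling copy_with_timeout("testfile", "testfile.copy") on the RDMA mount would be expected to report "hang" if the behaviour matches what I see with xfs_io, and "ok" on a TCP mount. An "error" result most likely means copy_file_range() is not supported for that pair of files.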

Some more observations about it:

After a fresh reboot of the client, the operation works fine for a short while (like, 10~15 minutes). There is no load on the system during that time; it's effectively idle.

The operation actually does successfully copy all data. The size and checksum of the target file are as expected. It just never returns.
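
The size and checksum comparison can be sketched roughly as follows. SHA-256 is only an assumption for the example (any checksum tool does the same job), and sha256_of and same_content are hypothetical helper names.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(src, dst):
    """True if both files have the same size and the same SHA-256 digest."""
    return (os.path.getsize(src) == os.path.getsize(dst)
            and sha256_of(src) == sha256_of(dst))
```

For the reproducer above, same_content("testfile", "testfile.copy") can be run from a second shell while the copy is still hanging.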

This only happens when mounting via RDMA. When the same NFS share is mounted via plain TCP, the operation works reliably.

I already had this issue with kernel 5.4 and had hoped that 5.10 would fix it, but unfortunately it didn't.

I tried two servers and 30 different client machines, and they all exhibit the exact same behaviour, so I'd cautiously rule out a hardware issue.


Any pointers on how to debug or maybe even fix this?



Thanks,
Timo


