Re: copy_file_range() infinitely hangs on NFSv4.2 over RDMA

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Timo,

Can you get a network trace? Also, you say that the copy_file_range()
(after what looks like a successful copy) never returns (and
application hangs), can you get a sysrq output of what the process's
stack (echo t > /proc/sysrq-trigger  and see what gets dumped into the
var log messages and locate your application and report what the stack
says)?

On Sat, Feb 13, 2021 at 10:41 PM Timo Rothenpieler
<timo@xxxxxxxxxxxxxxxx> wrote:
>
> On our Fileserver, running a few weeks old 5.10, we are running into a
> weird issue with NFS 4.2 Server-Side Copy and RDMA (and ZFS, though I'm
> not sure how relevant that is to the issue).
> The servers are connected via InfiniBand, on a Mellanox ConnectX-4 card,
> using the mlx5 driver.
>
> Anything using the copy_file_range() syscall to copy stuff just hangs.
> In strace, the syscall never returns.
>
> Simple way to reproduce on the client:
>  > xfs_io -fc "pwrite 0 1M" testfile
>  > xfs_io -fc "copy_range testfile" testfile.copy
>
> The second call just never exits. It sits in S+ state, with no CPU
> usage, and can easily be killed via Ctrl+C.
> I let it sit for a couple hours as well, it does not seem to ever complete.
>
> Some more observations about it:
>
> If I do a fresh reboot of the client, the operation works fine for a
> short while (like, 10~15 minutes). No load is on the system during that
> time, it's effectively idle.
>
> The operation actually does successfully copy all data. The size and
> checksum of the target file is as expected. It just never returns.
>
> This only happens when mounting via RDMA. Mounting the same NFS share
> via plain TCP has the operation work reliably.
>
> Had this issue with Kernel 5.4 already, and had hoped that 5.10 might
> have fixed it, but unfortunately it didn't.
>
> I tried two server and 30 different client machines, they all exhibit
> the exact same behaviour. So I'd carefully rule out a hardware issue.
>
>
> Any pointers on how to debug or maybe even fix this?
>
>
>
> Thanks,
> Timo



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux