Spurious instability with NFSoRDMA under moderate load

Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> · Sun, 16 May 2021 19:29:50 +0200

This has happened three times over the last couple of months, and I do 
not have a clear way to reproduce it.
It happens under moderate load, when lots of nodes read from and write 
to the server, though not in any particularly intense way: just normal 
program execution, writing of light logs, and other standard tasks.

The issues on the clients manifest in a multitude of ways. Most of the 
time, random IO operations just fail; more rarely, they hang 
indefinitely and make the process unkillable.
Another example would be: "Failed to remove 
'.../.nfs00000000007b03af00000001': Device or resource busy"

Once a client is in that state, the only way to get it back into order 
is a reboot.

On the server side, a single error CQE is dumped each time this problem 
happens. So far, I have always rebooted the server as well, to make sure 
everything is back in order. I am not sure whether that is strictly 
necessary.

[561889.198889] infiniband mlx5_0: dump_cqe:272:(pid 709): dump error cqe
[561889.198945] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[561889.198984] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[561889.199023] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[561889.199061] 00000030: 00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2

[985074.602880] infiniband mlx5_0: dump_cqe:272:(pid 599): dump error cqe
[985074.602921] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[985074.602946] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[985074.602970] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[985074.602994] 00000030: 00 00 00 00 00 00 88 13 08 00 01 46 f2 93 0b d3

[1648894.168819] infiniband ibp1s0: dump_cqe:272:(pid 696): dump error cqe
[1648894.168853] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[1648894.168878] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[1648894.168903] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[1648894.168928] 00000030: 00 00 00 00 00 00 88 13 08 00 01 08 6b d2 b9 d3

These all happened under different versions of the 5.10 kernel, the 
most recent one under 5.10.32 today.

Switching all clients to TCP seems to make NFS work perfectly reliably.

I'm not sure how to read those error dumps, so help there would be 
appreciated.
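
For reference, here is a rough attempt at splitting the last row of each 
dump (offset 0x30) into fields. This is only a sketch: it assumes the 
64-byte layout of struct mlx5_err_cqe as I understand it from 
include/linux/mlx5/device.h, and the small name tables are partial and 
filled in from the MLX5_CQE_SYNDROME_* and MLX5_OPCODE_* constants as I 
read them. The function name decode_err_cqe_tail is made up for this 
example, and I may well be misreading the format.

```python
import struct

# Partial tables, ASSUMED from the mlx5 headers; not exhaustive.
SYNDROMES = {
    0x05: "WR_FLUSH_ERR",
    0x12: "REMOTE_INVAL_REQ_ERR",
    0x13: "REMOTE_ACCESS_ERR",
    0x15: "TRANSPORT_RETRY_EXC_ERR",
    0x16: "RNR_RETRY_EXC_ERR",
}
WQE_OPCODES = {0x08: "RDMA_WRITE", 0x0A: "SEND", 0x10: "RDMA_READ"}
CQE_OPCODES = {0xD: "REQ_ERR", 0xE: "RESP_ERR"}


def decode_err_cqe_tail(hex_row):
    """Decode bytes 0x30..0x3f of an error CQE given as 'aa bb cc ...'."""
    raw = bytes.fromhex("".join(hex_row.split()))
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    # ASSUMED layout of this row: 6 reserved bytes, vendor_err_synd (u8),
    # syndrome (u8), s_wqe_opcode_qpn (be32), wqe_counter (be16),
    # signature (u8), op_own (u8).
    vendor, syndrome, opcode_qpn, wqe_counter, signature, op_own = \
        struct.unpack(">6xBBIHBB", raw)
    wqe_opcode = opcode_qpn >> 24       # top byte: opcode of the failed WQE
    cqe_opcode = op_own >> 4            # top nibble: CQE type
    return {
        "vendor_err_synd": vendor,
        "syndrome": syndrome,
        "syndrome_name": SYNDROMES.get(syndrome, "unknown"),
        "wqe_opcode": wqe_opcode,
        "wqe_opcode_name": WQE_OPCODES.get(wqe_opcode, "unknown"),
        "qpn": opcode_qpn & 0xFFFFFF,   # low 24 bits: queue pair number
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": cqe_opcode,
        "cqe_opcode_name": CQE_OPCODES.get(cqe_opcode, "unknown"),
    }


if __name__ == "__main__":
    row = "00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2"
    for key, value in decode_err_cqe_tail(row).items():
        print(key, hex(value) if isinstance(value, int) else value)
```

If that layout is right, all three dumps carry vendor syndrome 0x88, 
syndrome 0x13 and WQE opcode 0x08, which would read as a remote access 
error on an RDMA Write posted by the server. Only the QP number, WQE 
counter, signature and low bits of the last byte differ between them.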

Could this be similar to the spurious issues you get with UDP, where 
dropped packets cause havoc? Though I would not expect heavy load on IB 
to cause an error CQE to be logged.

Thanks,
Timo
