On 18.02.2021 19:30, Olga Kornievskaia wrote:
Thank you for getting tracepoints from a busy server, but can you get more? As suspected, the server is having issues sending the callback. I'm not sure why. Any chance you could turn on the server's sunrpc tracepoints, probably both the sunrpc and rdma tracepoints? I wonder if we can get any more info about why it's failing.
I have now isolated two of the machines on that cluster: one acts as the NFS server, exporting an ext4 mount, and the other is the same client as before. That way I managed to capture a trace and an ibdump of an entire cycle: mount + successful copy + 5 minutes later a copy that got stuck.
There was next to no other activity during those traces; you can find them attached. Another observation from this setup: unmounting and re-mounting the NFS share also gets it back into working condition for a while, no reboot necessary. During this trace I got "lucky": after just 5 minutes of waiting, it got stuck.
Before that, I had a run of mounting and then trying to copy every 5 minutes, which went on for 45 minutes without getting stuck, at which point I decided to remount once more.
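
For anyone who wants to repeat the cycle described above (copy, wait 5 minutes, copy again until one hangs), it could be sketched roughly as below. This is only an illustrative sketch, not the procedure actually used for the attached traces: the helper names, the file paths under /mnt/nfs, the 60-second timeout and the attempt limit are all assumptions, and it shells out to plain `cp`.

```python
import subprocess
import time


def copy_with_timeout(src, dst, timeout_s):
    """Run `cp src dst` and classify the outcome as "ok", "failed" or "stuck".

    "stuck" means cp did not finish within timeout_s seconds.
    """
    proc = subprocess.Popen(["cp", src, dst])
    try:
        return "ok" if proc.wait(timeout=timeout_s) == 0 else "failed"
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked inside the kernel may not
        # exit right away, so we deliberately do not wait for it here.
        proc.kill()
        return "stuck"


def attempts_until_stuck(try_copy, max_attempts, interval_s, sleep=time.sleep):
    """Call try_copy() every interval_s seconds, at most max_attempts times.

    Returns the 1-based number of the first attempt that reported "stuck",
    or None if no attempt got stuck.
    """
    for attempt in range(1, max_attempts + 1):
        if try_copy() == "stuck":
            return attempt
        if attempt < max_attempts:
            sleep(interval_s)
    return None


if __name__ == "__main__":
    # Hypothetical paths on the NFS mount; adjust to the real share.
    SRC = "/mnt/nfs/testfile"
    DST = "/mnt/nfs/testfile.copy"
    hit = attempts_until_stuck(
        lambda: copy_with_timeout(SRC, DST, timeout_s=60),
        max_attempts=12,      # assumed limit, about one hour of attempts
        interval_s=5 * 60,    # the 5 minute pause between copies
    )
    print("no stuck copy observed" if hit is None else f"copy {hit} got stuck")
```

Separating the timing loop from the copy itself keeps the loop easy to check without an NFS mount, and the copy step can be swapped for whatever actually triggers the hang.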
Attachment:
sniffer.pcap.xz
Description: Binary data
Attachment:
trace.dat.xz
Description: Binary data