On 21.06.2021 18:06, Timo Rothenpieler wrote:
On 17.05.2021 19:37, Timo Rothenpieler wrote:
On 17.05.2021 18:27, Chuck Lever III wrote:
Meanwhile, you could try 5.11 or 5.12 on your NFS server to see if the problem persists.
I updated the NFS server to 5.12.4 now and will observe and maybe try to cause some mixed load.
Had this running on 5.12 for a month now, and haven't observed any kind of instability so far. So I'd cautiously assume that something after 5.10 that made it into 5.11 or 5.12 fixed or greatly improved the situation.
Can't really sensibly bisect this, sadly, given the extremely long and uncertain turnaround times.
Ok, so this just happened again for the first time since upgrading to 5.12. Exact same thing, except this time no error cqe was dumped simultaneously (it still appeared in dmesg, but over a week before the issue showed up). So I'll assume the error cqe is unrelated to this issue.
I had no issues while running 5.12.12 and below. I recently (14 days ago or so) updated to 5.12.19, and now it's happening again. Unfortunately, given how rarely this issue happens, this could either be a regression between those two versions, or the issue was present all along and just never triggered for several months.
Makes me wonder if this is somehow related to the problem described in "NFS server regression in kernel 5.13 (tested w/ 5.13.9)". But the pattern of the issue does not look all that similar, given that for me, the hangs never recover, and I have RDMA in the mix.