On 11/17/2020 9:50 PM, jgg@xxxxxxxx wrote:
On Tue, Nov 17, 2020 at 04:54:30PM +0100, Timo Rothenpieler wrote:
The most likely candidate for this seems to be
0ec52f0194638e2d284ad55eba5a7aff753de1b9 (RDMA/mlx5: Disable
IB_DEVICE_MEM_MGT_EXTENSIONS if IB_WR_REG_MR can't work) which was merged
in 5.4.73. There were also a lot of mlx5 related changes in 5.4.71 though.
Though since this is a production system, I cannot sensibly bisect this.
That is very unlikely; neither mlx5 nor ipoib reads that bit.
That error print is very bad:
Nov 17 01:12:58 store01 kernel: mlx5_core 0000:01:00.0: cmd_work_handler:887:(pid 383): failed to allocate command entry
It really shouldn't happen.
This is more likely the cause:
commit 073fff8102062cd675170ceb54d90da22fe7e668
Author: Eran Ben Elisha <eranbe@xxxxxxxxxxxx>
Date: Tue Aug 4 10:40:21 2020 +0300
net/mlx5: Avoid possible free of command entry while timeout comp handler
[ Upstream commit 50b2412b7e7862c5af0cbf4b10d93bc5c712d021 ]
Upon command completion timeout, driver simulates a forced command
completion. In a rare case where real interrupt for that command arrives
simultaneously, it might release the command entry while the forced
handler might still access it.
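To make the race in that commit message concrete, here is a minimal toy model in Python. It assumes reference counting as the protection mechanism, which the quoted text does not state. It is not the mlx5 driver code, and CommandEntry, complete() and timeout_handler() are hypothetical names used only for illustration.

```python
class CommandEntry:
    """Toy stand-in for a driver command entry (hypothetical, not mlx5 code)."""

    def __init__(self, idx):
        self.idx = idx
        self.refcount = 1       # reference held on behalf of the submitter
        self.completed = False
        self.freed = False

    def get(self):
        assert not self.freed, "use after free"
        self.refcount += 1

    def put(self):
        assert not self.freed, "double free"
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True   # stands in for releasing the entry


def complete(ent):
    """Completion path shared by the real interrupt and the forced (timeout) path."""
    if ent.completed:
        return False            # the other path already completed this command
    ent.completed = True
    ent.put()                   # drop the submitter's reference exactly once
    return True


def timeout_handler(ent, real_irq_arrives):
    """Forced completion after a timeout; pins the entry while it works on it."""
    ent.get()                   # assumption: the handler takes its own reference
    if real_irq_arrives:
        complete(ent)           # the real completion wins the race
    won = complete(ent)         # the forced completion may now be a no-op
    still_alive = not ent.freed # our reference keeps the entry valid here
    ent.put()                   # the last reference frees the entry
    return won, still_alive
```

Without the get()/put() pair in timeout_handler(), the real completion in this model would drop the only reference and free the entry while the forced handler is still using it, which is the situation the commit message describes.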
Most likely the stable backport is missing some follow-up piece.
Eran, can you check why v5.4.77 is totally broken?
linux-5.4.y branch is missing the fixes below:
1. 1d5558b1f0de net/mlx5: poll cmd EQ in case of command timeout
2. 410bd754cd73 net/mlx5: Add retry mechanism to the command entry ...
The second fix in particular matches Timo's bug report.
It does not directly fix the offending commit; however, the offending
commit raised the probability of hitting this issue.
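As a rough illustration of what a retry on command entry allocation can look like in general, here is a small Python sketch. It is a generic bounded-retry pattern over a fixed-size slot pool, not the content of the commit listed above. SlotPool, alloc_with_retry() and the wait callback are hypothetical names.

```python
class SlotPool:
    """Fixed-size pool of command slots (hypothetical toy model)."""

    def __init__(self, size):
        self.free = list(range(size))

    def alloc(self):
        # Return a free slot index, or None when every slot is busy.
        return self.free.pop(0) if self.free else None

    def release(self, idx):
        self.free.append(idx)


def alloc_with_retry(pool, max_attempts, wait):
    """Try to allocate a slot up to max_attempts times.

    wait(attempt) is called after each failed attempt; in a real system
    this is where in-flight commands would get a chance to finish.
    """
    for attempt in range(max_attempts):
        idx = pool.alloc()
        if idx is not None:
            return idx
        wait(attempt)
    return None                 # caller reports the allocation failure
```

In this model, a single failed alloc() corresponds to the "failed to allocate command entry" message in the log; with the retry, the failure is reported only if no slot frees up within the allowed attempts.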
Saeed, can you notify the stable maintainers about this? I assume every
stable branch should have all 3 of these commits or none of them.
Eran
Jason