On Tue, Nov 17, 2020 at 04:54:30PM +0100, Timo Rothenpieler wrote:

> The most likely candidate for this seems to be
> 0ec52f0194638e2d284ad55eba5a7aff753de1b9 (RDMA/mlx5: Disable
> IB_DEVICE_MEM_MGT_EXTENSIONS if IB_WR_REG_MR can't work) which was merged
> in 5.4.73. There were also a lot of mlx5 related changes in 5.4.71 though.
>
> Though since this is a production system, I cannot sensibly bisect this.

It is very unlikely, neither mlx5 nor ipoib read that bit.

That error print is very bad:

 Nov 17 01:12:58 store01 kernel: mlx5_core 0000:01:00.0: cmd_work_handler:887:(pid 383): failed to allocate command entry

It really shouldn't happen.

This is more likely the cause:

commit 073fff8102062cd675170ceb54d90da22fe7e668
Author: Eran Ben Elisha <eranbe@xxxxxxxxxxxx>
Date:   Tue Aug 4 10:40:21 2020 +0300

    net/mlx5: Avoid possible free of command entry while timeout comp handler

    [ Upstream commit 50b2412b7e7862c5af0cbf4b10d93bc5c712d021 ]

    Upon command completion timeout, driver simulates a forced command
    completion. In a rare case where real interrupt for that command arrives
    simultaneously, it might release the command entry while the forced
    handler might still access it.

Most likely it is missing some element. Eran, can you check why v5.4.77 is
totally broken?

Jason