Issue after 5.4.70->5.4.77 update: mlx5_core: reg_mr_callback: async reg mr failed. status -11

This started happening after I upgraded from 5.4.70 to 5.4.77 on a "Mellanox Technologies MT27700 Family [ConnectX-4]" adapter.

On every bootup, the following messages appear in dmesg:

store01 ~ # journalctl -b | grep mlx5
Nov 17 01:25:23 store01 kernel: mlx5_core 0000:01:00.0: firmware version: 12.28.1002
Nov 17 01:25:23 store01 kernel: mlx5_core 0000:01:00.0: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link)
Nov 17 01:25:23 store01 kernel: mlx5_core 0000:01:00.0: Port module event: module 0, Cable plugged
Nov 17 01:25:23 store01 kernel: mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0
Nov 17 01:25:23 store01 kernel: mlx5_core 0000:01:00.0: cmd_work_handler:887:(pid 376): failed to allocate command entry
Nov 17 01:25:23 store01 kernel: infiniband mlx5_0: reg_mr_callback:104:(pid 376): async reg mr failed. status -11
Nov 17 01:25:23 store01 kernel: mlx5_core 0000:01:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
Nov 17 01:25:23 store01 kernel: mlx5_core 0000:01:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
Nov 17 01:25:23 store01 kernel: mlx5_core 0000:01:00.0 ibp1s0: renamed from ib0

Other than those two error messages, the system and adapter appear to work fine.
Sporadically, however, the problem escalates and IPoIB fails to come up:

store01 ~ # journalctl -b -1 | grep mlx5
Nov 17 01:12:58 store01 kernel: mlx5_core 0000:01:00.0: firmware version: 12.28.1002
Nov 17 01:12:58 store01 kernel: mlx5_core 0000:01:00.0: 126.016 Gb/s available PCIe bandwidth (8 GT/s x16 link)
Nov 17 01:12:58 store01 kernel: mlx5_core 0000:01:00.0: Port module event: module 0, Cable plugged
Nov 17 01:12:58 store01 kernel: mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0
Nov 17 01:12:58 store01 kernel: mlx5_core 0000:01:00.0: cmd_work_handler:887:(pid 383): failed to allocate command entry
Nov 17 01:12:58 store01 kernel: infiniband mlx5_0: reg_mr_callback:104:(pid 383): async reg mr failed. status -11
Nov 17 01:12:58 store01 kernel: mlx5_core 0000:01:00.0: cmd_work_handler:887:(pid 383): failed to allocate command entry
Nov 17 01:12:58 store01 kernel: mlx5_core 0000:01:00.0: mlx5e_create_mdev_resources:104:(pid 1): alloc td failed, -11
Nov 17 01:12:58 store01 kernel: mlx5_0, 1: ipoib_intf_alloc failed -11

When that happens, only another reboot fixes IPoIB.
Neither issue occurs when booting 5.4.70.
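
To tell the two cases apart without reading the journal by hand, a boot could be classified from its mlx5 log lines with something like the following. This is only a sketch: the function name classify_boot and the three labels are made up for illustration, and the match strings are simply copied from the log excerpts above, so they may need adjusting for other kernel versions.

```python
# Substrings taken verbatim from the two log excerpts above.
# Seen on every boot of 5.4.77; the adapter still works afterwards.
WARN_PATTERNS = (
    "failed to allocate command entry",
    "async reg mr failed",
)
# Seen only on the boots where IPoIB did not come up.
FATAL_PATTERNS = (
    "alloc td failed",
    "ipoib_intf_alloc failed",
)


def classify_boot(lines):
    """Classify one boot's log lines (hypothetical helper).

    Returns "ipoib_failed" if any fatal signature is present,
    "warnings" if only the non-fatal signatures are present,
    and "clean" otherwise.
    """
    state = "clean"
    for line in lines:
        if any(p in line for p in FATAL_PATTERNS):
            return "ipoib_failed"
        if any(p in line for p in WARN_PATTERNS):
            state = "warnings"
    return state
```

Fed the output of "journalctl -b | grep mlx5" line by line, the first excerpt above would come out as "warnings" and the second as "ipoib_failed". The -11 in these messages is -EAGAIN on Linux.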

The most likely candidate seems to be 0ec52f0194638e2d284ad55eba5a7aff753de1b9 ("RDMA/mlx5: Disable IB_DEVICE_MEM_MGT_EXTENSIONS if IB_WR_REG_MR can't work"), which was merged in 5.4.73. There were also a lot of mlx5-related changes in 5.4.71, though.
Since this is a production system, I cannot sensibly bisect this.


Any ideas on how to mitigate this, such as backporting more patches or changing some settings, are appreciated.


