Thanks for the reply and the link. I believe that is a different failure
mode involving __ib_cache_gid_add(). In my case there is no traffic (the
link is completely idle), and the failure persists no matter how many
times I toggle the link (exact commands below the quoted message).

-Jonathan

On Tue, Jul 18, 2023 at 9:28 PM William Kucharski
<william.kucharski@xxxxxxxxxx> wrote:
>
> Yes - it's NVIDIA issue 2326155:
>
> https://docs.nvidia.com/networking/display/MLNXOFEDv590560113/Known+Issues
>
> William Kucharski
>
> On Jul 18, 2023, at 19:06, Jonathan Nicklin <jnicklin@xxxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> I've encountered an unexpected error configuring RDMA/RoCEv2 with one of our
> 200G ConnectX-6 NICs. The issue reproduces consistently on kernels 5.4.249
> and 6.4.3.
>
> dmesg output:
>
> [    9.863871] mlx5_core 0000:01:00.0: mlx5_cmd_out_err:803:(pid 1440): SET_ROCE_ADDRESS(0x761) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x63c66), err(-22)
> [    9.881250] infiniband mlx5_2: add_roce_gid GID add failed port=1 index=0
> [    9.889095] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:ad3e:e3ff:fe92:b31b error=-22
>
> Device Type:     ConnectX6
> Part Number:     MCX653105A-HDA_Ax
> Description:     ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE ...
> PSID:            MT_0000000223
> PCI Device Name: 0000:01:00.0
>
> Firmware is up to date. LINK_TYPE is set to ETH(2) and ROCE_CONTROL is set
> to ROCE_ENABLE(2).
>
> Has anyone seen this syndrome? Any advice or assistance is appreciated.
>
> Thanks,
> -Jonathan
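
For completeness, "toggle the link" above means an administrative down/up
cycle along the following lines. This is just a sketch: enp1s0f0 is a
placeholder for the actual netdev backing mlx5_2, and the mlxconfig
invocation and grep pattern are only illustrative (use the mst device path
if the PCI address form is not accepted on your MFT version):

  # Bounce the port; ib_core retries the default GID add when the
  # interface comes back up (this is where the failure recurs).
  ip link set dev enp1s0f0 down
  ip link set dev enp1s0f0 up

  # Re-check the firmware configuration quoted above.
  mlxconfig -d 0000:01:00.0 query | grep -E 'LINK_TYPE|ROCE_CONTROL'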