On 7/19/2023 9:40 AM, Jonathan Nicklin wrote:
> Thanks for the reply and the link. I believe that is a different
> failure mode involving __ib_cache_gid_add(). In my case, there is no
> traffic (the link is completely idle). And, the failure mode is
> persistent no matter how many times I "toggle the link."
>
> -Jonathan
>
> On Tue, Jul 18, 2023 at 9:28 PM William Kucharski
> <william.kucharski@xxxxxxxxxx> wrote:
>>
>> Yes - it's NVIDIA issue 2326155:
>>
>> https://docs.nvidia.com/networking/display/MLNXOFEDv590560113/Known+Issues
>>
>> William Kucharski
>>
>> On Jul 18, 2023, at 19:06, Jonathan Nicklin <jnicklin@xxxxxxxxxxxxxxx> wrote:
>>
>> Hello,
>>
>> I've encountered an unexpected error configuring RDMA/RoCEv2 with one of our
>> 200G ConnectX6 NICs. This issue reproduces consistently on 5.4.249 and 6.4.3.
>>
>> dmesg output:
>>
>> [ 9.863871] mlx5_core 0000:01:00.0: mlx5_cmd_out_err:803:(pid 1440): SET_ROCE_ADDRESS(0x761) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x63c66), err(-22)
>> [ 9.881250] infiniband mlx5_2: add_roce_gid GID add failed port=1 index=0
>> [ 9.889095] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:ad3e:e3ff:fe92:b31b error=-22
>>

This syndrome seems to indicate a multicast source_mac, which is not
allowed. For more information, please contact your NVIDIA support
representative. Thanks.

>> Device Type: ConnectX6
>> Part Number: MCX653105A-HDA_Ax
>> Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE ...
>> PSID: MT_0000000223
>> PCI Device Name: 0000:01:00.0
>>
>> Firmware is up to date. LINK_TYPE is set to ETH(2) and ROCE_CONTROL is
>> ROCE_ENABLE(2).
>>
>> Has anyone seen this syndrome? Any advice or assistance is appreciated.
>>
>> Thanks,
>> -Jonathan
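
The multicast source_mac diagnosis can be cross-checked against the GID in the dmesg output. The sketch below assumes the failing GID is the default link-local GID, built from the netdev MAC via modified EUI-64 (ff:fe inserted in the middle, universal/local bit flipped). That is an assumption about how this GID was generated, not something the log states. The helper names are made up for illustration.

```python
import ipaddress


def mac_from_link_local_gid(gid: str) -> str:
    """Recover the MAC from a link-local GID, assuming modified EUI-64."""
    iid = ipaddress.IPv6Address(gid).packed[8:]  # low 64 bits: interface ID
    if iid[3:5] != b"\xff\xfe":
        raise ValueError("interface ID does not look EUI-64 derived")
    # Undo the universal/local bit flip on the first octet, drop ff:fe.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{b:02x}" for b in mac)


def is_multicast_mac(mac: str) -> bool:
    """A MAC is multicast when the least significant bit of octet 0 is set."""
    return bool(int(mac.split(":")[0], 16) & 0x01)


# GID copied from the __ib_cache_gid_add line in the dmesg output.
gid = "fe80:0000:0000:0000:ad3e:e3ff:fe92:b31b"
mac = mac_from_link_local_gid(gid)
print(mac, is_multicast_mac(mac))  # af:3e:e3:92:b3:1b True
```

Under that assumption the interface MAC would be af:3e:e3:92:b3:1b, whose first octet (0xaf) has the group bit set. That would match a "multicast source_mac" rejection. Comparing this against the address reported by `ip link show` for the port would confirm whether the netdev really has that MAC.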