Hi Selvin,

> -----Original Message-----
> From: Selvin Xavier <selvin.xavier@xxxxxxxxxxxx>
> Sent: Friday, July 12, 2019 9:16 AM
> To: Parav Pandit <parav@xxxxxxxxxxxx>
> Cc: Yi Zhang <yi.zhang@xxxxxxxxxx>; linux-nvme@xxxxxxxxxxxxxxxxxxx; Daniel
> Jurgens <danielj@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx; Devesh
> Sharma <devesh.sharma@xxxxxxxxxxxx>
> Subject: Re: regression: nvme rdma with bnxt_re0 broken
>
> On Fri, Jul 12, 2019 at 8:19 AM Parav Pandit <parav@xxxxxxxxxxxx> wrote:
> >
> > Hi Yi Zhang,
> >
> > > -----Original Message-----
> > > From: linux-rdma-owner@xxxxxxxxxxxxxxx <linux-rdma-owner@xxxxxxxxxxxxxxx> On Behalf Of Yi Zhang
> > > Sent: Friday, July 12, 2019 7:23 AM
> > > To: Parav Pandit <parav@xxxxxxxxxxxx>; linux-nvme@xxxxxxxxxxxxxxxxxxx
> > > Cc: Daniel Jurgens <danielj@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx;
> > > Devesh Sharma <devesh.sharma@xxxxxxxxxxxx>; selvin.xavier@xxxxxxxxxxxx
> > > Subject: Re: regression: nvme rdma with bnxt_re0 broken
> > >
> > > Hi Parav
> > >
> > > Here is the info, let me know if it's enough, thanks.
> > >
> > > [root@rdma-perf-07 ~]$ echo -n "module ib_core +p" > /sys/kernel/debug/dynamic_debug/control
> > > [root@rdma-perf-07 ~]$ ifdown bnxt_roce
> > > Device 'bnxt_roce' successfully disconnected.
> > > [root@rdma-perf-07 ~]$ ifup bnxt_roce
> > > Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/16)
> > > [root@rdma-perf-07 ~]$ sh a.sh
> > > DEV      PORT INDEX GID                                     IPv4          VER DEV
> > > ---      ---- ----- ---                                     ------------  --- ---
> > > bnxt_re0 1    0     fe80:0000:0000:0000:020a:f7ff:fee3:6e32               v1  bnxt_roce
> > > bnxt_re0 1    1     fe80:0000:0000:0000:020a:f7ff:fee3:6e32               v2  bnxt_roce
> > > bnxt_re0 1    10    0000:0000:0000:0000:0000:ffff:ac1f:2bbb 172.31.43.187 v1  bnxt_roce.43
> > > bnxt_re0 1    11    0000:0000:0000:0000:0000:ffff:ac1f:2bbb 172.31.43.187 v2  bnxt_roce.43
> > > bnxt_re0 1    2     fe80:0000:0000:0000:020a:f7ff:fee3:6e32               v1  bnxt_roce.45
> > > bnxt_re0 1    3     fe80:0000:0000:0000:020a:f7ff:fee3:6e32               v2  bnxt_roce.45
> > > bnxt_re0 1    4     fe80:0000:0000:0000:020a:f7ff:fee3:6e32               v1  bnxt_roce.43
> > > bnxt_re0 1    5     fe80:0000:0000:0000:020a:f7ff:fee3:6e32               v2  bnxt_roce.43
> > > bnxt_re0 1    6     0000:0000:0000:0000:0000:ffff:ac1f:28bb 172.31.40.187 v1  bnxt_roce
> > > bnxt_re0 1    7     0000:0000:0000:0000:0000:ffff:ac1f:28bb 172.31.40.187 v2  bnxt_roce
> > > bnxt_re0 1    8     0000:0000:0000:0000:0000:ffff:ac1f:2dbb 172.31.45.187 v1  bnxt_roce.45
> > > bnxt_re0 1    9     0000:0000:0000:0000:0000:ffff:ac1f:2dbb 172.31.45.187 v2  bnxt_roce.45
> > > bnxt_re1 1    0     fe80:0000:0000:0000:020a:f7ff:fee3:6e33               v1  lom_2
> > > bnxt_re1 1    1     fe80:0000:0000:0000:020a:f7ff:fee3:6e33               v2  lom_2
> > > cxgb4_0  1    0     0007:433b:f5b0:0000:0000:0000:0000:0000               v1
> > > cxgb4_0  2    0     0007:433b:f5b8:0000:0000:0000:0000:0000               v1
> > > hfi1_0   1    0     fe80:0000:0000:0000:0011:7501:0109:6c60               v1
> > > hfi1_0   1    1     fe80:0000:0000:0000:0006:6a00:0000:0005               v1
> > > mlx5_0   1    0     fe80:0000:0000:0000:506b:4b03:00f3:8a38               v1
> > > n_gids_found=19
> > >
> > > [root@rdma-perf-07 ~]$ dmesg | tail -15
> > > [ 19.744421] IPv6: ADDRCONF(NETDEV_CHANGE): mlx5_ib0.8002: link becomes ready
> > > [ 19.758371] IPv6: ADDRCONF(NETDEV_CHANGE): mlx5_ib0.8004: link becomes ready
> > > [ 20.010469] hfi1 0000:d8:00.0: hfi1_0: Switching to NO_DMA_RTAIL
> > > [ 20.440580] IPv6: ADDRCONF(NETDEV_CHANGE): mlx5_ib0.8006: link becomes ready
> > > [ 21.098510] bnxt_en 0000:19:00.0 bnxt_roce: Too many traffic classes requested: 8. Max supported is 2.
> > > [ 21.324341] bnxt_en 0000:19:00.0 bnxt_roce: Too many traffic classes requested: 8. Max supported is 2.
> > > [ 22.058647] IPv6: ADDRCONF(NETDEV_CHANGE): hfi1_opa0: link becomes ready
> > > [ 211.407329] _ib_cache_gid_del: can't delete gid fe80:0000:0000:0000:020a:f7ff:fee3:6e32 error=-22
> > > [ 211.407334] _ib_cache_gid_del: can't delete gid fe80:0000:0000:0000:020a:f7ff:fee3:6e32 error=-22
> > > [ 211.425275] infiniband bnxt_re0: del_gid port=1 index=6 gid 0000:0000:0000:0000:0000:ffff:ac1f:28bb
> > > [ 211.425280] infiniband bnxt_re0: free_gid_entry_locked port=1 index=6 gid 0000:0000:0000:0000:0000:ffff:ac1f:28bb
> > > [ 211.425292] infiniband bnxt_re0: del_gid port=1 index=7 gid 0000:0000:0000:0000:0000:ffff:ac1f:28bb
> > > [ 211.425461] infiniband bnxt_re0: free_gid_entry_locked port=1 index=7 gid 0000:0000:0000:0000:0000:ffff:ac1f:28bb
> > > [ 225.474061] infiniband bnxt_re0: store_gid_entry port=1 index=6 gid 0000:0000:0000:0000:0000:ffff:ac1f:28bb
> > > [ 225.474075] infiniband bnxt_re0: store_gid_entry port=1 index=7 gid 0000:0000:0000:0000:0000:ffff:ac1f:28bb
> > >
> >
> > GID table looks fine.
> >
> The GID table has fe80:0000:0000:0000:020a:f7ff:fee3:6e32 entry repeated 6
> times. 2 for each interface bnxt_roce, bnxt_roce.43 and bnxt_roce.45. Is this
> expected to have same gid entries for vlan and base interfaces? As you
> mentioned earlier, driver's assumption that only 2 GID entries identical (one for
> RoCE v1 and one for RoCE v2) is breaking here.
>
Yes, this is correct behavior. Each vlan netdev interface is in a different L2 segment.
Vlan netdev has this ipv6 link local address. Hence, it is added to the GID table.
A given GID table entry is identified uniquely by GID+ndev+type(v1/v2).

Reviewing bnxt_qplib_add_sgid(), it does the comparison below.

	if (!memcmp(&sgid_tbl->tbl[i], gid, sizeof(*gid))) {

This comparison looks incomplete, as it ignores netdev and type.
But even with that, I would expect GID entry addition to fail for vlans for ipv6 link local entries.

But I am puzzled now: with the above memcmp() check, how do both the v1 and v2 entries get added to your table, and for vlans too?
I expect add_gid() and core/cache.c add_roce_gid() to fail for the duplicate entry.
But the GID table that Yi Zhang dumped has these vlan entries.
I am missing something.

Yi Zhang,
Instead of the last 15 lines of dmesg, can you please share the whole dmesg log, with debug enabled before creating the vlans, using
echo -n "module ib_core +p" > /sys/kernel/debug/dynamic_debug/control

Selvin,
Additionally, the driver shouldn't be looking for duplicate entries; the core already does that.
You might only want to do it for the v1/v2 case, as the bnxt driver has some dependency on it.
Can you please fix this part?

> > > On 7/12/19 12:18 AM, Parav Pandit wrote:
> > > > Sagi,
> > > >
> > > > This is better one to cc to linux-rdma.
> > > >
> > > > + Devesh, Selvin.
> > > >
> > > >> -----Original Message-----
> > > >> From: Parav Pandit
> > > >> Sent: Thursday, July 11, 2019 6:25 PM
> > > >> To: Yi Zhang <yi.zhang@xxxxxxxxxx>; linux-nvme@xxxxxxxxxxxxxxxxxxx
> > > >> Cc: Daniel Jurgens <danielj@xxxxxxxxxxxx>
> > > >> Subject: RE: regression: nvme rdma with bnxt_re0 broken
> > > >>
> > > >> Hi Yi Zhang,
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: Yi Zhang <yi.zhang@xxxxxxxxxx>
> > > >>> Sent: Thursday, July 11, 2019 3:17 PM
> > > >>> To: linux-nvme@xxxxxxxxxxxxxxxxxxx
> > > >>> Cc: Daniel Jurgens <danielj@xxxxxxxxxxxx>; Parav Pandit <parav@xxxxxxxxxxxx>
> > > >>> Subject: regression: nvme rdma with bnxt_re0 broken
> > > >>>
> > > >>> Hello
> > > >>>
> > > >>> 'nvme connect' failed when use bnxt_re0 on latest upstream build[1],
> > > >>> by bisecting I found it was introduced from v5.2.0-rc1 with [2], it
> > > >>> works after I revert it.
> > > >>> Let me know if you need more info, thanks.
> > > >>>
> > > >>> [1]
> > > >>> [root@rdma-perf-07 ~]$ nvme connect -t rdma -a 172.31.40.125 -s 4420 -n testnqn
> > > >>> Failed to write to /dev/nvme-fabrics: Bad address
> > > >>>
> > > >>> [root@rdma-perf-07 ~]$ dmesg
> > > >>> [ 476.320742] bnxt_en 0000:19:00.0: QPLIB: cmdq[0x4b9]=0x15 status 0x5
> > > > Devesh, Selvin,
> > > > What does this error mean?
> > bnxt_qplib_create_ah() is failing.
> >
> We are passing a wrong index for the GID to FW because of the assumption
> mentioned earlier.
> FW is failing command due to this.
>
> > > >>> [ 476.327103] infiniband bnxt_re0: Failed to allocate HW AH
> > > >>> [ 476.332525] nvme nvme2: rdma_connect failed (-14).
> > > >>> [ 476.343552] nvme nvme2: rdma connection establishment failed (-14)
> > > >>>
> > > >>> [root@rdma-perf-07 ~]$ lspci | grep -i Broadcom
> > > >>> 01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
> > > >>> 01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
> > > >>> 18:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02)
> > > >>> 19:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller (rev 01)
> > > >>> 19:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller (rev 01)
> > > >>>
> > > >>>
> > > >>> [2]
> > > >>> commit 823b23da71132b80d9f41ab667c68b112455f3b6
> > > >>> Author: Parav Pandit <parav@xxxxxxxxxxxx>
> > > >>> Date: Wed Apr 10 11:23:03 2019 +0300
> > > >>>
> > > >>>     IB/core: Allow vlan link local address based RoCE GIDs
> > > >>>
> > > >>>     IPv6 link local address for a VLAN netdevice has nothing to do with its
> > > >>>     resemblance with the default GID, because VLAN link local GID is in
> > > >>>     different layer 2 domain.
> > > >>>
> > > >>>     Now that RoCE MAD packet processing and route resolution consider the
> > > >>>     right GID index, there is no need for an unnecessary check which prevents
> > > >>>     the addition of vlan based IPv6 link local GIDs.
> > > >>>
> > > >>>     Signed-off-by: Parav Pandit <parav@xxxxxxxxxxxx>
> > > >>>     Reviewed-by: Daniel Jurgens <danielj@xxxxxxxxxxxx>
> > > >>>     Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
> > > >>>     Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
> > > >>>
> > > >>>
> > > >>>
> > > >>> Best Regards,
> > > >>> Yi Zhang
> > > >>>
> > > >> I need some more information from you to debug this issue as I
> > > >> don’t have the hw.
> > > >> The highlighted patch added support for IPv6 link local address
> > > >> for vlan. I am unsure how this can affect IPv4 AH creation for
> > > >> which there is failure.
> > > >>
> > > >> 1. Before you assign the IP address to the netdevice, Please do,
> > > >> echo -n "module ib_core +p" > /sys/kernel/debug/dynamic_debug/control
> > > >>
> > > >> Please share below output before doing nvme connect.
> > > >> 2. Output of script [1]
> > > >> $ show_gids script
> > > >> If getting this script is problematic, share the output of,
> > > >>
> > > >> $ cat /sys/class/infiniband/bnxt_re0/ports/1/gids/*
> > > >> $ cat /sys/class/infiniband/bnxt_re0/ports/1/gid_attrs/ndevs/*
> > > >> $ ip link show
> > > >> $ ip addr show
> > > >> $ dmesg
> > > >>
> > > >> [1]
> > > >> https://community.mellanox.com/s/article/understanding-show-gids-script#jive_content_id_The_Script
> > > >>
> > > >> I suspect that driver's assumption about GID indices might have
> > > >> gone wrong here in drivers/infiniband/hw/bnxt_re/ib_verbs.c.
> > > >> Lets see about results to confirm that.
> > > > _______________________________________________
> > > > Linux-nvme mailing list
> > > > Linux-nvme@xxxxxxxxxxxxxxxxxxx
> > > > http://lists.infradead.org/mailman/listinfo/linux-nvme
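
For illustration only, here is a small userspace sketch of the point made in the reply above, that a GID table entry is identified by GID+ndev+type and that a memcmp() on the GID bytes alone cannot tell the base and vlan entries apart. It is not the bnxt_re or ib_core code. Every name in it (the sketch_* types and helpers, and the use of an interface index to stand in for the netdev) is a hypothetical stand-in chosen for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_GID_LEN 16	/* a GID is 128 bits */

/* Hypothetical stand-ins; these are NOT the kernel's or the driver's types. */
enum sketch_gid_type { SKETCH_ROCE_V1, SKETCH_ROCE_V2 };

struct sketch_gid_entry {
	uint8_t gid[SKETCH_GID_LEN];	/* raw GID bytes */
	int ifindex;			/* stands in for the netdev (assumption) */
	enum sketch_gid_type type;	/* RoCE v1 or v2 */
};

/* Build one table entry from its three identifying fields. */
static struct sketch_gid_entry
sketch_make_entry(const uint8_t *gid, int ifindex, enum sketch_gid_type type)
{
	struct sketch_gid_entry e;

	memcpy(e.gid, gid, SKETCH_GID_LEN);
	e.ifindex = ifindex;
	e.type = type;
	return e;
}

/*
 * Lookup keyed on the GID bytes only, i.e. the same shape as a bare
 * memcmp() check. Returns the first matching index, or -1.
 */
static int sketch_find_by_gid(const struct sketch_gid_entry *tbl, int n,
			      const uint8_t *gid)
{
	for (int i = 0; i < n; i++)
		if (!memcmp(tbl[i].gid, gid, SKETCH_GID_LEN))
			return i;
	return -1;
}

/*
 * Lookup keyed on GID + netdev + type. Returns the index of the one
 * entry that matches all three fields, or -1.
 */
static int sketch_find_exact(const struct sketch_gid_entry *tbl, int n,
			     const uint8_t *gid, int ifindex,
			     enum sketch_gid_type type)
{
	for (int i = 0; i < n; i++)
		if (!memcmp(tbl[i].gid, gid, SKETCH_GID_LEN) &&
		    tbl[i].ifindex == ifindex && tbl[i].type == type)
			return i;
	return -1;
}
```

With a table holding the same link local GID for a base netdev and a vlan netdev, each as v1 and v2, sketch_find_by_gid() always returns the first entry, while sketch_find_exact() returns the entry for the requested netdev and type. Whether this is exactly how the wrong index reaches the FW in the bnxt_re case is not confirmed in this thread.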