On 2025/1/9 11:04, Halil Pasic wrote: > On Wed, 8 Jan 2025 12:57:00 +0800 > Guangguan Wang <guangguan.wang@xxxxxxxxxxxxxxxxx> wrote: > >>> sorry for chiming in late. Wenjia is on vacation and Jan is out sick! >>> After some reading and thinking I could not figure out how 890a2cb4a966 >>> ("net/smc: rework pnet table") is broken. >> >> Before commit 890a2cb4a966: >> smc_pnet_find_roce_resource >> smc_pnet_find_roce_by_pnetid(ndev, ...) /* lookup via hardware-defined pnetid */ >> smc_pnetid_by_dev_port(base_ndev, ...) >> smc_pnet_find_roce_by_table(ndev, ...) /* lookup via SMC PNET table */ >> { >> ... >> list_for_each_entry(pnetelem, &smc_pnettable.pnetlist, list) { >> if (ndev == pnetelem->ndev) { /* notice here, it was ndev to matching pnetid element in pnet table */ >> ... >> } >> >> After commit 890a2cb4a966: >> smc_pnet_find_roce_resource >> smc_pnet_find_roce_by_pnetid >> { >> ... >> base_ndev = pnet_find_base_ndev(ndev); /* rename the variable name to base_ndev for better understanding */ >> smc_pnetid_by_dev_port(base_ndev, ...) >> smc_pnet_find_ndev_pnetid_by_table(base_ndev, ...) >> { >> ... >> list_for_each_entry(pnetelem, &smc_pnettable.pnetlist, list) { >> if (base_ndev == pnetelem->ndev) { /* notice here, it is base_ndev to matching pnetid element in pnet table */ >> ... >> } >> >> } >> >> The commit 890a2cb4a966 has changed ndev to base_ndev when matching pnetid element in pnet table. >> But in the function smc_pnet_add_eth, the pnetid is attached to the ndev itself, not the base_ndev. >> smc_pnet_add_eth(...) >> { >> ... >> ndev = dev_get_by_name(net, eth_name); >> ... >> if (new_netdev) { >> if (ndev) { >> new_pe->ndev = ndev; >> netdev_tracker_alloc(ndev, &new_pe->dev_tracker, >> GFP_ATOMIC); >> } >> list_add_tail(&new_pe->list, &pnettable->pnetlist); >> mutex_unlock(&pnettable->lock); >> } else { >> ... >> } > > I still not understand why do you think that 890a2cb4a966~1 is better > than 890a2cb4a966 even if things changed with 890a2cb4a966 which > I did not verify for myself but am willing to assume. > > Is there some particular setup that you think would benefit from > you patch? I.e. going back to the 890a2cb4a966~1 behavior I suppose. > We want to use SMC in container on cloud environment, and encounter problem when using smc_pnet with commit 890a2cb4a966. In container, there have choices of different container network, such as directly using host network, virtual network IPVLAN, veth, etc. Different choices of container network have different netdev hierarchy. Examples of netdev hierarchy show below. (eth0 and eth1 in host below is the netdev directly related to the physical device). _______________________________ ________________________________ | _________________ | | _________________ | | |POD | | | |POD __________ | | | | | | | | |upper_ndev| | | | | eth0_________ | | | |eth0|__________| | | | |____| |__| | | |_______|_________| | | | | | | |lower netdev | | | | | | __|______ | | eth1|base_ndev| eth0_______ | | eth1| | eth0_______ | | | | | RDMA || | |base_ndev| | RDMA || | host |_________| |_______|| | host |_________| |_______|| ———————————————————————————————- ———————————————————————————————- netdev hierarchy if directly netdev hierarchy if using IPVLAN using host network _______________________________ | _____________________ | | |POD _________ | | | | |base_ndev|| | | |eth0(veth)|_________|| | | |____________|________| | | |pairs | | _______|_ | | | | eth0_______ | | veth|base_ndev| | RDMA || | |_________| |_______|| | _________ | | eth1|base_ndev| | | host |_________| | ——————————————————————————————— netdev hierarchy if using veth Due to some reasons, the eth1 in host is not RDMA attached netdevice, pnetid is needed to map the eth1(in host) with RDMA device so that POD can do SMC-R. Because the eth1(in host) is managed by CNI plugin(such as Terway, network management plugin in container environment), and in cloud environment the eth(in host) can dynamically be inserted by CNI when POD create and dynamically be removed by CNI when POD destroy and no POD related to the eth(in host) anymore. It is hard for us to config the pnetid to the eth1(in host). So we config the pnetid to the netdevice which can be seen in POD. When do SMC-R, both the container directly using host network and the container using veth network can successfully match the RDMA device, because the configured pnetid netdev is a base_ndev. But the container using IPVLAN can not successfully match the RDMA device and 0x03030000 fallback happens, because the configured pnetid netdev is not a base_ndev. Additionally, if config pnetid to the eth1(in host) also can not work for matching RDMA device when using veth network and doing SMC-R in POD. My patch can resolve the problem we encountered and also can unify the pnetid setup of different network choices list above, assuming the pnetid is not limited to config to the base_ndev directly related to the physical device(indeed, the current implementation has not limited it yet). > I think I showed a valid and practical setup that would break with your > patch as is. Do you agree with that statement? Did you mean " Now for something like a bond of two OSA interfaces, I would expect the two legs of the bond to probably have a "HW PNETID", but the netdev representing the bond itself won't have one unless the Linux admin defines a software PNETID, which is work, and can't have a HW PNETID because it is a software construct within Linux. Breaking for example an active-backup bond setup where the legs have HW PNETIDs and the admin did not bother to specify a PNETID for the bond is not acceptable. " ? If the legs have HW pnetids, add pnetid to bond netdev will fail as smc_pnet_add_eth will check whether the base_ndev already have HW pnetid. If the legs without HW pnetids, and admin add pnetids to legs through smc_pnet. Yes, my patch will break the setup. What Paolo suggests(both checking ndev and base_ndev, and replace || by && )can help compatible with the setup. Thanks, Guangguan Wang > > Regards, > Halil