On 2025/2/10 21:52, Halil Pasic wrote:
> On Fri, 10 Jan 2025 13:43:44 +0800
> Guangguan Wang <guangguan.wang@xxxxxxxxxxxxxxxxx> wrote:
>
>> We want to use SMC in containers in a cloud environment, and we encountered a
>> problem when using smc_pnet with commit 890a2cb4a966. In containers there are
>> several choices of container network, such as directly using the host network,
>> the virtual network IPVLAN, veth, etc. Different choices of container network
>> result in different netdev hierarchies. Examples of the netdev hierarchies are
>> shown below (eth0 and eth1 in the host are the netdevs directly related to the
>> physical device).
>>
>>  _______________________________
>> |  _________________            |
>> | |POD              |           |
>> | |                 |           |
>> | |  eth0_________  |           |
>> | |____|         |__|           |
>> |      |         |              |
>> |      |         |              |
>> |  eth1|base_ndev|  eth0______  |
>> |      |         |     | RDMA | |
>> |  host|_________|     |______| |
>>  -------------------------------
>>  netdev hierarchy if directly
>>  using host network
>>
>>  _______________________________
>> |  _________________            |
>> | |POD   __________ |           |
>> | |     |upper_ndev||           |
>> | | eth0|__________||           |
>> | |_______|_________|           |
>> |         |lower netdev         |
>> |       __|______               |
>> |  eth1|         |  eth0______  |
>> |      |base_ndev|     | RDMA | |
>> |  host|_________|     |______| |
>>  -------------------------------
>>  netdev hierarchy if using IPVLAN
>>
>>  _______________________________
>> |  _____________________        |
>> | |POD        _________ |       |
>> | |          |base_ndev||       |
>> | |eth0(veth)|_________||       |
>> | |___________|_________|       |
>> |             |pairs            |
>> |      _______|_                |
>> |     |         |   eth0______  |
>> | veth|base_ndev|      | RDMA | |
>> |     |_________|      |______| |
>> |      _________                |
>> | eth1|base_ndev|               |
>> | host|_________|               |
>>  -------------------------------
>>  netdev hierarchy if using veth
>>
>> For various reasons, eth1 in the host is not an RDMA-attached netdevice, so a
>> pnetid is needed to map eth1 (in the host) to the RDMA device so that the POD
>> can do SMC-R. eth1 (in the host) is managed by a CNI plugin (such as Terway, a
>> network management plugin for container environments), and in a cloud
>> environment the eth (in the host) can be inserted dynamically by the CNI when a
>> POD is created, and removed dynamically by the CNI when the POD is destroyed
>> and no POD is related to that eth (in the host) anymore.
>
> I'm pretty clueless when it comes to the details of CNI but I think
> I'm barely able to follow. Nevertheless if you have the feeling that
> my extrapolations are wrong, please do point it out.
>
>> It is hard for us to configure the pnetid on eth1 (in the host). So we
>> configure the pnetid on the netdevice that can be seen in the POD.
>
> Hm, this sounds like you could set PNETID on eth1 (in host) for each of
> the cases and everything would be cool (and would work), but because CNI
> and the environment do not support it, or support it in a very
> inconvenient way, you are looking for a workaround where PNETID is set
> in the POD. Is that right? Or did I get something wrong?

Right.

>> When doing SMC-R, both the container directly using the host network and the
>> container using the veth network can successfully match the RDMA device,
>> because the netdev carrying the configured pnetid is a base_ndev. But the
>> container using IPVLAN cannot match the RDMA device, and the 0x03030000
>> fallback happens, because the netdev carrying the configured pnetid is not a
>> base_ndev. Additionally, configuring the pnetid on eth1 (in the host) also
>> does not work for matching the RDMA device when using the veth network and
>> doing SMC-R in the POD.
>
> That I guess answers my question from the first paragraph. Setting
> PNETID on eth1 (host) would not be sufficient for veth. Right?

Right. It is also one of the reasons for setting the PNETID in the POD.

> Another silly question: is making the PNETID basically a part of the Pod
> definition shifting PNETID from the realm of infrastructure (i.e.
> configured by the cloud provider) to the realm of an application (i.e.
> configured by the tenant)?

No, the application does not need to know about the PNETID configuration. We
have a plugin in Kubernetes. When deploying a POD, the plugin automatically
adds an initContainer to the POD and automatically configures the PNETID in
that initContainer.
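(For reference, the configuration applied by the initContainer is essentially
equivalent to an smc-tools call such as "smc_pnet -a -I eth0 PNETX", issued in
the POD's network namespace against the netdev visible there, together with a
matching PNETID on the RDMA device side, either a hardware PNETID or an
"smc_pnet -a -D <ibdev> PNETX" entry. The names eth0 and PNETX here are only
examples.)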
> AFAIU veth (host) is bridged (or similar) to eth1 (host) and that is in
> the host, and this is where we make sure that the requirements for SMC-R
> are satisfied.
>
> But veth (host) could be attached to eth3 which is on a network not
> reachable via eth0 (host) or eth1 (host). In that case the pod could
> still set PNETID on veth (POD). Or?

Sorry, I forgot to add a precondition: it is a single-tenant scenario, and all
of the ethX in the host are in the same VPC (a cloud term; it can simply be
understood as a private network domain). The ethX being in the same VPC means
they have the same network reachability. Therefore, in this scenario, we will
not encounter the situation you mentioned.

>> My patch can resolve the problem we encountered, and it can also unify the
>> pnetid setup for the different network choices listed above, assuming the
>> pnetid is not limited to being configured on the base_ndev directly related
>> to the physical device (indeed, the current implementation does not limit
>> that yet).
>
> I see some problems here, but I'm afraid we see different problems. For
> me not being able to set eth0 (veth/POD)'s PNETID from the host is a
> problem. Please notice that with the current implementation users can
> only control the PNETID if infrastructure does not do so in the first
> place.
>
> Can you please help me reason about this? I'm unfortunately lacking
> Kubernetes skills here, and it is difficult for me to think along.

Yes, not being able to set eth0 (veth/POD)'s PNETID from the host is also a
problem. Even if eth1 (host) has a hardware PNETID, eth0 (veth/POD) cannot
find that hardware PNETID, because eth0 (veth/POD) and eth1 (host) are not in
one netdev hierarchy. But the two netdev hierarchies are related. Maybe
searching for the PNETID in all related netdev hierarchies can help resolve
this: for example, when finding the base_ndev, if the base_ndev is a netdev
that has a relationship with another netdev (veth etc.), jump to the related
netdev hierarchy through that relationship and iteratively find the base_ndev
there. It is just an idea for now; I have not done any research on it yet and
I am not sure whether it is feasible (a rough sketch of what I mean is
appended below my signature).

Thanks,
Guangguan Wang

>
> Regards,
> Halil
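P.S.: To make the idea above a little more concrete, here is a rough and
untested sketch, modeled on pnet_find_base_ndev() in net/smc/smc_pnet.c. The
cross-hierarchy jump via ndo_get_peer_dev(), the prev guard against bouncing
between the two ends of a pair, and the function name are only my assumptions,
not a tested implementation; reference counting and the fact that the peer may
live in another netns are ignored here.

/* Untested sketch of the "search related hierarchies" idea, modeled on
 * pnet_find_base_ndev() in net/smc/smc_pnet.c. When the base netdev of
 * one hierarchy exposes a related peer (e.g. a veth pair end implements
 * ndo_get_peer_dev()), jump to the peer and continue the walk there.
 */
static struct net_device *pnet_find_base_ndev_related(struct net_device *ndev)
{
	struct net_device *prev = NULL;
	int i, nest_lvl;

	ASSERT_RTNL();
again:
	/* walk down to the bottom of the current netdev hierarchy */
	nest_lvl = ndev->lower_level;
	for (i = 0; i < nest_lvl; i++) {
		struct list_head *lower = &ndev->adj_list.lower;

		if (list_empty(lower))
			break;
		lower = lower->next;
		ndev = netdev_lower_get_next(ndev, &lower);
	}
	/* jump to the related hierarchy, e.g. from one veth pair end to
	 * the other; the prev check keeps us from bouncing straight back
	 */
	if (ndev->netdev_ops->ndo_get_peer_dev) {
		struct net_device *peer;

		peer = ndev->netdev_ops->ndo_get_peer_dev(ndev);
		if (peer && peer != prev) {
			prev = ndev;
			ndev = peer;
			goto again;
		}
	}
	return ndev;
}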