On Thu, Feb 23, 2023 at 8:37 AM Zhu Yanjun <yanjun.zhu@xxxxxxxxx> wrote:
>
> On 2023/2/14 14:06, Zhu Yanjun wrote:
> > From: Zhu Yanjun <yanjun.zhu@xxxxxxxxx>
> >
> > When the "ip link add" command is run to add a rxe rdma link in a net
> > namespace, normally this rxe rdma link can not work in that net
> > namespace.
> >
> > The root cause is that a sock listening on udp port 4791 is created
> > in init_net when the rdma_rxe module is loaded into the kernel. It is
> > difficult for other net namespaces to use this sock.
> >
> > The following commits solve this problem.
> >
> > In the first commit, the creation of the sock listening on udp port
> > 4791 is moved from the module_init function to the rdma link creating
> > functions. That is, after the module rdma_rxe is loaded, the sock is
> > not created. When the "rdma link add ..." command is run, the sock is
> > created. So when a rdma link is created in a net namespace, the sock
> > is created in that net namespace.
> >
> > In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
> > check whether the sock exists in the net namespace. If it does, the
> > rdma link increases the reference count of this sock and then continues
> > with its other work instead of creating a new sock to listen on udp
> > port 4791. Since the network notifier is global, it is registered when
> > the module rdma_rxe is loaded.
> >
> > After the rdma link is created, the command "rdma link del" deletes
> > the rdma link, and at the same time the sock is checked. If the
> > reference count of this sock is greater than the sock reference count
> > needed by the udp tunnel, the sock reference count is decreased by one.
> > If they are equal, this rdma link is the last one. As such, the udp
> > tunnel is shut down and the sock is closed. The above work should be
> > implemented in a dellink function. But currently there is no dellink
> > function in rxe.
> > So the 3rd commit adds the dellink function pointer. And the 4th
> > commit implements the dellink function in rxe.
> >
> > By now, it is no longer necessary to keep a global variable to store
> > the sock listening on udp port 4791. This global variable can be
> > replaced entirely by the functions udp4_lib_lookup and udp6_lib_lookup.
> > Because the function udp6_lib_lookup is in the fast path, a member
> > variable l_sk6 is added to store the sock. If l_sk6 is NULL,
> > udp6_lib_lookup is called to look up the sock, and the sock is then
> > stored in l_sk6 so that it can be used directly in the future.
> >
> > All the above work has been done in init_net. And it can also work in
> > a net namespace. So init_net is replaced by the individual net
> > namespace. This is what the 6th commit does. Because a rxe device
> > depends on the net device and the sock listening on udp port 4791,
> > every rxe device is in exclusive mode in its individual net namespace.
> > Other rdma netns operations will be considered in the future.
> >
> > In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
> > functions are added. When a new net namespace is created, the init
> > function initializes the sk4 and sk6 socks. The 2 socks are then
> > released when the net namespace is destroyed. The functions
> > rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 get and set sk4 in the net
> > namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6
> > handle sk6. Then sk4 and sk6 are used in the previous commits.
> >
> > As the sk4 and sk6 in the pernet namespace can be accessed, it is not
> > necessary to add a new l_sk6. As such, in the 8th commit, the l_sk6 is
> > replaced with the sk6 in the pernet namespace.
> >
> > Test steps:
> > 1) Suppose that 2 NICs are in 2 different net namespaces.
> >
> > # ip netns exec net0 ip link
> > 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
> >    link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
> >    altname enp5s0
> >
> > # ip netns exec net1 ip link
> > 4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
> >    link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
> >
> > 2) Add rdma link in the different net namespace
> > net0:
> > # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
> >
> > net1:
> > # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
> >
> > 3) Run rping test.
> > net0
> > # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
> > [1] 1737
> > # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
> > verbose
> > count 1
> > ...
> > ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
> > ...
> >
> > 4) Remove the rdma links from the net namespaces.
> > net0:
> > # ip netns exec net0 ss -lu
> > State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
> > UNCONN  0       0       0.0.0.0:4791        0.0.0.0:*
> > UNCONN  0       0       [::]:4791           [::]:*
> >
> > # ip netns exec net0 rdma link del rxe0
> >
> > # ip netns exec net0 ss -lu
> > State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
> >
> > net1:
> > # ip netns exec net0 ss -lu
> > State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
> > UNCONN  0       0       0.0.0.0:4791        0.0.0.0:*
> > UNCONN  0       0       [::]:4791           [::]:*
> >
> > # ip netns exec net1 rdma link del rxe1
> >
> > # ip netns exec net0 ss -lu
> > State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
> >
> > V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
> >            verify rdma link is removed.
> >         2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
> >         3) Replace l_sk6 with sk6 of pernet_name_space

Thanks,
Tested-by: Rain River <rain.1986.08.12@xxxxxxxxx>

> >
> > V1->V2: Add the explicit initialization of sk6.
> >         Add netdev@xxxxxxxxxxxxxxx.
>
> Zhu Yanjun
>
> >
> > Zhu Yanjun (8):
> >   RDMA/rxe: Creating listening sock in newlink function
> >   RDMA/rxe: Support more rdma links in init_net
> >   RDMA/nldev: Add dellink function pointer
> >   RDMA/rxe: Implement dellink in rxe
> >   RDMA/rxe: Replace global variable with sock lookup functions
> >   RDMA/rxe: add the support of net namespace
> >   RDMA/rxe: Add the support of net namespace notifier
> >   RDMA/rxe: Replace l_sk6 with sk6 in net namespace
> >
> >  drivers/infiniband/core/nldev.c     |   6 ++
> >  drivers/infiniband/sw/rxe/Makefile  |   3 +-
> >  drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
> >  drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++-------
> >  drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
> >  drivers/infiniband/sw/rxe/rxe_ns.c  | 128 ++++++++++++++++++++++++++++
> >  drivers/infiniband/sw/rxe/rxe_ns.h  |  11 +++
> >  include/rdma/rdma_netlink.h         |   2 +
> >  8 files changed, 267 insertions(+), 40 deletions(-)
> >  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
> >  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
> >
>
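
For anyone following the reference-count rules described in the cover
letter, a minimal userspace sketch in C may help. It is not the rxe code
and makes several assumptions: all names (model_netns, model_sock,
model_newlink, model_dellink, model_listening) are hypothetical, the
socket is reduced to a counter, and TUNNEL_BASE_REFS = 1 stands in for
"the sock reference count needed by udp tunnel", whose real value is not
stated in the cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define MODEL_PORT 4791      /* UDP port named in the cover letter */
#define TUNNEL_BASE_REFS 1   /* ASSUMPTION: refs held by the tunnel itself */

struct model_sock {
	int port;
	int refs;
};

struct model_netns {
	struct model_sock *sk;   /* NULL until the first link is added */
};

/* Model of "rdma link add": reuse the namespace's sock or create it. */
int model_newlink(struct model_netns *ns)
{
	if (ns->sk) {            /* lookup hit: take one more reference */
		ns->sk->refs++;
		return 0;
	}
	ns->sk = malloc(sizeof(*ns->sk));
	if (!ns->sk)
		return -1;
	ns->sk->port = MODEL_PORT;
	ns->sk->refs = TUNNEL_BASE_REFS;
	return 0;
}

/* Model of "rdma link del": drop a reference, close on the last link. */
int model_dellink(struct model_netns *ns)
{
	if (!ns->sk)
		return -1;           /* no link in this namespace */
	assert(ns->sk->refs >= TUNNEL_BASE_REFS);
	if (ns->sk->refs > TUNNEL_BASE_REFS) {
		ns->sk->refs--;      /* other links still use the sock */
		return 0;
	}
	free(ns->sk);            /* last link: "close" the sock */
	ns->sk = NULL;
	return 0;
}

/* True while the namespace still has a listening sock. */
bool model_listening(const struct model_netns *ns)
{
	return ns->sk != NULL;
}
```

With two zero-initialized model_netns values standing in for net0 and
net1, calling model_newlink twice on net0 and once on net1 leaves each
namespace with its own sock. The first model_dellink on net0 only drops
a reference, the second one releases the sock, and net1 is unaffected
throughout, which is the behaviour the "ss -lu" output in step 4 is
meant to demonstrate.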