-----Original Message-----
From: Zhu Yanjun <yanjun.zhu@xxxxxxxxx>
Sent: Friday, June 23, 2023 6:51 PM
To: Bob Pearson <rpearsonhpe@xxxxxxxxx>; Zhu Yanjun <yanjun.zhu@xxxxxxxxx>; zyjzyj2000@xxxxxxxxx; jgg@xxxxxxxx; leon@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; parav@xxxxxxxxxx; lehrer@xxxxxxxxx
Subject: Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace

On 2023/6/23 20:59, Bob Pearson wrote:
> On 6/23/23 02:15, Zhu Yanjun wrote:
>> On 2023/6/22 5:27, Bob Pearson wrote:
>>> On 6/21/23 16:09, Bob Pearson wrote:
>>>> On 5/8/23 02:56, Zhu Yanjun wrote:
>>>>> From: Zhu Yanjun <yanjun.zhu@xxxxxxxxx>
>>>>>
>>>>> When the "rdma link add" command is run to add a rxe rdma link in
>>>>> a net namespace, normally this rxe rdma link can not work in that
>>>>> net namespace.
>>>>>
>>>>> The root cause is that a sock listening on udp port 4791 is
>>>>> created in init_net when the rdma_rxe module is loaded into the
>>>>> kernel. It is difficult for other net namespaces to use this sock.
>>>>>
>>>>> The following commits will solve this problem.
>>>>>
>>>>> The first commit moves the creation of the sock listening on udp
>>>>> port 4791 from the module_init function to the rdma link creating
>>>>> functions. That is, the sock is no longer created when the module
>>>>> rdma_rxe is loaded; it is created when the "rdma link add ..."
>>>>> command is run. So when a rdma link is created in a net namespace,
>>>>> the sock is created in that net namespace.
>>>>>
>>>>> In the second commit, the functions udp4_lib_lookup and
>>>>> udp6_lib_lookup check whether the sock exists in the net
>>>>> namespace. If it does, the rdma link increases the reference count
>>>>> of this sock and continues with its other work instead of creating
>>>>> a new sock to listen on udp port 4791. Since the network notifier
>>>>> is global, this notifier is registered when the module rdma_rxe is
>>>>> loaded.
>>>>>
>>>>> After the rdma link is created, the command "rdma link del"
>>>>> deletes the rdma link, and the sock is checked at the same time.
>>>>> If the reference count of this sock is greater than the reference
>>>>> count needed by the udp tunnel, the sock reference count is
>>>>> decreased by one. If it is equal, this rdma link is the last one,
>>>>> so the udp tunnel is shut down and the sock is closed. This work
>>>>> should be implemented in a dellink function, but currently there
>>>>> is no dellink function in rxe. So the 3rd commit adds a dellink
>>>>> function pointer, and the 4th commit implements the dellink
>>>>> function in rxe.
>>>>>
>>>>> At this point, it is no longer necessary to keep a global variable
>>>>> to store the sock listening on udp port 4791. This global variable
>>>>> can be replaced entirely by the functions udp4_lib_lookup and
>>>>> udp6_lib_lookup. Because the function udp6_lib_lookup is in the
>>>>> fast path, a member variable l_sk6 is added to store the sock. If
>>>>> l_sk6 is NULL, udp6_lib_lookup is called to look up the sock, and
>>>>> the sock is then stored in l_sk6 so that it can be used directly
>>>>> in the future.
>>>>>
>>>>> All the above work has been done in init_net, and it can also work
>>>>> in other net namespaces. So init_net is replaced by the individual
>>>>> net namespace. This is what the 6th commit does. Because a rxe
>>>>> device depends on the net device and on the sock listening on udp
>>>>> port 4791, every rxe device is in exclusive mode in its individual
>>>>> net namespace. Other rdma netns operations will be considered in
>>>>> the future.
>>>>>
>>>>> In the 7th commit, the
>>>>> register_pernet_subsys/unregister_pernet_subsys functions are
>>>>> added. When a new net namespace is created, the init function
>>>>> initializes the sk4 and sk6 socks. The 2 socks are released when
>>>>> the net namespace is destroyed. The functions
>>>>> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 get and set sk4 in the net
>>>>> namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6
>>>>> handle sk6. sk4 and sk6 are then used in the previous commits.
>>>>>
>>>>> As the sk4 and sk6 in the pernet namespace can be accessed, it is
>>>>> not necessary to add a new l_sk6. As such, in the 8th commit, the
>>>>> l_sk6 is replaced with the sk6 in the pernet namespace.
>>>>>
>>>>> Test steps:
>>>>> 1) Suppose that 2 NICs are in 2 different net namespaces.
>>>>>
>>>>> # ip netns exec net0 ip link
>>>>> 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>>>>>     link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>>>>>     altname enp5s0
>>>>>
>>>>> # ip netns exec net1 ip link
>>>>> 4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>>>>>     link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
>>>>>
>>>>> 2) Add rdma link in the different net namespace
>>>>> net0:
>>>>> # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
>>>>>
>>>>> net1:
>>>>> # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>>>>>
>>>>> 3) Run rping test.
>>>>> net0
>>>>> # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>>>>> [1] 1737
>>>>> # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>>>>> verbose
>>>>> count 1
>>>>> ...
>>>>> ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>>>>> ...
>>>>>
>>>>> 4) Remove the rdma links from the net namespaces.
>>>>> net0:
>>>>> # ip netns exec net0 ss -lu
>>>>> State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
>>>>> UNCONN  0       0       0.0.0.0:4791        0.0.0.0:*
>>>>> UNCONN  0       0       [::]:4791           [::]:*
>>>>>
>>>>> # ip netns exec net0 rdma link del rxe0
>>>>>
>>>>> # ip netns exec net0 ss -lu
>>>>> State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
>>>>>
>>>>> net1:
>>>>> # ip netns exec net1 ss -lu
>>>>> State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
>>>>> UNCONN  0       0       0.0.0.0:4791        0.0.0.0:*
>>>>> UNCONN  0       0       [::]:4791           [::]:*
>>>>>
>>>>> # ip netns exec net1 rdma link del rxe1
>>>>>
>>>>> # ip netns exec net1 ss -lu
>>>>> State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
>>>>>
>>>>> V4->V5: Rebase the commits to V6.4-rc1
>>>>>
>>>>> V3->V4: Rebase the commits to rdma-next;
>>>>>
>>>>> V2->V3: 1) Add "rdma link del" example in the cover letter, and
>>>>>            use "ss -lu" to verify rdma link is removed.
>>>>>         2) Add register_pernet_subsys/unregister_pernet_subsys
>>>>>            net namespace
>>>>>         3) Replace l_sk6 with sk6 of pernet_name_space
>>>>>
>>>>> V1->V2: Add the explicit initialization of sk6.
>>>>>
>>>>> Zhu Yanjun (8):
>>>>>   RDMA/rxe: Creating listening sock in newlink function
>>>>>   RDMA/rxe: Support more rdma links in init_net
>>>>>   RDMA/nldev: Add dellink function pointer
>>>>>   RDMA/rxe: Implement dellink in rxe
>>>>>   RDMA/rxe: Replace global variable with sock lookup functions
>>>>>   RDMA/rxe: add the support of net namespace
>>>>>   RDMA/rxe: Add the support of net namespace notifier
>>>>>   RDMA/rxe: Replace l_sk6 with sk6 in net namespace
>>>>>
>>>>>  drivers/infiniband/core/nldev.c     |   6 ++
>>>>>  drivers/infiniband/sw/rxe/Makefile  |   3 +-
>>>>>  drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>>>>>  drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>>>>>  drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>>>>>  drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>>>>>  drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>>>>>  include/rdma/rdma_netlink.h         |   2 +
>>>>>  8 files changed, 279 insertions(+), 40 deletions(-)
>>>>>  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>>>>>  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
>>>>>
>>>> Zhu,
>>>>
>>>> I did some simple experiments on netns functionality.
>>>>
>>>> With your patch set applied and rxe0 created on enp6s0 and rxe1
>>>> created on lo in the default namespace
>>>>
>>>> # sudo ip netns add test
>>>> # ip netns
>>>> test
>>>> # sudo ip netns exec test ip link
>>>> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> # sudo ip netns exec test ip link set dev lo up
>>>> # sudo ip netns exec test ip link
>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>>>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> # sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
>>>>     [rxe doesn't work unless this IPV6 address is set]
>>>> # sudo ip netns exec test ip addr
>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>>>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>     inet 127.0.0.1/8 scope host lo
>>>>        valid_lft forever preferred_lft forever
>>>>     inet6 fe80::200:ff:fe00:0/64 scope link
>>>>        valid_lft forever preferred_lft forever
>>>>     inet6 ::1/128 scope host
>>>>        valid_lft forever preferred_lft forever
>>>> # sudo ip netns exec test ls /sys/class/infiniband
>>>> rxe0 rxe1
>>>>     [These show up even though the ndevs do *not* belong to
>>>>     the test namespace! Probably OK.]
>>>> # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>> # ls /sys/class/infiniband
>>>> rxe0 rxe1 rxe2
>>>>     [The new rxe device shows up in the default namespace. At
>>>>     least we're consistent.]
>>>> # ib_send_bw -d rxe0 ... 192.168.0.27
>>>>     [Works. Didn't break the existing rxe devices. Expected]
>>>> # ib_send_bw -d rxe1 ... 127.0.0.1
>>>>     [Works. Expected]
>>>> # ib_send_bw -d rxe2 ... 127.0.0.1
>>>> IB device rxe2 not found
>>>> Unable to find the Infiniband/RoCE device
>>>>     [Doesn't work. Expected.]
>>>> # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>> IB device rxe2 not found
>>>> Unable to find the Infiniband/RoCE device
>>>>     [Also doesn't work. Turns out rxe2 device is gone after
>>>>     failure. Not expected.]
>>>> # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>> # ls /sys/class/infiniband
>>>> rxe0 rxe1 rxe2
>>>>     [Good. It's back]
>>>> # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>>     [Works in test namespace! Expected.]
>>>> # sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
>>>>     [Also works. Definitely not expected.]
>>>>
>>>> My take, it sort of works. But there are some serious issues. You
>>>> shouldn't be able to use the rxe2 device in the default namespace.
>>>> It would be nice if you couldn't see the rxe devices in each
>>>> other's namespaces (like ip link or ip addr hide other namespaces'
>>>> devices).
>>>>
>>>> Bob
>>> Forgot to mention. It also is definitely not good that a process in
>>> the default namespace can destroy a rxe device in the test
>>> namespace by trying to use it.
>> Thanks a lot.
>>
>> I am not sure if it is correct or not to destroy a rxe device
>> outside this net namespace.
>>
>> Because with irdma/mlx5 rdma devices, we can also destroy them with
>> the command "modprobe -v irdma/mlx5..." outside of the net namespace.
>>
>> I am not sure if this is correct or not.
>>
>> Zhu Yanjun
>>
>>> Bob
> I didn't intentionally destroy rxe2. I just tried to access the rxe
> device but it failed.
> The rxe device was destroyed as a side effect of failing to open it.

The GID of rxe can not be generated with lo. This is a problem. Chuck
Lever <cel@xxxxxxxxxx> will fix it.
I am not sure whether the problem you ran into is related to this.

Please test again with a physical NIC.

Thanks a lot.
Zhu Yanjun

>
> Bob

That was why I added the IPV6 address by hand. That created the gid
table entry. This is also a problem for all ethernet devices on distros
that mangle the MAC address when creating the IPV6 address as a
security measure.
These include Ubuntu, which I use. So I always have to add an IPV6
address based on the MAC address for any ethernet device.

Bob
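
For reference, the MAC-based address added by hand in the experiment (fe80::0200:00ff:fe00:0000 for the all-zero loopback MAC) matches the modified EUI-64 rule: flip the universal/local bit of the first octet and insert ff:fe between the two halves of the MAC. Below is a minimal Python sketch of that derivation, assuming this rule is what "based on the MAC address" refers to. The helper name mac_to_link_local is hypothetical and is not part of the patch set or of any rdma tool.

```python
import ipaddress


def mac_to_link_local(mac: str) -> ipaddress.IPv6Address:
    """Build the fe80::/64 modified EUI-64 address for a MAC (hypothetical helper)."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    # Flip the universal/local bit and insert ff:fe between the two halves.
    eui64 = [octets[0] ^ 0x02, *octets[1:3], 0xFF, 0xFE, *octets[3:]]
    # Link-local prefix fe80::/64 followed by the 64-bit interface identifier.
    return ipaddress.IPv6Address(bytes([0xFE, 0x80, 0, 0, 0, 0, 0, 0, *eui64]))


if __name__ == "__main__":
    # The loopback MAC from the test namespace in the experiment.
    print(mac_to_link_local("00:00:00:00:00:00"))
```

The resulting address can then be added with "ip addr add dev <netdev> <address>/64", as was done for lo in the test namespace.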