RE: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace

-----Original Message-----
From: Zhu Yanjun <yanjun.zhu@xxxxxxxxx> 
Sent: Friday, June 23, 2023 6:51 PM
To: Bob Pearson <rpearsonhpe@xxxxxxxxx>; Zhu Yanjun <yanjun.zhu@xxxxxxxxx>; zyjzyj2000@xxxxxxxxx; jgg@xxxxxxxx; leon@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; parav@xxxxxxxxxx; lehrer@xxxxxxxxx
Subject: Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace


On 2023/6/23 20:59, Bob Pearson wrote:
> On 6/23/23 02:15, Zhu Yanjun wrote:
>> On 2023/6/22 5:27, Bob Pearson wrote:
>>> On 6/21/23 16:09, Bob Pearson wrote:
>>>> On 5/8/23 02:56, Zhu Yanjun wrote:
>>>>> From: Zhu Yanjun <yanjun.zhu@xxxxxxxxx>
>>>>>
>>>>> When the "ip link add" command is run to add an rxe rdma link in a
>>>>> net namespace, this rxe rdma link normally cannot work in that net
>>>>> namespace.
>>>>>
>>>>> The root cause is that a sock listening on udp port 4791 is
>>>>> created in init_net when the rdma_rxe module is loaded into the
>>>>> kernel. It is difficult for other net namespaces to use this sock.
>>>>>
>>>>> The following commits solve this problem.
>>>>>
>>>>> The first commit moves the creation of the sock listening on udp
>>>>> port 4791 from the module_init function to the rdma link creation
>>>>> functions. That is, the sock is no longer created when the rdma_rxe
>>>>> module is loaded. Instead, it is created when the "rdma link add ..."
>>>>> command is run. So when an rdma link is created in a net namespace,
>>>>> the sock is created in that net namespace.
>>>>>
>>>>> In the second commit, the functions udp4_lib_lookup and
>>>>> udp6_lib_lookup check whether the sock already exists in the net
>>>>> namespace. If it does, the rdma link increases the reference count
>>>>> of this sock and continues, instead of creating a new sock to
>>>>> listen on udp port 4791. Since the network notifier is global,
>>>>> it is registered when the rdma_rxe module is loaded.
>>>>>
>>>>> After the rdma link is created, the command "rdma link del" deletes
>>>>> the rdma link and checks the sock at the same time. If the
>>>>> reference count of this sock is greater than the sock reference
>>>>> count needed by the udp tunnel, the sock reference count is
>>>>> decreased by one. If it is equal, this rdma link is the last one,
>>>>> so the udp tunnel is shut down and the sock is closed.
>>>>> This work should be implemented in a dellink function, but rxe
>>>>> currently has none. So the 3rd commit adds a dellink function
>>>>> pointer, and the 4th commit implements the dellink function in rxe.
>>>>>
>>>>> At this point, it is no longer necessary to keep a global variable
>>>>> to store the sock listening on udp port 4791. This global variable
>>>>> can be replaced entirely by the functions udp4_lib_lookup and
>>>>> udp6_lib_lookup. Because the function udp6_lib_lookup is in the fast
>>>>> path, a member variable l_sk6 is added to store the sock. If l_sk6
>>>>> is NULL, udp6_lib_lookup is called to look up the sock, and the sock
>>>>> is stored in l_sk6 so that it can be used directly afterwards.
>>>>>
>>>>> All the above work has been done in init_net, and it can also work
>>>>> in other net namespaces. So init_net is replaced by the individual
>>>>> net namespace. This is what the 6th commit does.
>>>>> Because an rxe device depends on the net device and the sock
>>>>> listening on udp port 4791, every rxe device is in exclusive mode
>>>>> in its individual net namespace.
>>>>> Other rdma netns operations will be considered in the future.
>>>>>
>>>>> In the 7th commit, the
>>>>> register_pernet_subsys/unregister_pernet_subsys functions are added.
>>>>> When a new net namespace is created, the init function initializes
>>>>> the sk4 and sk6 socks. The 2 socks are released when the net
>>>>> namespace is destroyed. The functions
>>>>> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 get and set sk4 in the net
>>>>> namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6
>>>>> handle sk6. sk4 and sk6 are then used in the previous commits.
>>>>>
>>>>> As the sk4 and sk6 in pernet namespace can be accessed, it is not 
>>>>> necessary to add a new l_sk6. As such, in the 8th commit, the 
>>>>> l_sk6 is replaced with the sk6 in pernet namespace.
>>>>>
>>>>> Test steps:
>>>>> 1) Suppose that 2 NICs are in 2 different net namespaces.
>>>>>
>>>>>     # ip netns exec net0 ip link
>>>>>     3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>>>>>        link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>>>>>        altname enp5s0
>>>>>
>>>>>     # ip netns exec net1 ip link
>>>>>     4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>>>>>        link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
>>>>>
>>>>> 2) Add an rdma link in each net namespace.
>>>>>       net0:
>>>>>       # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
>>>>>
>>>>>       net1:
>>>>>       # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>>>>>
>>>>> 3) Run rping test.
>>>>>       net0
>>>>>       # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
>>>>>       [1] 1737
>>>>>       # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
>>>>>       verbose
>>>>>       count 1
>>>>>       ...
>>>>>       ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
>>>>>       ...
>>>>>
>>>>> 4) Remove the rdma links from the net namespaces.
>>>>>       net0:
>>>>>       # ip netns exec net0 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>       UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>>>       UNCONN    0         0         [::]:4791             [::]:*
>>>>>
>>>>>       # ip netns exec net0 rdma link del rxe0
>>>>>
>>>>>       # ip netns exec net0 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>
>>>>>       net1:
>>>>>       # ip netns exec net1 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>       UNCONN    0         0         0.0.0.0:4791          0.0.0.0:*
>>>>>       UNCONN    0         0         [::]:4791             [::]:*
>>>>>
>>>>>       # ip netns exec net1 rdma link del rxe1
>>>>>
>>>>>       # ip netns exec net1 ss -lu
>>>>>       State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port    Process
>>>>>
>>>>> V4->V5: Rebase the commits to V6.4-rc1
>>>>>
>>>>> V3->V4: Rebase the commits to rdma-next;
>>>>>
>>>>> V2->V3: 1) Add "rdma link del" example in the cover letter, and
>>>>>            use "ss -lu" to verify rdma link is removed.
>>>>>         2) Add register_pernet_subsys/unregister_pernet_subsys
>>>>>            net namespace
>>>>>         3) Replace l_sk6 with sk6 of pernet_name_space
>>>>>
>>>>> V1->V2: Add the explicit initialization of sk6.
>>>>>
>>>>> Zhu Yanjun (8):
>>>>>     RDMA/rxe: Creating listening sock in newlink function
>>>>>     RDMA/rxe: Support more rdma links in init_net
>>>>>     RDMA/nldev: Add dellink function pointer
>>>>>     RDMA/rxe: Implement dellink in rxe
>>>>>     RDMA/rxe: Replace global variable with sock lookup functions
>>>>>     RDMA/rxe: add the support of net namespace
>>>>>     RDMA/rxe: Add the support of net namespace notifier
>>>>>     RDMA/rxe: Replace l_sk6 with sk6 in net namespace
>>>>>
>>>>>    drivers/infiniband/core/nldev.c     |   6 ++
>>>>>    drivers/infiniband/sw/rxe/Makefile  |   3 +-
>>>>>    drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>>>>>    drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>>>>>    drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>>>>>    drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>>>>>    drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>>>>>    include/rdma/rdma_netlink.h         |   2 +
>>>>>    8 files changed, 279 insertions(+), 40 deletions(-)
>>>>>    create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>>>>>    create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
>>>>>
>>>>>
>>>> Zhu,
>>>>
>>>> I did some simple experiments on netns functionality.
>>>>
>>>> With your patch set applied and rxe0 created on enp6s0 and rxe1 
>>>> created on lo in the default namespace
>>>>
>>>>      # sudo ip netns add test
>>>>      # ip netns
>>>>      test
>>>>      # sudo ip netns exec test ip link
>>>>      1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>>>>          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>      # sudo ip netns exec test ip link set dev lo up
>>>>      # sudo ip netns exec test ip link
>>>>      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>>>>          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>      # sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
>>>>          [rxe doesn't work unless this IPV6 address is set]
>>>>      # sudo ip netns exec test ip addr
>>>>      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>>>>          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>          inet 127.0.0.1/8 scope host lo
>>>>             valid_lft forever preferred_lft forever
>>>>          inet6 fe80::200:ff:fe00:0/64 scope link
>>>>             valid_lft forever preferred_lft forever
>>>>          inet6 ::1/128 scope host
>>>>             valid_lft forever preferred_lft forever
>>>>      # sudo ip netns exec test ls /sys/class/infiniband
>>>>      rxe0  rxe1
>>>>          [These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
>>>>      # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>>      # ls /sys/class/infiniband
>>>>      rxe0  rxe1  rxe2
>>>>          [The new rxe device shows up in the default namespace. At least we're consistent.]
>>>>      # ib_send_bw -d rxe0 ... 192.168.0.27
>>>>          [Works. Didn't break the existing rxe devices. Expected]
>>>>      # ib_send_bw -d rxe1 ... 127.0.0.1
>>>>          [Works. Expected]
>>>>      # ib_send_bw -d rxe2 ... 127.0.0.1
>>>>      IB device rxe2 not found
>>>>         Unable to find the Infiniband/RoCE device
>>>>          [Doesn't work. Expected.]
>>>>      # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>>      IB device rxe2 not found
>>>>       Unable to find the Infiniband/RoCE device
>>>>          [Also doesn't work. Turns out rxe2 device is gone after failure. Not expected.]
>>>>      # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
>>>>      # ls /sys/class/infiniband
>>>>      rxe0  rxe1  rxe2
>>>>          [Good. It's back]
>>>>      # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
>>>>          [Works in test namespace! Expected.]
>>>>      # sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
>>>>          [Also works. Definitely not expected.]
>>>>
>>>> My take, it sort of works. But there are some serious issues. You
>>>> shouldn't be able to use the rxe2 device in the default namespace.
>>>> It would be nice if you couldn't see the rxe devices in each other's
>>>> namespaces (like ip link or ip addr hide other namespaces' devices).
>>>>
>>>> Bob
>>> Forgot to mention. It also is definitely not good that a process in 
>>> the default namespace can destroy a rxe device in the test namespace by trying to use it.
>> Thanks a lot.
>>
>> I am not sure whether it is correct to destroy a rxe device from outside its net namespace.
>>
>> Because for irdma/mlx5 rdma devices, we can also destroy them with the command "modprobe -v irdma/mlx5..." outside of the net namespace.
>>
>> I am not sure if this is correct or not.
>>
>> Zhu Yanjun
>>
>>> Bob
> I didn't intentionally destroy lo2. I just tried to access the rxe device but it failed.
> The rxe device was destroyed as a side effect of failing to open it.

The GID of rxe cannot be generated with lo. This is a problem. Chuck Lever <cel@xxxxxxxxxx> is going to fix it.

I am not sure whether the problem you hit is related to this. Please run the tests again with a physical NIC.

Thanks a lot.

Zhu Yanjun

>
> Bob

That was why I added the IPV6 address by hand. That created the gid table entry. This is
also a problem for all ethernet devices on distros that mangle the MAC address when creating the
IPV6 address as a security measure. These include Ubuntu, which I use. So I always have to add
an IPV6 address based on the MAC address for any ethernet device.
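
For reference, here is a minimal sketch of how a MAC-based link-local address can be derived. It assumes the modified EUI-64 scheme from RFC 4291 Appendix A, which is consistent with the fe80::0200:00ff:fe00:0000 address added to lo above (lo has an all-zero MAC). The function name mac_to_link_local is hypothetical and is not part of rxe or iproute2.

```python
import ipaddress


def mac_to_link_local(mac: str) -> ipaddress.IPv6Address:
    """Derive an fe80::/64 address from a MAC (assumes modified EUI-64)."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets in MAC address, got {len(octets)}")
    # Flip the universal/local bit of the first octet and insert ff:fe
    # between the two halves of the MAC to form the 64-bit interface ID.
    interface_id = (
        bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:6]
    )
    # Link-local prefix fe80::/64: fe80 followed by 48 zero bits.
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)


if __name__ == "__main__":
    # All-zero MAC of lo, as in the test above.
    print(mac_to_link_local("00:00:00:00:00:00"))
```

The printed address could then be added by hand in the same way as in the test above, i.e. with "ip addr add dev <netdev> <address>/64".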

Bob



