Re: Is anyone working on implementing LAG in ib core?

Hi Jason, Weihang,

On 2/22/2020 5:40 PM, Jason Gunthorpe wrote:
> On Sat, Feb 22, 2020 at 11:48:04AM +0800, Weihang Li wrote:
>> Hi all,
>>
>> We plan to implement LAG in the hns driver soon, and as we know, there is
>> already a mature and stable solution in the mlx5 driver. Considering that
>> the net subsystem in the kernel implements bonding in the framework, we
>> think it's reasonable to add the LAG feature to the ib core based on
>> mlx5's implementation, so that all drivers, including hns and mlx5, can
>> benefit from it.
>>
>> In previous discussions with Leon about reporting ib port link events in
>> the ib core, Leon mentioned that someone might already be trying to do
>> this.
>>
>> So may I ask if anyone is working on LAG in the ib core or planning to
>> implement it in the near future? I would appreciate it if you could share
>> your progress with me, and maybe we can finish it together.
>>
>> If nobody is working on this, our team may try to implement LAG in the
>> core. Any comments and suggestions are welcome.
> 
> This is something that needs to be done. I understand several of the
> other drivers are going to want to use LAG, and we certainly can't have
> everything copied into each driver.
> 
> Jason
> 
I am not sure mlx5 is the right model for new rdma bond device support,
which I have tried to highlight in Q&A 1 below.

Below is a rough, not fully refined proposal for an rdma bond device.

- Create an rdma bond device named rbond0 from two slave rdma devices,
mlx5_0 and mlx5_1, attached to netdevice bond1 and using the underlying
DMA device of the mlx5_0 rdma device.

$ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave
mlx5_1 dmadevice mlx5_0

$ rdma dev show
0: mlx5_0: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4660
sys_image_guid 248a:0703:0055:4660
1: mlx5_1: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4661
sys_image_guid 248a:0703:0055:4660
2: rbond0: node_type ca node_guid 248a:0703:0055:4660 sys_image_guid
248a:0703:0055:4660

- This should be done via an rdma bond driver in
drivers/infiniband/ulp/rdma_bond.

A few obvious questions arise from the above proposal:
1. Why can't we do the trick of removing two or more rdma devices and
creating one device when the bond0 netdevice is created, as mlx5 does
today?
Ans:
(a) Because it leads to complex code in each vendor driver to handle
netdev events under the rtnl lock.
Given that the GID table update needs to hold the rtnl lock for a short
duration in ib_register_device(), work needs to be deferred to a work
queue and synchronized.
(b) The user cannot predict when, or after how long, this new rdma bond
device will be created automatically.
(c) What if some failure occurs? Should the user parse /var/log/messages
to figure out the error? What steps roll back and retry?
(d) What if the driver internally attempts a retry?
... and more.

2. Why do we need to give a netdevice in the above proposal?
Ans:
Because for a RoCE device you want to build the right GID table from its
matching netdevice. No guesswork.

3. But with that there will be multiple devices (rbond0, mlx5_0) with the
same GID table entries, and that will confuse the user.
What do we do about it?
Ans:
No, that won't happen, because this bond driver accepts slave rdma devices,
and the bond driver will request the IB core to disable the GID tables of
the slave rdma devices.
Alternatively, we can have commands to disable/enable specific GID types
of the slave rdma devices, which the user can invoke before creating the
rdma bond device.

Such as:
$ rdma link mlx5_0 disable rocev1
$ rdma link mlx5_0 enable rocev2

This way it is easy to compose, and it addresses a wider use case where
RoCEv1 GID table entries can be disabled to make efficient use of the GID
table.
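
For example, with the proposed (hypothetical) syntax above, a user could
disable the unused GID type on both slaves and then create the bond
device:

$ rdma link mlx5_0 disable rocev1
$ rdma link mlx5_1 disable rocev1
$ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave
mlx5_1 dmadevice mlx5_0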

4. But if we are going to disable their GID tables, why do we even need
those slave RDMA devices?
Ans:
Because when you delete the bond rdma device, you can revert back to
those mlx5_0/1 devices.
Delete follows the mirror of add; see the sketch below.
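
For instance, the mirror of the add command might look like this (again
hypothetical syntax, mirroring the proposed add above):

$ rdma dev del rbond0

After this, mlx5_0 and mlx5_1 would get their GID tables re-enabled and
become usable again.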

5. Why do we need to give a DMA device?
Ans:
Because we want to avoid guesswork in the rdma_bond driver about which
DMA device to use.
The user knows how they want to use it based on the system configuration
(IRQs etc.).
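
For example, assuming the slave rdma device's parent is a PCI device, a
user might check its NUMA locality and IRQ layout before picking which
slave to use as the DMA device:

$ cat /sys/class/infiniband/mlx5_0/device/numa_node
$ ls /sys/class/infiniband/mlx5_0/device/msi_irqs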

6. What happens if a slave PCI device is hot removed?
Ans:
If the slave DMA device is removed, it disassociates the ucontexts and
rbond0 becomes unusable for applications.

7. How is failover done?
Ans: Since the bonded netdevice is provided, its failover settings are
inherited by the rbond0 rdma device and passed on to its slave devices;
see the example below.
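
For example, the failover behavior would simply follow however bond1 was
configured on the netdev side with standard iproute2 commands (the slave
interface names below are placeholders):

$ ip link add bond1 type bond mode active-backup miimon 100
$ ip link set ens1f0 down
$ ip link set ens1f0 master bond1
$ ip link set ens1f1 down
$ ip link set ens1f1 master bond1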



