Re: A question about FAILOVER event in RoCE LAG

On 03/03/2023 4:36, Junxian Huang wrote:
> 
> 
> On 2023/2/26 21:57, Mark Bloch wrote:
>>
>>
>> On 26/02/2023 14:53, Leon Romanovsky wrote:
>>> +Mark
>>>
>>> On Fri, Feb 24, 2023 at 11:14:47AM +0000, huangjunxian (C) wrote:
>>>> Hi folks!
>>>>
>>>> We've been working on LAG in the hns RoCE driver, and we noticed that when a FAILOVER event
>>>> occurs in active-backup mode, all GIDs of the RDMA bond device are deleted and new GIDs
>>>> are added, triggered by the event handler listed below.
>>>>
>>>> So, when a FAILOVER event occurs on an RDMA bond device with running traffic, does it make
>>>> sense that the traffic is terminated since its GIDs are deleted?
>>
>> Yep, please read the original commit message:
>>
>> commit 238fdf48f2b54a01cedb5774c3a1e81c94e1a3a0
>> Author: Matan Barak <matanb@xxxxxxxxxxxx>
>> Date:   Thu Jul 30 18:33:27 2015 +0300
>>
>>     IB/core: Add RoCE table bonding support
>>
>>     Handling bonding and other devices requires us to add all GIDs of the
>>     net-devices which are upper-devices of the RoCE port related
>>     net-device.
>>
>>     Active-backup configurations impose even more challenges as the
>>     default GID should only be set on the active devices (this is
>>     necessary as otherwise the same MAC could be used for several
>>     slaves and thus several slaves would have identical GIDs).
>>
>>     Managing these configurations is done by listening to:
>>     (a) NETDEV_CHANGEUPPER event
>>             (1) if a related net-device is linked, delete all inactive
>>                 slaves default GIDs and add the upper device GIDs.
>>             (2) if a related net-device is unlinked, delete all upper GIDs
>>                 and add the default GIDs.
>>     (b) NETDEV_BONDING_FAILOVER:
>>             (1) delete the bond GIDs from inactive slaves
>>             (2) delete the inactive slave's default GIDs
>>             (3) Add the bond GIDs to the active slave.
>>
>>     Signed-off-by: Matan Barak <matanb@xxxxxxxxxxxx>
>>     Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>
>>
>> and please read: https://wiki.linuxfoundation.org/networking/bonding
>> especially the section that explains some of the restrictions of
>> active-backup mode.
>>
>> Mark
>>
>>>>
>>>> The FAILOVER event handler mentioned above:
>>>> static int netdevice_event(struct notifier_block *this, unsigned long event, void *ptr)
>>>> {
>>>>          ......
>>>>          static const struct netdev_event_work_cmd bonding_event_ips_del_cmd = {
>>>>                   .cb = del_netdev_upper_ips, .filter = upper_device_filter};
>>>>          ......
>>>>          switch (event) {
>>>>          ......
>>>>          case NETDEV_BONDING_FAILOVER:
>>>>                   /* Delete IP based GIDs of the bond device */
>>>>                   cmds[0] = bonding_event_ips_del_cmd;
>>>>                   /* Add default GIDs of the bond device */
>>>>                   cmds[1] = bonding_default_add_cmd;
>>>>                   /* Add IP based GIDs of the bond device */
>>>>                   cmds[2] = add_cmd_upper_ips;
>>>>                   break;
>>>>          ......
>>>>          }
>>>>          ......
>>>> }
>>>>
>>>> Thanks,
>>>> Junxian
> 
> Thanks for replying. I'd like to explain my question more specifically.
> 
> When a GID is deleted, the GID index is not reusable until all references
> are dropped, according to this commit message:
> 
> commit be5914c124bc3179536e5c4598f59aeb4b880517
> Author: Parav Pandit <parav@xxxxxxxxxxxx>
> Date:   Tue Dec 18 14:16:00 2018 +0200
> 
>     RDMA/core: Delete RoCE GID in hw when corresponding IP is deleted
> 
>     Currently a RoCE GID entry is removed from the hardware when all
>     references to the GID entry drop to zero. This is a change in behavior
>     from before the fixed patch. The GID entry should be removed from the
>     hardware when GID entry deletion is requested. This allows the driver
>     to terminate ongoing traffic through the RoCE GID.
> 
>     While a GID is deleted from the hardware, the GID slot in the software
>     GID cache is not freed. The GID slot is freed once all references to
>     such a GID are dropped. This continues to ensure that such a hardware
>     GID slot is not allocated to a new GID entry allocation request. It is
>     allocated again once all references to the GID entry drop.
> 
>     This approach allows drivers to put a tombstone of some kind on the HW
>     GID index to block the traffic.
> 
>     Fixes: b150c3862d21 ("IB/core: Introduce GID entry reference counts")
>     Signed-off-by: Parav Pandit <parav@xxxxxxxxxxxx>
>     Reviewed-by: Mark Bloch <markb@xxxxxxxxxxxx>
>     Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
>     Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
> 
> Assume that a RoCE bond device is running RoCE traffic with a GID at index 3.
> 
> When a FAILOVER event occurs, the original GIDs at indexes 0-3 are deleted
> and new GIDs are added. As the QP is still holding a reference to index 3,
> the new GIDs are not added at indexes 0-3 but at indexes 0-2 and index 4.
> 
> Due to the deletion of the original GID at index 3, the RoCE traffic is
> terminated. Although a backup slave becomes active, the traffic never
> recovers, as the new GID is not added at index 3 and there is no longer a
> valid GID at index 3. This means the active-backup mode in RoCE LAG cannot
> maintain ongoing RoCE traffic when a FAILOVER event happens.
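> 
> To make this concrete, below is a minimal userspace sketch of the cache
> behavior described in the commit message above. It is hypothetical code
> for illustration only (the structure and function names are made up, not
> the kernel's real GID table implementation):
> 
> /*
>  * Deleting an entry clears it in "hardware" immediately, but the slot
>  * stays reserved until its last reference drops, so a new GID cannot
>  * land at that index while a QP still holds it.
>  */
> #include <stdio.h>
> 
> #define GID_TABLE_SIZE 8
> 
> struct gid_slot {
> 	char gid[32];  /* textual GID, for illustration only */
> 	int valid;     /* entry currently programmed in "hardware" */
> 	int refcount;  /* references held by QPs etc. */
> };
> 
> static struct gid_slot table[GID_TABLE_SIZE];
> 
> /* Allocate the first slot that is both invalid and unreferenced. */
> static int gid_add(const char *gid)
> {
> 	int i;
> 
> 	for (i = 0; i < GID_TABLE_SIZE; i++) {
> 		if (!table[i].valid && table[i].refcount == 0) {
> 			snprintf(table[i].gid, sizeof(table[i].gid), "%s", gid);
> 			table[i].valid = 1;
> 			return i;
> 		}
> 	}
> 	return -1;
> }
> 
> /* Clear the "hardware" entry at once; the slot itself stays reserved
>  * (tombstoned) for as long as refcount > 0. */
> static void gid_del(int i)
> {
> 	table[i].valid = 0;
> }
> 
> int main(void)
> {
> 	char gid[32];
> 	int i;
> 
> 	/* Bond GIDs occupy indexes 0-3; a QP holds a reference on 3. */
> 	for (i = 0; i < 4; i++) {
> 		snprintf(gid, sizeof(gid), "old-gid-%d", i);
> 		gid_add(gid);
> 	}
> 	table[3].refcount++;
> 
> 	/* FAILOVER: all bond GIDs are deleted, then re-added. */
> 	for (i = 0; i < 4; i++)
> 		gid_del(i);
> 	for (i = 0; i < 4; i++) {
> 		snprintf(gid, sizeof(gid), "new-gid-%d", i);
> 		printf("new GID %d landed at index %d\n", i, gid_add(gid));
> 	}
> 	/* Prints indexes 0, 1, 2 and 4: index 3 is skipped while the QP
> 	 * still references it, so the QP keeps pointing at a deleted GID. */
> 	return 0;
> }
> 
> Compiling and running it prints the new GIDs landing at indexes 0, 1, 2
> and 4; index 3 is skipped for as long as the QP's reference is held.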

If you want fault tolerance and want to keep the traffic flowing, you should use LACP (802.3ad mode).

> 
> Are the phenomenon and conclusion here correct? This seems odd to me because,
> as far as I know, there is no such restriction on active-backup mode in
> NIC bonding.

You are talking about netdev bonding, I assume. With netdev traffic you can
control which netdev is used, so traffic won't egress from multiple physical
ports (this is what the bond is for). With RDMA we can't control that, as the
QP is used by a user (kernel / userspace); if the traffic weren't stopped,
the switch would see the same source MAC address from two different ports.

Of course there can be other solutions; for example, you could make sure only
one physical port is used for egress (with steering, or by blocking traffic
going via the inactive slave port). If your hardware can do something like
that, it should be possible to support it.

Mark

> 
> Junxian
> 