Re: A question about FAILOVER event in RoCE LAG

On 2023/2/26 21:57, Mark Bloch wrote:
> 
> 
> On 26/02/2023 14:53, Leon Romanovsky wrote:
>> +Mark
>>
>> On Fri, Feb 24, 2023 at 11:14:47AM +0000, huangjunxian (C) wrote:
>>> Hi folks!
>>>
>>> We've been working on LAG in hns RoCE driver, and we notice that when a FAILOVER event
>>> occurs in active-backup mode, all GIDs of the RDMA bond device are deleted and new GIDs
>>> are added, triggered by the event handler listed below.
>>>
>>> So, when a FAILOVER event occurs on an RDMA bond device with running traffic, does it make
>>> sense that the traffic is terminated since its GIDs are deleted?
> 
> Yep, please read the original commit message:
> 
> commit 238fdf48f2b54a01cedb5774c3a1e81c94e1a3a0
> Author: Matan Barak <matanb@xxxxxxxxxxxx>
> Date:   Thu Jul 30 18:33:27 2015 +0300
> 
>     IB/core: Add RoCE table bonding support
> 
>     Handling bonding and other devices requires us to add all GIDs of the
>     net-devices which are upper-devices of the RoCE port related
>     net-device.
> 
>     Active-backup configurations impose even more challenges as the
>     default GID should only be set on the active devices (this is
>     necessary as otherwise the same MAC could be used for several
>     slaves and thus several slaves will have identical GIDs).
> 
>     Managing these configurations is done by listening to:
>     (a) NETDEV_CHANGEUPPER event
>             (1) if a related net-device is linked, delete all inactive
>                 slaves' default GIDs and add the upper device GIDs.
>             (2) if a related net-device is unlinked, delete all upper GIDs
>                 and add the default GIDs.
>     (b) NETDEV_BONDING_FAILOVER:
>             (1) delete the bond GIDs from inactive slaves
>             (2) delete the inactive slave's default GIDs
>             (3) add the bond GIDs to the active slave.
> 
>     Signed-off-by: Matan Barak <matanb@xxxxxxxxxxxx>
>     Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>
> 
> and please read: https://wiki.linuxfoundation.org/networking/bonding
> especially the section that explains some of the restrictions of
> active-backup mode.
> 
> Mark
> 
>>>
>>> The FAILOVER event handler mentioned above:
>>> static int netdevice_event(struct notifier_block *this, unsigned long event, void *ptr)
>>> {
>>>          ......
>>>          static const struct netdev_event_work_cmd bonding_event_ips_del_cmd = {
>>>                   .cb = del_netdev_upper_ips, .filter = upper_device_filter};
>>>          ......
>>>          switch (event) {
>>>          ......
>>>          case NETDEV_BONDING_FAILOVER:
>>>                   cmds[0] = bonding_event_ips_del_cmd;
>>>                   /* Add default GIDs of the bond device */
>>>                   cmds[1] = bonding_default_add_cmd;
>>>                   /* Add IP based GIDs of the bond device */
>>>                   cmds[2] = add_cmd_upper_ips;
>>>                   break;
>>>          ......
>>>          }
>>>          ......
>>> }
>>>
>>> Thanks,
>>> Junxian

Thanks for replying. I'd like to explain my question more specifically.

When a GID is deleted, the GID index is not reusable until all references
are dropped, according to this commit message:

commit be5914c124bc3179536e5c4598f59aeb4b880517
Author: Parav Pandit <parav@xxxxxxxxxxxx>
Date:   Tue Dec 18 14:16:00 2018 +0200

    RDMA/core: Delete RoCE GID in hw when corresponding IP is deleted

    Currently a RoCE GID entry is removed from the hardware when all
    references to the GID entry drop to zero. This is a change in behavior
    from before the fixed patch. The GID entry should be removed from the
    hardware when GID entry deletion is requested. This allows the driver
    to terminate ongoing traffic through the RoCE GID.

    While a GID is deleted from the hardware, the GID slot in the software
    GID cache is not freed. The GID slot is freed once all references to
    such a GID are dropped. This continues to ensure that such a hardware
    GID slot is not allocated to a new GID entry allocation request. It is
    allocated once all references to the GID entry drop.

    This approach allows drivers to put a tombstone of some kind on the HW
    GID index to block the traffic.

    Fixes: b150c3862d21 ("IB/core: Introduce GID entry reference counts")
    Signed-off-by: Parav Pandit <parav@xxxxxxxxxxxx>
    Reviewed-by: Mark Bloch <markb@xxxxxxxxxxxx>
    Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
    Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
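
As I understand it, an in-kernel consumer pins a GID entry roughly like
this (a hedged sketch of mine based on the rdma_find_gid() /
rdma_put_gid_attr() API in include/rdma/ib_cache.h; it is not code from
either patch):

#include <rdma/ib_cache.h>

/*
 * A consumer (e.g. a QP's modify path) looks up a RoCE v2 GID and keeps a
 * reference to the returned attribute. While the reference is held, the
 * cache keeps attr->index reserved even after the GID is deleted from HW,
 * which is the behavior the commit above describes.
 */
static int pin_gid_example(struct ib_device *device, const union ib_gid *gid,
                           struct net_device *ndev)
{
        const struct ib_gid_attr *attr;

        attr = rdma_find_gid(device, gid, IB_GID_TYPE_ROCE_UDP_ENCAP, ndev);
        if (IS_ERR(attr))
                return PTR_ERR(attr);

        /* Use attr->index here; the slot stays reserved until... */
        rdma_put_gid_attr(attr);        /* ...this reference is dropped. */
        return 0;
}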

Assume that a RoCE bond device is running RoCE traffic with the GID at index 3.

When a FAILOVER event occurs, the original GIDs at indexes 0-3 are deleted
and new GIDs are added. Since the QP still holds a reference to index 3,
the new GIDs are not added at indexes 0-3 but at indexes 0-2 and index 4.

Because the original GID at index 3 has been deleted, the RoCE traffic is
terminated. Although a backup slave becomes active, the traffic never
recovers, since no new GID is added at index 3 and there is no valid GID
there anymore. This means the active-backup mode in RoCE LAG cannot
maintain ongoing RoCE traffic across a FAILOVER event.
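
To make this concrete, below is a minimal userspace model of the cache
semantics described above (my own sketch; the names and table layout are
made up, and this is not the kernel implementation):

#include <stdio.h>

#define TABLE_LEN 8

struct slot {
        int valid;      /* entry currently programmed in HW */
        int reserved;   /* slot still owned by an entry in the SW cache */
        int refcnt;     /* references held by consumers such as QPs */
        int gid;        /* toy GID value, 0 = none */
};

static struct slot table[TABLE_LEN];

/* Add a GID to the first slot that is not reserved. */
static int add_gid(int gid)
{
        for (int i = 0; i < TABLE_LEN; i++) {
                if (!table[i].reserved) {
                        table[i] = (struct slot){ .valid = 1, .reserved = 1,
                                                  .gid = gid };
                        return i;
                }
        }
        return -1;
}

/* Delete: drop from HW immediately, free the slot only if unreferenced. */
static void del_gid(int i)
{
        table[i].valid = 0;             /* tombstone in HW */
        if (table[i].refcnt == 0) {
                table[i].reserved = 0;
                table[i].gid = 0;
        }
}

static void get_ref(int i) { table[i].refcnt++; }

/* Last reference dropped: a deleted slot finally becomes reusable. */
static void put_ref(int i)
{
        if (--table[i].refcnt == 0 && !table[i].valid) {
                table[i].reserved = 0;
                table[i].gid = 0;
        }
}

static void dump(const char *when)
{
        printf("%s\n", when);
        for (int i = 0; i < TABLE_LEN; i++)
                printf("  idx %d: gid=%d valid=%d reserved=%d refcnt=%d\n",
                       i, table[i].gid, table[i].valid,
                       table[i].reserved, table[i].refcnt);
}

int main(void)
{
        for (int g = 1; g <= 4; g++)    /* bond GIDs land at idx 0-3 */
                add_gid(g);
        get_ref(3);                     /* a QP pins the GID at idx 3 */

        for (int i = 0; i < 4; i++)     /* FAILOVER: old GIDs deleted */
                del_gid(i);
        for (int g = 5; g <= 8; g++)    /* new GIDs go to idx 0-2 and 4 */
                add_gid(g);
        dump("after FAILOVER");

        put_ref(3);                     /* QP destroyed: reference dropped */
        dump("after the QP drops its reference");
        return 0;
}

Running it shows the new GIDs at indexes 0-2 and 4 after the failover,
while index 3 is left as a tombstoned slot (valid=0, reserved=1,
refcnt=1), which matches the traffic loss I described above.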

Are the phenomenon and conclusion above correct? This seems odd to me
because, as far as I know, there is no such restriction on active-backup
mode in NIC bonding.

Junxian
