On 30/03/2022 14:14, Alexandra Winter wrote:
>
>
> On 30.03.22 12:23, Nikolay Aleksandrov wrote:
>> On 30/03/2022 03:54, Jakub Kicinski wrote:
>>> Dropping the BPF people from CC and adding Hangbin, bridge and
>>> bond/team. Please exercise some judgment when sending patches.
> Thank you.
> I did 'scripts/get_maintainer.pl drivers/net/veth.c'
> but was a bit surprised about the outcome.
>
>>>
>>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>>> Bonding drivers generate specific events during failover that trigger
>>>> switch updates. When a veth device is attached to a bridge with a
>>>> bond interface, we want external switches to learn about the veth
>>>> devices as well.
>>>>
>>>> Example:
>>>>
>>>>  | veth_a2 | veth_b2 | veth_c2 |
>>>>  ------o-----------o----------o------
>>>>         \          |         /
>>>>          o         o        o
>>>>       veth_a1   veth_b1  veth_c1
>>>>      -------------------------
>>>>      |         bridge        |
>>>>      -------------------------
>>>>                bond0
>>>>               /    \
>>>>            eth0    eth1
>>>>
>>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>>> external hardware attached to eth1.
>>>>
>>>> Without this patch we have seen cases where recovery after bond failover
>>>> took an unacceptable amount of time (depending on timeout settings in the
>>>> network).
>>>>
>>>> Due to the symmetric nature of veth, special care is required to avoid
>>>> endless notification loops. Therefore we only notify from a veth
>>>> bridgeport to a peer that is not a bridgeport.
>>>>
>>>> References:
>>>> Same handling as for macvlan:
>>>> commit 4c9912556867 ("macvlan: Support bonding events")
>>>> and vlan:
>>>> commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")
>>>>
>>>> Alternatives:
>>>> Propagate notifier events to all ports of a bridge. IIUC, this was
>>>> rejected in https://www.spinics.net/lists/netdev/msg717292.html
>>>
>>> My (likely flawed) reading of Nik's argument was that (1) he was
>>> concerned about GARP storms; (2) he didn't want the GARP to be
>>> broadcast to all ports, just the bond that originated the request.
>>>
>>
>> Yes, that would be ideal. Trying to avoid unnecessary bcasts, that is
>> especially important for large setups with lots of devices.
>
> One way to target the bond that originated the request would be for the
> bridge itself to do GARPs/RARPs/... on this bond port for all MACs
> that are in its FDB. What do you think about that?
>

That's a hack, and you can already do it easily in user-space; you don't
need anything special in the bridge. It is also very specific, and it
should only happen in certain situations (e.g. a/b flap) which the bridge
doesn't really know about, but user-space does because it can see the
notifications and can see the bond mode.

>>
>>> I'm not sure I follow (1), as Hangbin said the event is rare, plus
>>> GARP only comes from interfaces that have an IP addr, which IIUC
>>> most bridge ports will not have.
>>>
>>
>> Indeed, such setups are not the most common ones.
>>
>>> This patch in no way addresses (2). But then, again, if we put
>>> a macvlan on top of a bridge master it will shotgun its GARPs all
>>> the same. So it's not like veth would be special in that regard.
>>>
>>> Nik, what am I missing?
>>>
>>
>> If we're talking about macvlan -> bridge -> bond then the bond flap's
>> notify peers shouldn't reach the macvlan.
>> Generally broadcast traffic is quite expensive for the bridge. I have
>> patches that improve on the technical side (consider ports only for the
>> same bcast domain), but you also wouldn't want unnecessary bcast packets
>> being sent around. :)
>> There are setups with tens of bond devices, and propagating that to all
>> would be very expensive, but most of all unnecessary. It would also hurt
>> setups with a lot of vlan devices on the bridge. There are setups with
>> hundreds of vlans and hundreds of macvlans on top; propagating it up would
>> send it to all of them, and that wouldn't scale at all. These mostly have
>> IP addresses too.
>>
>> Perhaps we can enable propagation on a per-port or per-bridge basis, then we
>> can avoid these walks. That is, make it opt-in.
>>
>>>> It also seems difficult to avoid re-bouncing the notifier.
>>>
>>> syzbot will make short work of this patch, I think the potential
>>> for infinite loops has to be addressed somehow. IIUC this is the
>>> first instance of forwarding those notifiers to a peer rather
>>> than within an upper <> lower device hierarchy which is a DAG.
>
> My concern was about Hangbin's alternative proposal to notify all
> bridge ports. I hope in my proposal I was able to avoid infinite loops.
>
>>>
>>>> Signed-off-by: Alexandra Winter <wintera@xxxxxxxxxxxxx>
>>>> ---
>>>>  drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 53 insertions(+)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index d29fb9759cc9..74b074453197 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -1579,6 +1579,57 @@ static void veth_setup(struct net_device *dev)
>>>>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>>>>  }
>>>>
>>>> +static bool netif_is_veth(const struct net_device *dev)
>>>> +{
>>>> +	return (dev->netdev_ops == &veth_netdev_ops);
>>>
>>> brackets unnecessary
>>>
>>>> +}
>>>> +
>>>> +static void veth_notify_peer(unsigned long event, const struct net_device *dev)
>>>> +{
>>>> +	struct net_device *peer;
>>>> +	struct veth_priv *priv;
>>>> +
>>>> +	priv = netdev_priv(dev);
>>>> +	peer = rtnl_dereference(priv->peer);
>>>> +	/* avoid re-bounce between 2 bridges */
>>>> +	if (!netif_is_bridge_port(peer))
>>>> +		call_netdevice_notifiers(event, peer);
>>>> +}
>>>> +
>>>> +/* Called under rtnl_lock */
>>>> +static int veth_device_event(struct notifier_block *unused,
>>>> +			     unsigned long event, void *ptr)
>>>> +{
>>>> +	struct net_device *dev, *lower;
>>>> +	struct list_head *iter;
>>>> +
>>>> +	dev = netdev_notifier_info_to_dev(ptr);
>>>> +
>>>> +	switch (event) {
>>>> +	case NETDEV_NOTIFY_PEERS:
>>>> +	case NETDEV_BONDING_FAILOVER:
>>>> +	case NETDEV_RESEND_IGMP:
>>>> +		/* propagate to peer of a bridge attached veth */
>>>> +		if (netif_is_bridge_master(dev)) {
>>>
>>> Having veth sift thru bridge ports seems strange.
>>> In fact it could be beneficial to filter the event based on
>>> port state (whether it's forwarding, vlan etc). But looking
>>> at details of port state outside the bridge would be even stranger.
>>>
>>>> +			iter = &dev->adj_list.lower;
>>>> +			lower = netdev_next_lower_dev_rcu(dev, &iter);
>>>> +			while (lower) {
>>>> +				if (netif_is_veth(lower))
>>>> +					veth_notify_peer(event, lower);
>>>> +				lower = netdev_next_lower_dev_rcu(dev, &iter);
>>>
>>> let's add netdev_for_each_lower_dev_rcu() rather than open-coding
>>>
>>>> +			}
>>>> +		}
>>>> +		break;
>>>> +	default:
>>>> +		break;
>>>> +	}
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static struct notifier_block veth_notifier_block __read_mostly = {
>>>> +	.notifier_call		= veth_device_event,
>>>
>>> extra tab here
>>>
>>>> +};
>>>> +
>>>>  /*
>>>>   * netlink interface
>>>>   */
>>>> @@ -1824,12 +1875,14 @@ static struct rtnl_link_ops veth_link_ops = {
>>>>
>>>>  static __init int veth_init(void)
>>>>  {
>>>> +	register_netdevice_notifier(&veth_notifier_block);
>>>
>>> this can fail
>>>
>>>>  	return rtnl_link_register(&veth_link_ops);
>>>>  }
>>>>
>>>>  static __exit void veth_exit(void)
>>>>  {
>>>>  	rtnl_link_unregister(&veth_link_ops);
>>>> +	unregister_netdevice_notifier(&veth_notifier_block);
>>>>  }
>>>>
>>>>  module_init(veth_init);
>>>
>>
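
As a side note on the "this can fail" comment above: a rough, untested sketch
of how the init path could check the return value and unwind on failure,
reusing the veth_notifier_block and veth_link_ops from the quoted hunk (this is
only an illustration, not the actual respin):

static __init int veth_init(void)
{
	int err;

	/* register the notifier before the link ops so no event is
	 * missed for a veth created right after registration
	 */
	err = register_netdevice_notifier(&veth_notifier_block);
	if (err)
		return err;

	err = rtnl_link_register(&veth_link_ops);
	if (err)
		/* undo the notifier registration if link registration fails */
		unregister_netdevice_notifier(&veth_notifier_block);

	return err;
}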