On 31.03.22 12:33, Nikolay Aleksandrov wrote:
> On 31/03/2022 12:59, Alexandra Winter wrote:
>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>> Bonding drivers generate specific events during failover that trigger
>>> switch updates. When a veth device is attached to a bridge with a
>>> bond interface, we want external switches to learn about the veth
>>> devices as well.
>>>
>>> Example:
>>>
>>>  | veth_a2 | veth_b2 | veth_c2 |
>>>  ------o-----------o----------o------
>>>        \           |         /
>>>         o          o        o
>>>      veth_a1    veth_b1  veth_c1
>>>  -------------------------
>>>  |         bridge        |
>>>  -------------------------
>>>             bond0
>>>            /    \
>>>         eth0    eth1
>>>
>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>> external hardware attached to eth1.
>>
>> On 30.03.22 21:15, Jay Vosburgh wrote:
>>> Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
>>>
>>>> On Wed, 30 Mar 2022 19:16:42 +0300 Nikolay Aleksandrov wrote:
>>>>>> Maybe opt-out? But assuming the event is only generated on
>>>>>> active/backup switch over - when would it be okay to ignore
>>>>>> the notification?
>>>>>
>>>>> Let me just clarify, so I'm sure I've not misunderstood you. Do you
>>>>> mean opt-out as in make it default on? IMO that would be a problem:
>>>>> large-scale setups would suddenly start propagating it to upper
>>>>> devices, which would cause a lot of unnecessary broadcasts. I meant
>>>>> enable it only if needed, and only on specific ports (the second part
>>>>> is not strictly necessary, it could be global; I think it's ok either
>>>>> way). I don't think any setup which has many upper vlans/macvlans
>>>>> would ever enable this.
>>>>
>>>> That may be. I don't have a good understanding of scenarios in which
>>>> GARP is required and where it's not :) Goes without saying but the
>>>> default should follow the more common scenario.
>>>
>>> At least from the bonding failover perspective, the GARP is
>>> needed when there's a visible topology change (so peers learn the new
>>> path), a change in MAC address, or both. I don't think it's possible to
>>> determine from bonding which topology changes are visible, so any
>>> failover gets a GARP. The original intent, as best I recall, was to
>>> cover IP addresses configured on the bond itself or on VLANs above the
>>> bond.
>>>
>>> If I understand the original problem description correctly, the
>>> bonding failover causes the connectivity issue because the network
>>> segments beyond the bond interfaces don't share forwarding information
>>> (i.e., they are completely independent). The peer (end station or
>>> switch) at the far end of those network segments (where they converge)
>>> cannot directly see that the "to bond eth0" port went down, has no way
>>> to know that anything is awry, and thus won't find the new path until
>>> an ARP or forwarding entry for "veth_a2" (from the original diagram)
>>> times out at the peer out in the network.
>>>
>>>>>>> My concern was about Hangbin's alternative proposal to notify all
>>>>>>> bridge ports. I hope that in my proposal I was able to avoid
>>>>>>> infinite loops.
>>>>>>
>>>>>> Possibly I'm confused as to where the notification for bridge master
>>>>>> gets sent..
>>>>>
>>>>> IIUC it bypasses the bridge and sends a notify peers for the veth
>>>>> peer, so it would generate a grat arp
>>>>> (inetdev_event -> NETDEV_NOTIFY_PEERS).
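(For reference, this is the IPv4 side of that path: inetdev_event() in
net/ipv4/devinet.c reacts to NETDEV_NOTIFY_PEERS by emitting a gratuitous
ARP. Abridged:

        case NETDEV_CHANGEADDR:
                if (!IN_DEV_ARP_NOTIFY(in_dev))
                        break;
                fallthrough;
        case NETDEV_NOTIFY_PEERS:
                /* Send gratuitous ARP to notify of link change */
                inetdev_send_gratuitous_arp(dev, in_dev);
                break;

So any device that receives the notifier and has an IPv4 address will
re-announce itself.)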
>>>> Ack, I was basically repeating the question of where does
>>>> the notification with dev == br get generated.
>>>>
>>>> There is a protection in this patch to make sure the other
>>>> end of the veth is not plugged into a bridge (i.e. is not
>>>> a bridge port), but there can be a macvlan on top of that
>>>> veth that is part of a bridge, so IIUC that check is either
>>>> insufficient or unnecessary.
>>>
>>> I'm a bit concerned this is becoming an interface-plumbing
>>> topology-change whack-a-mole.
>>>
>>> In the above, what if the veth is plugged into a bridge, and
>>> there's an end station on that bridge? If it's bridges all the way
>>> down, where does the need for some kind of TCN mechanism stop?
>>>
>>> Or instead of a veth it's a physical network hop (perhaps a
>>> tunnel; something through which notifiers do not propagate) to another
>>> host with another bridge, then what?
>>>
>>> -J
>>>
>>> ---
>>>         -Jay Vosburgh, jay.vosburgh@xxxxxxxxxxxxx
>>
>> I see 3 technologies that are used for network virtualization in
>> combination with bond for redundancy (and I may be missing some):
>>
>> (1) MACVTAP/MACVLAN over bond:
>> MACVLAN propagates notifiers from bond to endpoints (same as VLAN)
>> (drivers/net/macvlan.c:
>>         case NETDEV_NOTIFY_PEERS:
>>         case NETDEV_BONDING_FAILOVER:
>>         case NETDEV_RESEND_IGMP:
>>                 /* Propagate to all vlans */
>>                 list_for_each_entry(vlan, &port->vlans, list)
>>                         call_netdevice_notifiers(event, vlan->dev);
>> })
>>
>> (2) OpenVSwitch:
>> OVS seems to have its own bond implementation, but sends out a reverse
>> ARP on active-backup failover.
>>
>> (3) User-defined bridge over bond:
>> propagates notifiers to the bridge device itself, but not to the
>> devices attached to bridge ports.
>> (net/bridge/br.c:
>>         case NETDEV_RESEND_IGMP:
>>                 /* Propagate to master device */
>>                 call_netdevice_notifiers(event, br->dev);)
>>
>> Active-backup may not be the best bonding mode, but it is a simple way
>> to achieve redundancy, and I have seen it being used.
>> I don't see a use case for MACVLAN over bridge over bond (?)
>
> If you're talking about this particular case (network virtualization) -
> sure. But macvlans over bridges are heavily used in Cumulus Linux and
> large-scale setups. For example, VRRP is implemented using macvlan
> devices. Any notification that propagates to the bridge and reaches
> these would cause a storm of broadcasts being sent down, which would not
> scale and is extremely undesirable in general.
>
>> The external HW network does not need to be updated about the instances
>> that are connected via tunnel, so I don't see an issue there.
>>
>> I had an idea for how to solve the failover issue for veth pairs
>> attached to the user-defined bridge.
>> Does this need to be configurable? How? Per veth pair?
>
> That is not what I meant (if you were referring to my comment). I meant
> that if it gets implemented in the bridge and it starts propagating the
> notify peers notifier - that _must_ be configurable.
>
>> Of course a more general solution for how bridge over bond could handle
>> notifications would be great, but I'm running out of ideas. So I
>> thought I'd address veth first.
>> Your help and ideas are highly appreciated, thank you.
>
> I'm curious why it must be done in the kernel altogether? This can
> obviously be solved in user-space by sending grat arps towards the
> flapped port for fdbs on other ports (e.g. veths), based on a netlink
> notification. In fact, based on your description, propagating
> NETDEV_NOTIFY_PEERS to bridge ports wouldn't help, because in that case
> the remote peer veth will not generate a grat arp. The notification will
> get propagated only to the local veth (bridge port), or the bridge
> itself, depending on implementation.
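(To make that user-space route concrete: the gratuitous-ARP half could be
as small as the sketch below. A daemon would call it from an rtnetlink
socket subscribed to RTMGRP_LINK whenever the bond's active slave changes,
once per fdb entry it wants to re-announce. Untested sketch, all names
illustrative:

        #include <arpa/inet.h>
        #include <netinet/if_ether.h>
        #include <linux/if_packet.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <unistd.h>

        /* Send a gratuitous ARP (request form) announcing mac/ip out of
         * the interface identified by ifindex, e.g. the bond.
         */
        static int send_garp(int ifindex, const unsigned char mac[ETH_ALEN],
                             in_addr_t ip)
        {
                unsigned char frame[sizeof(struct ether_header) +
                                    sizeof(struct ether_arp)];
                struct ether_header *eh = (struct ether_header *)frame;
                struct ether_arp *ea = (struct ether_arp *)(eh + 1);
                struct sockaddr_ll sll = {
                        .sll_family  = AF_PACKET,
                        .sll_ifindex = ifindex,
                        .sll_halen   = ETH_ALEN,
                };
                int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ARP));

                if (fd < 0)
                        return -1;
                memset(sll.sll_addr, 0xff, ETH_ALEN);   /* link-level bcast */

                memset(eh->ether_dhost, 0xff, ETH_ALEN);
                memcpy(eh->ether_shost, mac, ETH_ALEN);
                eh->ether_type = htons(ETH_P_ARP);

                ea->arp_hrd = htons(ARPHRD_ETHER);
                ea->arp_pro = htons(ETH_P_IP);
                ea->arp_hln = ETH_ALEN;
                ea->arp_pln = 4;
                ea->arp_op  = htons(ARPOP_REQUEST);
                memcpy(ea->arp_sha, mac, ETH_ALEN);
                memcpy(ea->arp_spa, &ip, 4);  /* sender == target: gratuitous */
                memset(ea->arp_tha, 0, ETH_ALEN);
                memcpy(ea->arp_tpa, &ip, 4);

                if (sendto(fd, frame, sizeof(frame), 0,
                           (struct sockaddr *)&sll, sizeof(sll)) < 0) {
                        close(fd);
                        return -1;
                }
                close(fd);
                return 0;
        }

The fdb entries for the veth ports could be fetched the same way
"bridge fdb show" does, i.e. an RTM_GETNEIGH dump with AF_BRIDGE.)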
>
> So from the bridge's perspective, if you decide to pursue a kernel
> solution, I think you'll need a new bridge port option which acts on
> NOTIFY_PEERS and generates a grat arp for all fdbs on the port where it
> is enabled, towards the port which generated the NOTIFY_PEERS. Note that
> this is also fragile, as I'm sure some stacked device configs would not
> work, so I want to re-iterate how much easier it is to solve this in
> user-space, which has better visibility and which you can change much
> faster to accommodate new use cases.
>
> To illustrate:
>           bond
>             \
>            bridge
>            /
>        veth0
>          |
>        veth1
>
> When bond generates NOTIFY_PEERS, and you have this new option enabled
> on veth0, then the bridge should generate grat arps for all fdbs on
> veth0 towards bond, so the new path would learn them. Note that this is
> very dangerous, as veth1 can generate thousands of fdbs and you can
> potentially DDoS the whole network, so again I'd advise doing this in
> user-space where you can better control it.
>
> W.r.t. this patch, I think it will also work and will cause a single
> grat arp, which is ok. Just need to make sure loops are not possible;
> for example, I think you can loop your implementation with the following
> config (untested theory):
>
>           bond
>             \
>            bridge
>            \    \
>     veth2.10    veth0 - veth1
>          \         \
>           \       veth1.10 (vlan)
>            \         \
>             \       bridge2
>              \       /
>             veth2 - veth3
>
> 1. bond generates NOTIFY_PEERS
> 2. bridge propagates to veth1 (through veth0 port)
> 3. veth1 propagates to its vlan (veth1.10)
> 4. bridge2 sees veth1.10's NOTIFY_PEERS and propagates to veth2
>    (through veth3 port)
> 5. veth2 propagates to its vlan (veth2.10)
> 6. veth2.10 propagates it back to bridge
> <loop>
>
> I'm sure a similar setup, and maybe even a simpler one, can be
> constructed with other devices which can propagate or generate
> NOTIFY_PEERS.
>
> Cheers,
>  Nik

Thank you very much, Nik, for your advice and your thorough explanations.

I think I could prevent the loop you describe above, but I agree that
propagating notifications through veth is riskier, because there is no
upper-lower relationship.

Is there interest in a v3 of my patch, where I would also incorporate
Jakub's comments? Otherwise I would next explore a user-space solution
like Nik proposed.

Thank you all very much
Alexandra
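P.S. To make "prevent the loop" concrete, one possible shape of the guard
in the veth notifier handler, before forwarding the event to the peer
(untested sketch, exact helper choice still to be reviewed):

        struct veth_priv *priv = netdev_priv(dev);
        struct net_device *peer;

        /* Don't forward if the peer is itself stacked under anything
         * (bridge, vlan, macvlan, ...): any upper could re-propagate
         * the notifier and feed it back to us. Notifiers run under
         * RTNL, which netdev_has_any_upper_dev() requires.
         */
        peer = rtnl_dereference(priv->peer);
        if (!peer || netdev_has_any_upper_dev(peer))
                return NOTIFY_DONE;
        call_netdevice_notifiers(event, peer);

That would restrict the re-announcement to veth peers that are leaf
devices, which covers the scenario in my original diagram and would stop
your loop at step 2, since veth1 has veth1.10 as an upper.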