On Mon, Feb 12, 2024 at 08:18:43PM +0100, Tobias Waldekranz wrote: > Before this change, generation of the list of MDB events to replay > would race against the creation of new group memberships, either from > the IGMP/MLD snooping logic or from user configuration. > > While new memberships are immediately visible to walkers of > br->mdb_list, the notification of their existence to switchdev event > subscribers is deferred until a later point in time. So if a replay > list was generated during a time that overlapped with such a window, > it would also contain a replay of the not-yet-delivered event. > > The driver would thus receive two copies of what the bridge internally > considered to be one single event. On destruction of the bridge, only > a single membership deletion event was therefore sent. As a > consequence of this, drivers which reference count memberships (at > least DSA), would be left with orphan groups in their hardware > database when the bridge was destroyed. > > This is only an issue when replaying additions. While deletion events > may still be pending on the deferred queue, they will already have > been removed from br->mdb_list, so no duplicates can be generated in > that scenario. > > To a user this meant that old group memberships, from a bridge in > which a port was previously attached, could be reanimated (in > hardware) when the port joined a new bridge, without the new bridge's > knowledge. > > For example, on an mv88e6xxx system, create a snooping bridge and > immediately add a port to it: > > root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \ > > ip link set dev x3 up master br0 > > And then destroy the bridge: > > root@infix-06-0b-00:~$ ip link del dev br0 > root@infix-06-0b-00:~$ mvls atu > ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a > DEV:0 Marvell 88E6393X > 33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . . > 33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . . > ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a > root@infix-06-0b-00:~$ > > The two IPv6 groups remain in the hardware database because the > port (x3) is notified of the host's membership twice: once via the > original event and once via a replay. Since only a single delete > notification is sent, the count remains at 1 when the bridge is > destroyed. > > Then add the same port (or another port belonging to the same hardware > domain) to a new bridge, this time with snooping disabled: > > root@infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \ > > ip link set dev x3 up master br1 > > All multicast, including the two IPv6 groups from br0, should now be > flooded, according to the policy of br1. But instead the old > memberships are still active in the hardware database, causing the > switch to only forward traffic to those groups towards the CPU (port > 0). > > Eliminate the race in two steps: > > 1. Grab the write-side lock of the MDB while generating the replay > list. > > This prevents new memberships from showing up while we are generating > the replay list. But it leaves the scenario in which a deferred event > was already generated, but not delivered, before we grabbed the > lock. Therefore: > > 2. Make sure that no deferred version of a replay event is already > enqueued to the switchdev deferred queue, before adding it to the > replay list, when replaying additions. > > Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries") > Signed-off-by: Tobias Waldekranz <tobias@xxxxxxxxxxxxxx> > --- Excellent from my side, thank you! Reviewed-by: Vladimir Oltean <olteanv@xxxxxxxxx> > @@ -307,6 +336,50 @@ int switchdev_port_obj_del(struct net_device *dev, > } > EXPORT_SYMBOL_GPL(switchdev_port_obj_del); > > +/** > + * switchdev_port_obj_act_is_deferred - Is object action pending? > + * > + * @dev: port device > + * @nt: type of action; add or delete > + * @obj: object to test > + * > + * Returns true if a deferred item is exists, which is equivalent > + * to the action @nt of an object @obj. nitpick: replace "is exists" with something else like "is pending" or "exists". Also "action of an object" or "on an object"? > + * > + * rtnl_lock must be held. > + */