On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@xxxxxxxxx> wrote: > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote: >> Before this change, generation of the list of events MDB to replay > > s/events MDB/MDB events/ > >> would race against the IGMP/MLD snooping logic, which could concurrently > > logic. This could (...) > >> enqueue events to the switchdev deferred queue, leading to duplicate >> events being sent to drivers. As a consequence of this, drivers which >> reference count memberships (at least DSA), would be left with orphan >> groups in their hardware database when the bridge was destroyed. > > Still missing the user impact description, aka "when would this be > noticed by, and actively bother an user?". Something that would justify > handling this via net.git rather than net-next.git. Other than moving up the example to this point, what do you feel is missing? Or are you saying that you don't think this belongs on net.git? >> >> Avoid this by grabbing the write-side lock of the MDB while generating >> the replay list, making sure that no deferred version of a replay >> event is already enqueued to the switchdev deferred queue, before >> adding it to the replay list. > > The description of the solution is actually not very satisfactory to me. > I would have liked to see a more thorough analysis. Sorry to disappoint. > The race has 2 components, one comes from the fact that during replay, > we iterate using RCU, which does not halt concurrent updates, and the > other comes from the fact that the MDB addition procedure is non-atomic. > Elements are first added to the br->mdb_list, but are notified to > switchdev in a deferred context. > > Grabbing the bridge multicast spinlock only solves the first problem: it > stops new enqueues of deferred events. We also need special handling of > the pending deferred events. The idea there is that we cannot cancel > them, since that would lead to complications for other potential > listeners on the switchdev chain. And we cannot flush them either, since > that wouldn't actually help: flushing needs sleepable context, which is > incompatible with holding br->multicast_lock, and without > br->multicast_lock held, we haven't actually solved anything, since new > deferred events can still be enqueued at any time. > > So the only simple solution is to let the pending deferred events > execute (eventually), but during event replay on joining port, exclude > replaying those multicast elements which are in the bridge's multicast > list, but were not yet added through switchdev. Eventually they will be. In my opinion, your three paragraphs above say pretty much the same thing as my single paragraph. But sure, I will try to provide more details in v4. > (side note: the handling code for excluding replays on pending event > deletion seems to not actually help, because) > > Event replays on a switchdev port leaving the bridge are also > problematic, but in a different way. The deletion procedure is also > non-atomic, they are first removed from br->mdb_list then the switchdev > notification is deferred. So, the replay procedure cannot enter a > condition where it replays the deletion twice. But, there is no > switchdev_deferred_process() call when the switchdev port unoffloads an > intermediary LAG bridge port, and this can lead to the situation where > neither the replay, nor the deferred switchdev object, trickle down to a > call in the switchdev driver. So for event deletion, we need to force a > synchronous call to switchdev_deferred_process(). Good catch! What does this mean for v4? Is this still one logical change that ensures that MDB events are always delivered exactly once, or a series where one patch fixes the duplicates and one ensures that no events are missed? > See how the analysis in the commit message changes the patch? Suddenly the novice was enlightened. >> >> An easy way to reproduce this issue, on an mv88e6xxx system, was to > > s/was/is/ > >> create a snooping bridge, and immediately add a port to it: >> >> root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \ >> > ip link set dev x3 up master br0 >> root@infix-06-0b-00:~$ ip link del dev br0 >> root@infix-06-0b-00:~$ mvls atu >> ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a >> DEV:0 Marvell 88E6393X >> 33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . . >> 33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . . >> ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a >> root@infix-06-0b-00:~$ >> >> The two IPv6 groups remain in the hardware database because the >> port (x3) is notified of the host's membership twice: once via the >> original event and once via a replay. Since only a single delete >> notification is sent, the count remains at 1 when the bridge is >> destroyed. > > These 2 paragraphs, plus the mvls output, should be placed before the > "Avoid this" paragraph. > >> >> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries") >> Signed-off-by: Tobias Waldekranz <tobias@xxxxxxxxxxxxxx> >> ---