On Tue, Feb 06, 2024 at 10:58:18PM +0100, Tobias Waldekranz wrote:
> On tis, feb 06, 2024 at 21:31, Vladimir Oltean <olteanv@xxxxxxxxx> wrote:
> > On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
> >> On mån, feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@xxxxxxxxx> wrote:
> >> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
> >> >> Before this change, generation of the list of events MDB to replay
> >> >
> >> > s/events MDB/MDB events/
> >> >
> >> >> would race against the IGMP/MLD snooping logic, which could concurrently
> >> >
> >> > logic. This could (...)
> >> >
> >> >> enqueue events to the switchdev deferred queue, leading to duplicate
> >> >> events being sent to drivers. As a consequence of this, drivers which
> >> >> reference count memberships (at least DSA), would be left with orphan
> >> >> groups in their hardware database when the bridge was destroyed.
> >> >
> >> > Still missing the user impact description, aka "when would this be
> >> > noticed by, and actively bother a user?". Something that would justify
> >> > handling this via net.git rather than net-next.git.
> >>
> >> Other than moving up the example to this point, what do you feel is
> >> missing? Or are you saying that you don't think this belongs on net.git?
> >
> > I just don't have enough data to be able to judge (it's missing from the
> > commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
> > as a reference. It lists a series of guidelines from which I gather,
> > generally speaking, that there should exist a user impact on
> > functionality. The "mvls atu" command is a debug program, it just dumps
> > hardware tables. It is enough to point out that multicast entries leak,
> > but not what will be broken as a result of that.
>
> Fair enough.
>
> I originally discovered the issue while developing a kselftest for
> another improvement I want to make to switchdev multicast offloading
> (related to router ports).
> I thought my new code was causing these orphan entries, but then
> realized that the issue existed on a plain -net and -net-next kernel.
>
> I can imagine scenarios in which a user _could_ be impacted by this, but
> they are all quite esoteric. Maybe that is an indication that I should
> just target net-next instead.

Again, I'm not discouraging you from sending this patch to net.git. I've
been saying (since v1, apparently:
https://lore.kernel.org/netdev/20240131135157.ddrtt4swvz5y3nbz@skbuf/)
that if there is a reason to do it, just say it in the commit message, so
that we're all on the same page. Be verbose when calculating the
cost/benefit ratio (preferably also in the commit message).

I would rate the complexity of the change as "medium", since it does
change the locking scheme to fix an underdesigned mechanism. And the fact
that mlxsw also uses replays, through a slightly different code path,
means there is something you (probably) can't test, which makes the
change potentially risky.

> >> > The race has 2 components: one comes from the fact that during
> >> > replay, we iterate using RCU, which does not halt concurrent
> >> > updates, and the other comes from the fact that the MDB addition
> >> > procedure is non-atomic. Elements are first added to the
> >> > br->mdb_list, but are notified to switchdev in a deferred context.
> >> >
> >> > Grabbing the bridge multicast spinlock only solves the first
> >> > problem: it stops new enqueues of deferred events. We also need
> >> > special handling of the pending deferred events. The idea there is
> >> > that we cannot cancel them, since that would lead to complications
> >> > for other potential listeners on the switchdev chain.
> >> > And we cannot flush them either, since that wouldn't actually
> >> > help: flushing needs sleepable context, which is incompatible with
> >> > holding br->multicast_lock, and without br->multicast_lock held,
> >> > we haven't actually solved anything, since new deferred events can
> >> > still be enqueued at any time.
> >> >
> >> > So the only simple solution is to let the pending deferred events
> >> > execute (eventually), but during event replay on a joining port,
> >> > exclude from replay those multicast elements which are in the
> >> > bridge's multicast list, but were not yet added through switchdev.
> >> > Eventually they will be.
> >>
> >> In my opinion, your three paragraphs above say pretty much the same
> >> thing as my single paragraph. But sure, I will try to provide more
> >> details in v4.
> >
> > It's not about word count; I'm notoriously bad at summarizing. If you
> > could still say in a single paragraph what I did in 3, it would be
> > great.
> >
> > The difference is that the above 3 paragraphs only explore the race
> > on MDB addition events. It is of a different nature than the one on
> > deletion. Your analysis lumps them together, and the code follows
> > suit. There is code to supposedly handle the race between deferred
> > deletion events and replay on port removal, but I don't think it
> > does.
>
> That's the thing though: I did not lump them together, I simply did not
> think about the deletion issue at all. An issue which you yourself
> state should probably be fixed in a separate patch, presumably because
> you think of them as two very closely related, but ultimately separate,
> issues. Which is why I think it was needlessly harsh to say that my
> analysis of the duplicate-events issue was too simplistic, because it
> did not address a related issue that I had not considered.

So in the comments for your v1, one of the things I told you was that
"there's one more race to deal with" on removal:
https://lore.kernel.org/netdev/20240131150642.mxcssv7l5qfiejkl@skbuf/

At the time, the idea had not crystallized in my mind either that it's
not "one more" race on removal, but that on removal it's _the only_
race.

You then submitted v2 and v3 without taking this comment into
consideration, not even to explore it. Exploring it thoroughly would
have revealed that your approach - applying the algorithm of excluding
objects from replay if events on them are pending in switchdev - is
completely unnecessary when "adding=false". In light of my comment, and
the lack of a detailed analysis on your side on removal, it _appears_,
by looking at this change, that the patch handles a race on removal,
but the code actually runs through the algorithm that handles a race
that doesn't exist, and doesn't handle the race that _does_ exist.

My part of the fault is doing spotty review, giving ideas piecemeal,
and getting frustrated that you aren't picking all of them up. It's
still going to be a really busy few weeks/months for me ahead, and
there's nothing I can do about that. I'm faced with the alternatives of
either doing a shit job reviewing, leaving comments/ideas in the little
time I have, or not reviewing at all. I'll think about my options some
more.