On Tue, Feb 06, 2024 at 21:31, Vladimir Oltean <olteanv@xxxxxxxxx> wrote:
> On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
>> On Mon, Feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@xxxxxxxxx> wrote:
>> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
>> >> Before this change, generation of the list of events MDB to replay
>> >
>> > s/events MDB/MDB events/
>> >
>> >> would race against the IGMP/MLD snooping logic, which could concurrently
>> >
>> > logic. This could (...)
>> >
>> >> enqueue events to the switchdev deferred queue, leading to duplicate
>> >> events being sent to drivers. As a consequence of this, drivers which
>> >> reference count memberships (at least DSA), would be left with orphan
>> >> groups in their hardware database when the bridge was destroyed.
>> >
>> > Still missing the user impact description, aka "when would this be
>> > noticed by, and actively bother, a user?". Something that would justify
>> > handling this via net.git rather than net-next.git.
>>
>> Other than moving up the example to this point, what do you feel is
>> missing? Or are you saying that you don't think this belongs on net.git?
>
> I just don't have enough data to be able to judge (it's missing from the
> commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
> as a reference. It lists a series of guidelines from which I gather,
> generally speaking, that there should exist a user impact on
> functionality. The "mvls atu" command is a debug program, it just dumps
> hardware tables. It is enough to point out that multicast entries leak,
> but not what will be broken as a result of that.

Fair enough. I originally discovered the issue while developing a
kselftest for another improvement I want to make to switchdev multicast
offloading (related to router ports). I thought my new code was causing
these orphan entries, but then realized that the issue existed on plain
-net and -net-next kernels.

I can imagine scenarios in which a user _could_ be impacted by this, but
they are all quite esoteric. Maybe that is an indication that I should
just target net-next instead.

>> >> Avoid this by grabbing the write-side lock of the MDB while generating
>> >> the replay list, making sure that no deferred version of a replay
>> >> event is already enqueued to the switchdev deferred queue, before
>> >> adding it to the replay list.
>> >
>> > The description of the solution is actually not very satisfactory to
>> > me. I would have liked to see a more thorough analysis.
>>
>> Sorry to disappoint.
>>
>> > The race has two components: one comes from the fact that during
>> > replay, we iterate using RCU, which does not halt concurrent updates,
>> > and the other from the fact that the MDB addition procedure is
>> > non-atomic. Elements are first added to the br->mdb_list, but are
>> > notified to switchdev in a deferred context.
>> >
>> > Grabbing the bridge multicast spinlock only solves the first problem:
>> > it stops new enqueues of deferred events. We also need special handling
>> > of the pending deferred events. The idea there is that we cannot cancel
>> > them, since that would lead to complications for other potential
>> > listeners on the switchdev chain. And we cannot flush them either,
>> > since that wouldn't actually help: flushing needs sleepable context,
>> > which is incompatible with holding br->multicast_lock, and without
>> > br->multicast_lock held, we haven't actually solved anything, since new
>> > deferred events can still be enqueued at any time.
>> >
>> > So the only simple solution is to let the pending deferred events
>> > execute (eventually), but during event replay on a joining port,
>> > exclude replaying those multicast elements which are in the bridge's
>> > multicast list, but were not yet added through switchdev. Eventually
>> > they will be.
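
If I read that right, the shape of the addition-side fix would be
something like the sketch below. To be explicit about what is real and
what is not: br->multicast_lock, br->mdb_list and the deferred queue are
real, while switchdev_mdb_is_deferred() and mdb_replay_list_add() are
hypothetical helper names I am using purely for illustration:

	struct net_bridge_mdb_entry *mp;
	LIST_HEAD(replay_list);

	/* Build the replay list under br->multicast_lock, so that the
	 * snooping logic cannot enqueue new deferred events while we
	 * iterate. Skip any entry whose addition is already sitting in
	 * the switchdev deferred queue: it will reach the driver when
	 * that queue is processed, so replaying it here would produce
	 * a duplicate.
	 */
	spin_lock_bh(&br->multicast_lock);
	hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {
		if (switchdev_mdb_is_deferred(dev, mp)) /* hypothetical */
			continue;
		/* must allocate with GFP_ATOMIC under the spinlock */
		mdb_replay_list_add(&replay_list, mp);  /* hypothetical */
	}
	spin_unlock_bh(&br->multicast_lock);

	/* Only after dropping the lock can replay_list be fed through
	 * the blocking notifier chain, since notifiers may sleep.
	 */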
>> In my opinion, your three paragraphs above say pretty much the same
>> thing as my single paragraph. But sure, I will try to provide more
>> details in v4.
>
> It's not about word count, I'm notoriously bad at summarizing. If you
> could still say in a single paragraph what I did in 3, it would be great.
>
> The difference is that the above 3 paragraphs only explore the race on
> MDB addition events. It is of a different nature than the one on
> deletion. Your analysis lumps them together, and the code follows suit.
> There is code to supposedly handle the race between deferred deletion
> events and replay on port removal, but I don't think it does.

That's the thing, though: I did not lump them together; I simply did not
think about the deletion issue at all. It is an issue which you yourself
state should probably be fixed in a separate patch, presumably because
you think of them as two closely related, but ultimately separate,
issues. That is why I think it was needlessly harsh to call my analysis
of the duplicate-events issue too simplistic because it did not address
a related issue that I had not considered.

>> > (side note: the handling code for excluding replays on pending event
>> > deletion seems to not actually help, because)
>> >
>> > Event replays on a switchdev port leaving the bridge are also
>> > problematic, but in a different way. The deletion procedure is also
>> > non-atomic: elements are first removed from br->mdb_list, then the
>> > switchdev notification is deferred. So the replay procedure cannot
>> > enter a condition where it replays the deletion twice. But there is no
>> > switchdev_deferred_process() call when the switchdev port unoffloads an
>> > intermediary LAG bridge port, and this can lead to the situation where
>> > neither the replay, nor the deferred switchdev object, trickles down to
>> > a call in the switchdev driver. So for event deletion, we need to force
>> > a synchronous call to switchdev_deferred_process().
>>
>> Good catch!
>>
>> What does this mean for v4? Is this still one logical change that
>> ensures that MDB events are always delivered exactly once, or a series
>> where one patch fixes the duplicates and one ensures that no events
>> are missed?
>
> I'm not saying I know how you can satisfy everyone's review requests.

Of course, I was merely trying to avoid an extra round of patches.

> Once it is clear that deferred additions and deferred removals need
> separate treatment, it would probably be preferable to treat them in
> separate patches.

I should have been clearer in my response. When I said "Good catch!",
that was because I had first reproduced the issue, and then verified
that adding a call to switchdev_deferred_process() at the end of
nbp_switchdev_unsync_objs() solves the issue.
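
For concreteness, the change I tested looks roughly like this. Take it
as a sketch against my local tree: the three replay calls shown as
context are reproduced from memory, so their exact argument lists may
differ from what you have:

--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ static void nbp_switchdev_unsync_objs(struct net_bridge_port *p, ...)
 	br_switchdev_fdb_replay(br_dev, ctx, false, atomic_nb);
 	br_switchdev_mdb_replay(br_dev, dev, ctx, false, blocking_nb, NULL);
 	br_switchdev_vlan_replay(br_dev, ctx, false, blocking_nb, NULL);
+
+	/* Make sure the device has seen all pending deferred events
+	 * (e.g. the MDB deletions enqueued by the replay above) before
+	 * it is disassociated from the bridge. When the port leaves via
+	 * an intermediary LAG, nothing else drains the deferred queue
+	 * on its behalf.
+	 */
+	switchdev_deferred_process();
 }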
>> > See how the analysis in the commit message changes the patch?
>>
>> Suddenly the novice was enlightened.
>
> This, to me, reads like unnecessary sarcasm.

Your question, to me, read like unnecessary condescension.

> I just mean that I don't think you spent enough time to explore the
> problem space in the commit message. It is a very important element of
> a patch, because it is very rare for someone to solve a problem which
> is not even formulated. I actually tried to be helpful by pointing out
> where things might not work out. I'm sorry if it did not appear that
> way.

I am extremely grateful for your help, truly! That is why I complimented
you for pointing out the second issue to me. Something about the
phrasing of that question really rubbed me the wrong way, though. I
should have just let it go; I'm sorry.