On Tue, Feb 06, 2024 at 21:31, Vladimir Oltean <olteanv@xxxxxxxxx> wrote:
> On Tue, Feb 06, 2024 at 03:54:25PM +0100, Tobias Waldekranz wrote:
>> On Mon, Feb 05, 2024 at 13:41, Vladimir Oltean <olteanv@xxxxxxxxx> wrote:
>> > On Thu, Feb 01, 2024 at 05:10:45PM +0100, Tobias Waldekranz wrote:
>> >> Before this change, generation of the list of events MDB to replay
>> >
>> > s/events MDB/MDB events/
>> >
>> >> would race against the IGMP/MLD snooping logic, which could concurrently
>> >
>> > logic. This could (...)
>> >
>> >> enqueue events to the switchdev deferred queue, leading to duplicate
>> >> events being sent to drivers. As a consequence of this, drivers which
>> >> reference count memberships (at least DSA), would be left with orphan
>> >> groups in their hardware database when the bridge was destroyed.
>> >
>> > Still missing the user impact description, aka "when would this be
>> > noticed by, and actively bother, a user?". Something that would justify
>> > handling this via net.git rather than net-next.git.
>>
>> Other than moving up the example to this point, what do you feel is
>> missing? Or are you saying that you don't think this belongs on net.git?
>
> I just don't have enough data to be able to judge (it's missing from the
> commit message). I'm looking at Documentation/process/stable-kernel-rules.rst
> as a reference. It lists a series of guidelines from which I gather,
> generally speaking, that there should exist a user impact on
> functionality. The "mvls atu" command is a debug program, it just dumps
> hardware tables. It is enough to point out that multicast entries leak,
> but not what will be broken as a result of that.

Fair enough. I originally discovered the issue while developing a
kselftest for another improvement I want to make to switchdev multicast
offloading (related to router ports). I thought my new code was causing
these orphan entries, but then realized that the issue existed on plain
-net and -net-next kernels.

I can imagine scenarios in which a user _could_ be impacted by this, but
they are all quite esoteric. Maybe that is an indication that I should
just target net-next instead.

>> >> Avoid this by grabbing the write-side lock of the MDB while generating
>> >> the replay list, making sure that no deferred version of a replay
>> >> event is already enqueued to the switchdev deferred queue, before
>> >> adding it to the replay list.
>> >
>> > The description of the solution is actually not very satisfactory to
>> > me. I would have liked to see a more thorough analysis.
>>
>> Sorry to disappoint.
>>
>> > The race has two components: one comes from the fact that during
>> > replay, we iterate using RCU, which does not halt concurrent updates,
>> > and the other from the fact that the MDB addition procedure is
>> > non-atomic. Elements are first added to the br->mdb_list, but are
>> > notified to switchdev in a deferred context.
>> >
>> > Grabbing the bridge multicast spinlock only solves the first problem:
>> > it stops new enqueues of deferred events. We also need special handling
>> > of the pending deferred events. The idea there is that we cannot cancel
>> > them, since that would lead to complications for other potential
>> > listeners on the switchdev chain. And we cannot flush them either,
>> > since that wouldn't actually help: flushing needs sleepable context,
>> > which is incompatible with holding br->multicast_lock, and without
>> > br->multicast_lock held, we haven't actually solved anything, since new
>> > deferred events can still be enqueued at any time.
>> >
>> > So the only simple solution is to let the pending deferred events
>> > execute (eventually), but during event replay on a joining port,
>> > exclude replaying those multicast elements which are in the bridge's
>> > multicast list, but were not yet added through switchdev. Eventually
>> > they will be.
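
If I read that right, the shape of the addition-side fix would be
something like the sketch below. To be explicit about what is real and
what is not: br->multicast_lock, br->mdb_list and the deferred queue are
real, while switchdev_mdb_is_deferred() and mdb_replay_list_add() are
hypothetical helper names I am using purely for illustration:

	struct net_bridge_mdb_entry *mp;
	LIST_HEAD(replay_list);

	/* Build the replay list under br->multicast_lock, so that the
	 * snooping logic cannot enqueue new deferred events while we
	 * iterate. Skip any entry whose addition is already sitting in
	 * the switchdev deferred queue: it will reach the driver when
	 * that queue is processed, so replaying it here would produce
	 * a duplicate.
	 */
	spin_lock_bh(&br->multicast_lock);
	hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {
		if (switchdev_mdb_is_deferred(dev, mp)) /* hypothetical */
			continue;
		/* must allocate with GFP_ATOMIC under the spinlock */
		mdb_replay_list_add(&replay_list, mp);  /* hypothetical */
	}
	spin_unlock_bh(&br->multicast_lock);

	/* Only after dropping the lock can replay_list be fed through
	 * the blocking notifier chain, since notifiers may sleep.
	 */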
>> In my opinion, your three paragraphs above say pretty much the same
>> thing as my single paragraph. But sure, I will try to provide more
>> details in v4.
>
> It's not about word count, I'm notoriously bad at summarizing. If you
> could still say in a single paragraph what I did in 3, it would be great.
>
> The difference is that the above 3 paragraphs only explore the race on
> MDB addition events. It is of a different nature than the one on
> deletion. Your analysis lumps them together, and the code follows suit.
> There is code to supposedly handle the race between deferred deletion
> events and replay on port removal, but I don't think it does.

That's the thing, though: I did not lump them together; I simply did not
think about the deletion issue at all. It is an issue which you yourself
state should probably be fixed in a separate patch, presumably because
you think of them as two closely related, but ultimately separate,
issues. That is why I think it was needlessly harsh to call my analysis
of the duplicate-events issue too simplistic because it did not address
a related issue that I had not considered.

>> > (side note: the handling code for excluding replays on pending event
>> > deletion seems to not actually help, because)
>> >
>> > Event replays on a switchdev port leaving the bridge are also
>> > problematic, but in a different way. The deletion procedure is also
>> > non-atomic: elements are first removed from br->mdb_list, then the
>> > switchdev notification is deferred. So the replay procedure cannot
>> > enter a condition where it replays the deletion twice. But there is no
>> > switchdev_deferred_process() call when the switchdev port unoffloads an
>> > intermediary LAG bridge port, and this can lead to the situation where
>> > neither the replay, nor the deferred switchdev object, trickles down to
>> > a call in the switchdev driver. So for event deletion, we need to force
>> > a synchronous call to switchdev_deferred_process().
>>
>> Good catch!
>>
>> What does this mean for v4? Is this still one logical change that
>> ensures that MDB events are always delivered exactly once, or a series
>> where one patch fixes the duplicates and one ensures that no events
>> are missed?
>
> I'm not saying I know how you can satisfy everyone's review requests.

Of course, I was merely trying to avoid an extra round of patches.

> Once it is clear that deferred additions and deferred removals need
> separate treatment, it would probably be preferable to treat them in
> separate patches.

I should have been clearer in my response. When I said "Good catch!",
that was because I had first reproduced the issue, and then verified
that adding a call to switchdev_deferred_process() at the end of
nbp_switchdev_unsync_objs() solves the issue.
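
For concreteness, the change I tested looks roughly like this. Take it
as a sketch against my local tree: the three replay calls shown as
context are reproduced from memory, so their exact argument lists may
differ from what you have:

--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ static void nbp_switchdev_unsync_objs(struct net_bridge_port *p, ...)
 	br_switchdev_fdb_replay(br_dev, ctx, false, atomic_nb);
 	br_switchdev_mdb_replay(br_dev, dev, ctx, false, blocking_nb, NULL);
 	br_switchdev_vlan_replay(br_dev, ctx, false, blocking_nb, NULL);
+
+	/* Make sure the device has seen all pending deferred events
+	 * (e.g. the MDB deletions enqueued by the replay above) before
+	 * it is disassociated from the bridge. When the port leaves via
+	 * an intermediary LAG, nothing else drains the deferred queue
+	 * on its behalf.
+	 */
+	switchdev_deferred_process();
 }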
>> > See how the analysis in the commit message changes the patch?
>>
>> Suddenly the novice was enlightened.
>
> This, to me, reads like unnecessary sarcasm.

Your question, to me, read like unnecessary condescension.

> I just mean that I don't think you spent enough time to explore the
> problem space in the commit message. It is a very important element of
> a patch, because it is very rare for someone to solve a problem which
> is not even formulated. I actually tried to be helpful by pointing out
> where things might not work out. I'm sorry if it did not appear that
> way.

I am extremely grateful for your help, truly! That is why I complimented
you for pointing out the second issue to me. Something about the
phrasing of that question really rubbed me the wrong way, though. I
should have just let it go; I'm sorry.