Hi Vladimir,
On 4/5/2024 7:07 AM, Vladimir Oltean wrote:
On Thu, Apr 04, 2024 at 04:43:38PM -0400, Joseph Huang wrote:
Hi Vladimir,
On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
Can you comment on the feasibility/infeasibility of Tobias' proposal of:
"The bridge could just provide some MDB iterator to save us from having
to cache all the configured groups."?
https://lore.kernel.org/netdev/87sg31n04a.fsf@xxxxxxxxxxxxxx/
What is done here will have to be scaled to many drivers - potentially
all existing DSA ones, as far as I'm aware.
I thought about implementing an MDB iterator as suggested by Tobias, but I'm
a bit concerned about the coherence of these MDB objects. In theory, when
the device driver is trying to act on an event, the source of the trigger
may have changed its state in the bridge already.
Yes, this is the result of SWITCHDEV_F_DEFER, used by both
SWITCHDEV_ATTR_ID_PORT_MROUTER and SWITCHDEV_OBJ_ID_PORT_MDB.
If, upon receiving an event in the device driver, we iterate over what
the bridge has at that instant, the differences between the worlds as
seen by the bridge and the device driver might lead to some unexpected
results.
Translated: iterating over bridge MDB objects needs to be serialized
with new switchdev events by acquiring rtnl_lock(). Then, once switchdev
events are temporarily blocked, the pending ones need to be flushed
using switchdev_deferred_process(), so resync the bridge state with the
driver state. Once the resync is done, the iteration is safe until
rtnl_unlock().
Applied to our case, the MDB iterator is needed in mv88e6xxx_port_mrouter().
This is already called with rtnl_lock() acquired. The resync procedure
will indirectly call mv88e6xxx_port_mdb_add()/mv88e6xxx_port_mdb_del()
through switchdev_deferred_process(), and then the walk is consistent
for the remainder of the mv88e6xxx_port_mrouter() function.
A helper which does this is what would be required - an iterator
function which calls an int (*cb)(struct net_device *brport, const struct switchdev_obj_port_mdb *mdb)
for each MDB entry. The DSA core could then offer some post-processing
services over this API, to recover the struct dsa_port associated with
the bridge port (in the LAG case they aren't the same) and the address
database associated with the bridge.
Do you think there would be unexpected results even if we did this?
br_switchdev_mdb_replay() needs to handle a similarly complicated
situation of synchronizing with deferred MDB events.
>> However, if we cache the MDB objects in the device driver, at least
the order in which the events took place will be coherent and at any
give time the state of the MDB objects in the device driver can be
guaranteed to be sane. This is also the approach the prestera device
driver took.
Not contesting this, but I wouldn't like to see MDBs cached in each
device driver just for this. Switchdev is not very high on the list of
APIs which are easy to use, and making MDB caching a requirement
(for the common case that MDB entry destinations need software fixups
with the mrouter ports) isn't exactly going to make that any better.
Others' opinion may differ, but mine is that core offload APIs need to
consider what hardware is available in the real world, make the common
case easy, and the advanced cases possible. Rather than make every case
"advanced" :)
Just throwing some random ideas out there. Do you think it would make
more sense if this whole solution (rtnl_lock, iterator cb,...etc.) is
moved up to DSA so that other DSA drivers could benefit from it? I
thought about implement it (not the iterator, the current form) in DSA
at first, but I'm not sure how other drivers would behave, so I did it
with mv instead.
I guess the question is, is the current limitation (mrouter not properly
offloaded) an issue specific to mv or is it a limitation of hardware
offloading in general? I tend to think it's the latter.
But then again, if we move it to DSA, we would lose the benefit of the
optimization of consolidating multiple register writes into just one (as
done in patch 10 currently), unless we add a new switch op which takes a
portvec instead of a port when modifying mdb's.