On Fri, Mar 19, 2021 at 01:18, Vladimir Oltean <olteanv@xxxxxxxxx> wrote: > From: Vladimir Oltean <vladimir.oltean@xxxxxxx> > > On reception of an skb, the bridge checks if it was marked as 'already > forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it > is, it puts a mark of its own on that skb, with the switchdev mark of > the ingress port. Then during forwarding, it enforces that the egress > port must have a different switchdev mark than the ingress one (this is > done in nbp_switchdev_allowed_egress). > > Non-switchdev drivers don't report any physical switch id (neither > through devlink nor .ndo_get_port_parent_id), therefore the bridge > assigns them a switchdev mark of 0, and packets coming from them will > always have skb->offload_fwd_mark = 0. So there aren't any restrictions. > > Problems appear due to the fact that DSA would like to perform software > fallback for bonding and team interfaces that the physical switch cannot > offload. > > +-- br0 -+ > / / | \ > / / | \ > / / | \ > / / | \ > / / | \ > / | | bond0 > / | | / \ > swp0 swp1 swp2 swp3 swp4 > > There, it is desirable that the presence of swp3 and swp4 under a > non-offloaded LAG does not preclude us from doing hardware bridging > beteen swp0, swp1 and swp2. The bandwidth of the CPU is often times high > enough that software bridging between {swp0,swp1,swp2} and bond0 is not > impractical. > > But this creates an impossible paradox given the current way in which > port switchdev marks are assigned. When the driver receives a packet > from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to > something. > > - If we set it to 0, then the bridge will forward it towards swp1, swp2 > and bond0. But the switch has already forwarded it towards swp1 and > swp2 (not to bond0, remember, that isn't offloaded, so as far as the > switch is concerned, ports swp3 and swp4 are not looking up the FDB, > and the entire bond0 is a destination that is strictly behind the > CPU). But we don't want duplicated traffic towards swp1 and swp2, so > it's not ok to set skb->offload_fwd_mark = 0. > > - If we set it to 1, then the bridge will not forward the skb towards > the ports with the same switchdev mark, i.e. not to swp1, swp2 and > bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should > have forwarded the skb there. > > So the real issue is that bond0 will be assigned the same switchdev mark > as {swp0,swp1,swp2}, because the function that assigns switchdev marks > to bridge ports, nbp_switchdev_mark_set, recurses through bond0's lower > interfaces until it finds something that implements devlink. > > A solution is to give the bridge explicit hints as to what switchdev > mark it should use for each port. > > Currently, the bridging offload is very 'silent': a driver registers a > netdevice notifier, which is put on the netns's notifier chain, and > which sniffs around for NETDEV_CHANGEUPPER events where the upper is a > bridge, and the lower is an interface it knows about (one registered by > this driver, normally). Then, from within that notifier, it does a bunch > of stuff behind the bridge's back, without the bridge necessarily > knowing that there's somebody offloading that port. It looks like this: > > ip link set swp0 master br0 > | > v > bridge calls netdev_master_upper_dev_link > | > v > call_netdevice_notifiers > | > v > dsa_slave_netdevice_event > | > v > oh, hey! it's for me! > | > v > .port_bridge_join > > What we do to solve the conundrum is to be less silent, and emit a > notification back. Something like this: > > ip link set swp0 master br0 > | > v > bridge calls netdev_master_upper_dev_link > | > v bridge: Aye! I'll use this > call_netdevice_notifiers ^ ppid as the > | | switchdev mark for > v | this port, and zero > dsa_slave_netdevice_event | if I got nothing. > | | > v | > oh, hey! it's for me! | > | | > v | > .port_bridge_join | > | | > +------------------------+ > switchdev_bridge_port_offload(swp0) > > Then stacked interfaces (like bond0 on top of swp3/swp4) would be > treated differently in DSA, depending on whether we can or cannot > offload them. > > The offload case: > > ip link set bond0 master br0 > | > v > bridge calls netdev_master_upper_dev_link > | > v bridge: Aye! I'll use this > call_netdevice_notifiers ^ ppid as the > | | switchdev mark for > v | bond0. > dsa_slave_netdevice_event | Coincidentally (or not), > | | bond0 and swp0, swp1, swp2 > v | all have the same switchdev > hmm, it's not quite for me, | mark now, since the ASIC > but my driver has already | is able to forward towards > called .port_lag_join | all these ports in hw. > for it, because I have | > a port with dp->lag_dev == bond0. | > | | > v | > .port_bridge_join | > for swp3 and swp4 | > | | > +------------------------+ > switchdev_bridge_port_offload(bond0) > > And the non-offload case: > > ip link set bond0 master br0 > | > v > bridge calls netdev_master_upper_dev_link > | > v bridge waiting: > call_netdevice_notifiers ^ huh, switchdev_bridge_port_offload > | | wasn't called, okay, I'll use a > v | switchdev mark of zero for this one. > dsa_slave_netdevice_event : Then packets received on swp0 will > | : not be forwarded towards swp1, but > v : they will towards bond0. > it's not for me, but > bond0 is an upper of swp3 > and swp4, but their dp->lag_dev > is NULL because they couldn't > offload it. > > Basically we can draw the conclusion that the lowers of a bridge port > can come and go, so depending on the configuration of lowers for a > bridge port, it can dynamically toggle between offloaded and unoffloaded. > Therefore, we need an equivalent switchdev_bridge_port_unoffload too. > > This patch changes the way any switchdev driver interacts with the > bridge. From now on, everybody needs to call switchdev_bridge_port_offload, > otherwise the bridge will treat the port as non-offloaded and allow > software flooding to other ports from the same ASIC. > > Signed-off-by: Vladimir Oltean <vladimir.oltean@xxxxxxx> > --- > .../ethernet/freescale/dpaa2/dpaa2-switch.c | 4 +- > .../marvell/prestera/prestera_switchdev.c | 7 ++ > .../mellanox/mlxsw/spectrum_switchdev.c | 4 +- > drivers/net/ethernet/mscc/ocelot_net.c | 4 +- > drivers/net/ethernet/rocker/rocker_ofdpa.c | 8 +- > drivers/net/ethernet/ti/am65-cpsw-nuss.c | 7 +- > drivers/net/ethernet/ti/cpsw_new.c | 6 +- Why is not net/dsa included in this change? > include/linux/if_bridge.h | 16 ++++ > net/bridge/br_if.c | 11 +-- > net/bridge/br_private.h | 8 +- > net/bridge/br_switchdev.c | 94 ++++++++++++++++--- > 11 files changed, 138 insertions(+), 31 deletions(-) > > diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c > index 2fd05dd18d46..f20556178e33 100644 > --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c > +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c > @@ -1518,7 +1518,7 @@ static int dpaa2_switch_port_bridge_join(struct net_device *netdev, > if (err) > goto err_egress_flood; > > - return 0; > + return switchdev_bridge_port_offload(netdev, NULL); > > err_egress_flood: > dpaa2_switch_port_set_fdb(port_priv, NULL); > @@ -1552,6 +1552,8 @@ static int dpaa2_switch_port_bridge_leave(struct net_device *netdev) > struct ethsw_core *ethsw = port_priv->ethsw_data; > int err; > > + switchdev_bridge_port_unoffload(netdev); > + > /* First of all, fast age any learn FDB addresses on this switch port */ > dpaa2_switch_port_fast_age(port_priv); > > diff --git a/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c b/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c > index 49e052273f30..0b0d5db7b85b 100644 > --- a/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c > +++ b/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c > @@ -443,6 +443,10 @@ static int prestera_port_bridge_join(struct prestera_port *port, > goto err_brport_create; > } > > + err = switchdev_bridge_port_offload(port->dev, NULL); > + if (err) > + goto err_brport_offload; > + > if (bridge->vlan_enabled) > return 0; > > @@ -453,6 +457,7 @@ static int prestera_port_bridge_join(struct prestera_port *port, > return 0; > > err_port_join: > +err_brport_offload: > prestera_bridge_port_put(br_port); > err_brport_create: > prestera_bridge_put(bridge); > @@ -520,6 +525,8 @@ static void prestera_port_bridge_leave(struct prestera_port *port, > if (!br_port) > return; > > + switchdev_bridge_port_unoffload(port->dev); > + > bridge = br_port->bridge; > > if (bridge->vlan_enabled) > diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c > index 23b7e8d6386b..7fa0b3653819 100644 > --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c > +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c > @@ -2326,7 +2326,7 @@ int mlxsw_sp_port_bridge_join(struct mlxsw_sp_port *mlxsw_sp_port, > if (err) > goto err_port_join; > > - return 0; > + return switchdev_bridge_port_offload(brport_dev, extack); > > err_port_join: > mlxsw_sp_bridge_port_put(mlxsw_sp->bridge, bridge_port); > @@ -2348,6 +2348,8 @@ void mlxsw_sp_port_bridge_leave(struct mlxsw_sp_port *mlxsw_sp_port, > if (!bridge_port) > return; > > + switchdev_bridge_port_unoffload(brport_dev); > + > bridge_device->ops->port_leave(bridge_device, bridge_port, > mlxsw_sp_port); > mlxsw_sp_bridge_port_put(mlxsw_sp->bridge, bridge_port); > diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c > index d38ffc7cf5f0..b917d9dd8a6a 100644 > --- a/drivers/net/ethernet/mscc/ocelot_net.c > +++ b/drivers/net/ethernet/mscc/ocelot_net.c > @@ -1213,7 +1213,7 @@ static int ocelot_netdevice_bridge_join(struct net_device *dev, > if (err) > goto err_switchdev_sync; > > - return 0; > + return switchdev_bridge_port_offload(brport_dev, extack); > > err_switchdev_sync: > ocelot_port_bridge_leave(ocelot, port, bridge); > @@ -1234,6 +1234,8 @@ static int ocelot_netdevice_bridge_leave(struct net_device *dev, > if (err) > return err; > > + switchdev_bridge_port_unoffload(brport_dev); > + > ocelot_port_bridge_leave(ocelot, port, bridge); > > return 0; > diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c > index 967a634ee9ac..9b6d7cac112b 100644 > --- a/drivers/net/ethernet/rocker/rocker_ofdpa.c > +++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c > @@ -2592,13 +2592,19 @@ static int ofdpa_port_bridge_join(struct ofdpa_port *ofdpa_port, > > ofdpa_port->bridge_dev = bridge; > > - return ofdpa_port_vlan_add(ofdpa_port, OFDPA_UNTAGGED_VID, 0); > + err = ofdpa_port_vlan_add(ofdpa_port, OFDPA_UNTAGGED_VID, 0); > + if (err) > + return err; > + > + return switchdev_bridge_port_offload(ofdpa_port->dev, NULL); > } > > static int ofdpa_port_bridge_leave(struct ofdpa_port *ofdpa_port) > { > int err; > > + switchdev_bridge_port_unoffload(ofdpa_port->dev); > + > err = ofdpa_port_vlan_del(ofdpa_port, OFDPA_UNTAGGED_VID, 0); > if (err) > return err; > diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c > index 638d7b03be4b..fe2e38971acc 100644 > --- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c > +++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c > @@ -7,6 +7,7 @@ > > #include <linux/clk.h> > #include <linux/etherdevice.h> > +#include <linux/if_bridge.h> > #include <linux/if_vlan.h> > #include <linux/interrupt.h> > #include <linux/kernel.h> > @@ -2082,6 +2083,7 @@ static int am65_cpsw_netdevice_port_link(struct net_device *ndev, struct net_dev > { > struct am65_cpsw_common *common = am65_ndev_to_common(ndev); > struct am65_cpsw_ndev_priv *priv = am65_ndev_to_priv(ndev); > + int err; > > if (!common->br_members) { > common->hw_bridge_dev = br_ndev; > @@ -2097,7 +2099,8 @@ static int am65_cpsw_netdevice_port_link(struct net_device *ndev, struct net_dev > > am65_cpsw_port_offload_fwd_mark_update(common); > > - return NOTIFY_DONE; > + err = switchdev_bridge_port_offload(ndev, NULL); > + return notifier_to_errno(err); > } > > static void am65_cpsw_netdevice_port_unlink(struct net_device *ndev) > @@ -2105,6 +2108,8 @@ static void am65_cpsw_netdevice_port_unlink(struct net_device *ndev) > struct am65_cpsw_common *common = am65_ndev_to_common(ndev); > struct am65_cpsw_ndev_priv *priv = am65_ndev_to_priv(ndev); > > + switchdev_bridge_port_unoffload(ndev); > + > common->br_members &= ~BIT(priv->port->port_id); > > am65_cpsw_port_offload_fwd_mark_update(common); > diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c > index 58a64313ac00..6347532fb39d 100644 > --- a/drivers/net/ethernet/ti/cpsw_new.c > +++ b/drivers/net/ethernet/ti/cpsw_new.c > @@ -1508,6 +1508,7 @@ static int cpsw_netdevice_port_link(struct net_device *ndev, > { > struct cpsw_priv *priv = netdev_priv(ndev); > struct cpsw_common *cpsw = priv->cpsw; > + int err; > > if (!cpsw->br_members) { > cpsw->hw_bridge_dev = br_ndev; > @@ -1523,7 +1524,8 @@ static int cpsw_netdevice_port_link(struct net_device *ndev, > > cpsw_port_offload_fwd_mark_update(cpsw); > > - return NOTIFY_DONE; > + err = switchdev_bridge_port_offload(ndev, NULL); > + return notifier_to_errno(err); > } > > static void cpsw_netdevice_port_unlink(struct net_device *ndev) > @@ -1531,6 +1533,8 @@ static void cpsw_netdevice_port_unlink(struct net_device *ndev) > struct cpsw_priv *priv = netdev_priv(ndev); > struct cpsw_common *cpsw = priv->cpsw; > > + switchdev_bridge_port_unoffload(ndev); > + > cpsw->br_members &= ~BIT(priv->emac_port); > > cpsw_port_offload_fwd_mark_update(cpsw); > diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h > index ea176c508c0d..4fbee6d5fc16 100644 > --- a/include/linux/if_bridge.h > +++ b/include/linux/if_bridge.h > @@ -196,4 +196,20 @@ static inline int br_fdb_replay(struct net_device *br_dev, > } > #endif > > +#if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_NET_SWITCHDEV) > +int switchdev_bridge_port_offload(struct net_device *dev, > + struct netlink_ext_ack *extack); > +int switchdev_bridge_port_unoffload(struct net_device *dev); > +#else > +int switchdev_bridge_port_offload(struct net_device *dev, > + struct netlink_ext_ack *extack) > +{ > + return 0; > +} > + > +int switchdev_bridge_port_unoffload(struct net_device *dev) > +{ > +} > +#endif > + > #endif > diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c > index f7d2f472ae24..930a09f27e0d 100644 > --- a/net/bridge/br_if.c > +++ b/net/bridge/br_if.c > @@ -643,10 +643,6 @@ int br_add_if(struct net_bridge *br, struct net_device *dev, > if (err) > goto err5; > > - err = nbp_switchdev_mark_set(p); > - if (err) > - goto err6; > - > dev_disable_lro(dev); > > list_add_rcu(&p->list, &br->port_list); > @@ -671,13 +667,13 @@ int br_add_if(struct net_bridge *br, struct net_device *dev, > */ > err = dev_pre_changeaddr_notify(br->dev, dev->dev_addr, extack); > if (err) > - goto err7; > + goto err6; > } > > err = nbp_vlan_init(p, extack); > if (err) { > netdev_err(dev, "failed to initialize vlan filtering on this port\n"); > - goto err7; > + goto err6; > } > > spin_lock_bh(&br->lock); > @@ -700,11 +696,10 @@ int br_add_if(struct net_bridge *br, struct net_device *dev, > > return 0; > > -err7: > +err6: > list_del_rcu(&p->list); > br_fdb_delete_by_port(br, p, 0, 1); > nbp_update_port_count(br); > -err6: > netdev_upper_dev_unlink(dev, br->dev); > err5: > dev->priv_flags &= ~IFF_BRIDGE_PORT; > diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h > index d7d167e10b70..1982b5887d0f 100644 > --- a/net/bridge/br_private.h > +++ b/net/bridge/br_private.h > @@ -326,8 +326,10 @@ struct net_bridge_port { > #ifdef CONFIG_NET_POLL_CONTROLLER > struct netpoll *np; > #endif > + int offload_count; Should this be conditional on CONFIG_NET_SWITCHDEV? > #ifdef CONFIG_NET_SWITCHDEV > int offload_fwd_mark; > + struct netdev_phys_item_id ppid; > #endif > u16 group_fwd_mask; > u16 backup_redirected_cnt; > @@ -1572,7 +1574,6 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; } > > /* br_switchdev.c */ > #ifdef CONFIG_NET_SWITCHDEV > -int nbp_switchdev_mark_set(struct net_bridge_port *p); > void nbp_switchdev_frame_mark(const struct net_bridge_port *p, > struct sk_buff *skb); > bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p, > @@ -1592,11 +1593,6 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb) > skb->offload_fwd_mark = 0; > } > #else > -static inline int nbp_switchdev_mark_set(struct net_bridge_port *p) > -{ > - return 0; > -} > - > static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p, > struct sk_buff *skb) > { > diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c > index b89503832fcc..4cf7902f056c 100644 > --- a/net/bridge/br_switchdev.c > +++ b/net/bridge/br_switchdev.c > @@ -8,37 +8,109 @@ > > #include "br_private.h" > > -static int br_switchdev_mark_get(struct net_bridge *br, struct net_device *dev) > +static int br_switchdev_mark_get(struct net_bridge *br, > + struct net_bridge_port *new_nbp) > { > struct net_bridge_port *p; > > /* dev is yet to be added to the port list. */ > list_for_each_entry(p, &br->port_list, list) { > - if (netdev_port_same_parent_id(dev, p->dev)) > + if (!p->offload_count) > + continue; > + > + if (netdev_phys_item_id_same(&p->ppid, &new_nbp->ppid)) > return p->offload_fwd_mark; > } > > return ++br->offload_fwd_mark; > } > > -int nbp_switchdev_mark_set(struct net_bridge_port *p) > +static int nbp_switchdev_mark_set(struct net_bridge_port *p, > + struct netdev_phys_item_id ppid, > + struct netlink_ext_ack *extack) > +{ > + if (p->offload_count) { > + /* Prevent unsupported configurations such as a bridge port > + * which is a bonding interface, and the member ports are from > + * different hardware switches. > + */ > + if (!netdev_phys_item_id_same(&p->ppid, &ppid)) { > + NL_SET_ERR_MSG_MOD(extack, > + "Same bridge port cannot be offloaded by two physical switches"); > + return -EBUSY; > + } > + /* Be tolerant with drivers that call SWITCHDEV_BRPORT_OFFLOADED > + * more than once for the same bridge port, such as when the > + * bridge port is an offloaded bonding/team interface. > + */ > + p->offload_count++; > + return 0; > + } > + > + p->ppid = ppid; > + p->offload_count = 1; > + p->offload_fwd_mark = br_switchdev_mark_get(p->br, p); > + > + return 0; > +} > + > +static void nbp_switchdev_mark_clear(struct net_bridge_port *p, > + struct netdev_phys_item_id ppid) > +{ > + if (WARN_ON(!netdev_phys_item_id_same(&p->ppid, &ppid))) > + return; > + if (WARN_ON(!p->offload_count)) > + return; > + > + p->offload_count--; > + if (p->offload_count) > + return; > + > + p->offload_fwd_mark = 0; > +} > + > +/* Let the bridge know that this port is offloaded, so that it can use > + * the port parent id obtained by recursion to determine the bridge > + * port's switchdev mark. > + */ > +int switchdev_bridge_port_offload(struct net_device *dev, > + struct netlink_ext_ack *extack) > { > - struct netdev_phys_item_id ppid = { }; > + struct netdev_phys_item_id ppid; > + struct net_bridge_port *p; > int err; > > - ASSERT_RTNL(); > + p = br_port_get_rtnl(dev); > + if (!p) > + return -ENODEV; > > - err = dev_get_port_parent_id(p->dev, &ppid, true); > - if (err) { > - if (err == -EOPNOTSUPP) > - return 0; > + err = dev_get_port_parent_id(dev, &ppid, true); > + if (err) > + return err; > + > + return nbp_switchdev_mark_set(p, ppid, extack); > +} > +EXPORT_SYMBOL_GPL(switchdev_bridge_port_offload); > + > +int switchdev_bridge_port_unoffload(struct net_device *dev) > +{ > + struct netdev_phys_item_id ppid; > + struct net_bridge_port *p; > + int err; > + Should we ASSERT_RTNL here as well? > + p = br_port_get_rtnl(dev); > + if (!p) > + return -ENODEV; > + > + err = dev_get_port_parent_id(dev, &ppid, true); > + if (err) > return err; > - } > > - p->offload_fwd_mark = br_switchdev_mark_get(p->br, p->dev); > + nbp_switchdev_mark_clear(p, ppid); > > return 0; > } > +EXPORT_SYMBOL_GPL(switchdev_bridge_port_unoffload); > > void nbp_switchdev_frame_mark(const struct net_bridge_port *p, > struct sk_buff *skb) > -- > 2.25.1