Re: [RFC PATCH net-next 00/19] bridge: mcast: Extensions for EVPN

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 18/10/2022 15:04, Ido Schimmel wrote:
> Posting as RFC to show the whole picture. Will split into smaller
> submissions for v1 and add selftests.
> 
> tl;dr
> =====
> 
> This patchset creates feature parity between user space and the kernel
> and allows the former to install and replace MDB port group entries with
> a source list and associated filter mode. This is required for EVPN use
> cases where multicast state is not derived from snooped IGMP/MLD
> packets, but instead derived from EVPN routes exchanged by the control
> plane in user space.
> 
> Background
> ==========
> 
> IGMPv3 [1] and MLDv2 [2] differ from earlier versions of the protocols
> in that they add support for source-specific multicast. That is, hosts
> can advertise interest in listening to a particular multicast address
> only from specific source addresses or from all sources except for
> specific source addresses.
> 
> In kernel 5.10 [3][4], the bridge driver gained the ability to snoop
> IGMPv3/MLDv2 packets and install corresponding MDB port group entries.
> For example, a snooped IGMPv3 Membership Report that contains a single
> MODE_IS_EXCLUDE record for group 239.10.10.10 with sources 192.0.2.1,
> 192.0.2.2, 192.0.2.20 and 192.0.2.21 would trigger the creation of these
> entries:
> 
>  # bridge -d mdb show
>  dev br0 port veth1 grp 239.10.10.10 src 192.0.2.21 temp filter_mode include proto kernel  blocked
>  dev br0 port veth1 grp 239.10.10.10 src 192.0.2.20 temp filter_mode include proto kernel  blocked
>  dev br0 port veth1 grp 239.10.10.10 src 192.0.2.2 temp filter_mode include proto kernel  blocked
>  dev br0 port veth1 grp 239.10.10.10 src 192.0.2.1 temp filter_mode include proto kernel  blocked
>  dev br0 port veth1 grp 239.10.10.10 temp filter_mode exclude source_list 192.0.2.21/0.00,192.0.2.20/0.00,192.0.2.2/0.00,192.0.2.1/0.00 proto kernel
> 
> While the kernel can install and replace entries with a filter mode and
> source list, user space cannot. It can only add EXCLUDE entries with an
> empty source list, which is sufficient for IGMPv2/MLDv1, but not for
> IGMPv3/MLDv2.
> 
> Use cases where the multicast state is not derived from snooped packets,
> but instead derived from routes exchanged by the user space control
> plane require feature parity between user space and the kernel in terms
> of MDB configuration. Such a use case is detailed in the next section.
> 
> Motivation
> ==========
> 
> RFC 7432 [5] defines a "MAC/IP Advertisement route" (type 2) [6] that
> allows NVE switches in the EVPN network to advertise and learn
> reachability information for unicast MAC addresses. Traffic destined to
> a unicast MAC address can therefore be selectively forwarded to a single
> NVE switch behind which the MAC is located.
> 
> The same is not true for IP multicast traffic. Such traffic is simply
> flooded as BUM to all NVE switches in the broadcast domain (BD),
> regardless if a switch has interested receivers for the multicast stream
> or not. This is especially problematic for overlay networks that make
> heavy use of multicast.
> 
> The issue is addressed by RFC 9251 [7] that defines a "Selective
> Multicast Ethernet Tag Route" (type 6) [8] which allows NVE switches in
> the EVPN network to advertise multicast streams that they are interested
> in. This is done by having each switch suppress IGMP/MLD packets from
> being transmitted to the NVE network and instead communicate the
> information over BGP to other switches.
> 
> As far as the bridge driver is concerned, the above means that the
> multicast state (i.e., {multicast address, group timer, filter-mode,
> (source records)}) for the VXLAN bridge port is not populated by the
> kernel from snooped IGMP/MLD packets (they are suppressed), but instead
> by user space. Specifically, by the routing daemon that is exchanging
> EVPN routes with other NVE switches.
> 
> Changes are obviously also required in the VXLAN driver, but they are
> the subject of future patchsets. See the "Future work" section.
> 
> Implementation
> ==============
> 
> The user interface is extended to allow user space to specify the filter
> mode of the MDB port group entry and its source list. Replace support is
> also added so that user space would not need to remove an entry and
> re-add it only to edit its source list or filter mode, as that would
> result in packet loss. Example usage:
> 
>  # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent \
> 	source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra
>  # bridge -d -s mdb show
>  dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra  blocked    0.00
>  dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra  blocked    0.00
>  dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra     0.00
> 
> The netlink interface is extended with a few new attributes in the
> RTM_NEWMDB request message:
> 
> [ struct nlmsghdr ]
> [ struct br_port_msg ]
> [ MDBA_SET_ENTRY ]
> 	struct br_mdb_entry
> [ MDBA_SET_ENTRY_ATTRS ]
> 	[ MDBE_ATTR_SOURCE ]
> 		struct in_addr / struct in6_addr
> 	[ MDBE_ATTR_SRC_LIST ]		// new
> 		[ MDBE_SRC_LIST_ENTRY ]
> 			[ MDBE_SRCATTR_ADDRESS ]
> 				struct in_addr / struct in6_addr
> 		[ ...]
> 	[ MDBE_ATTR_GROUP_MODE ]	// new
> 		u8
> 	[ MDBE_ATTR_RTPORT ]		// new
> 		u8
> 
> No changes are required in RTM_NEWMDB responses and notifications, as
> all the information can already be dumped by the kernel today.
> 
> Testing
> =======
> 
> Tested with existing bridge multicast selftests: bridge_igmp.sh,
> bridge_mdb_port_down.sh, bridge_mdb.sh, bridge_mld.sh,
> bridge_vlan_mcast.sh.
> 
> Will add dedicated selftests in v1.
> 
> Patchset overview
> =================
> 
> Patches #1-#8 are non-functional preparations aimed at making it easier
> to add additional netlink attributes later on. They create an MDB
> configuration structure into which netlink messages are parsed into.
> The structure is then passed in the entry creation / deletion call chain
> instead of passing the netlink attributes themselves. The same pattern
> is used by other rtnetlink objects such as routes and nexthops.
> 
> I initially tried to extend the current code, but it proved to be too
> difficult, which is why I decided to refactor it to the extensible and
> familiar pattern used by other rtnetlink objects.
> 
> The plan is to submit these patches separately for v1.
> 
> Patches #9-#15 are an additional set of non-functional preparations for
> the core changes in the last patches.
> 
> Patches #16-#17 allow user space to install (*, G) entries with a source
> list and associated filter mode. Specifically, patch #16 adds the
> necessary kernel plumbing and patch #17 exposes the new functionality to
> user space via a few new attributes.
> 
> Patch #18 allows user space to specify the routing protocol of new MDB
> port group entries so that a routing daemon could differentiate between
> entries installed by it and those installed by an administrator.
> 
> Patch #19 allows user space to replace MDB port group entries. This is
> useful, for example, when user space wants to add a new source to a
> source list. Instead of deleting a (*, G) entry and re-adding it with an
> extended source list (which would result in packet loss), user space can
> simply replace the current entry.
> 
> Future work
> ===========
> 
> The VXLAN driver will need to be extended with an MDB so that it could
> selectively forward IP multicast traffic to NVE switches with interested
> receivers instead of simply flooding it to all switches as BUM.
> 
> The idea is to reuse the existing MDB interface for the VXLAN driver in
> a similar way to how the FDB interface is shared between the bridge and
> VXLAN drivers.
> 
> From command line perspective, configuration will look as follows:
> 
>  # bridge mdb add dev br0 port vxlan0 grp 239.1.1.1 permanent \
> 	filter_mode exclude source_list 198.50.100.1,198.50.100.2
> 
>  # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \
> 	filter_mode include source_list 198.50.100.3,198.50.100.4 \
> 	dst 192.0.2.1 dst_port 4789 src_vni 2
> 
>  # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \
> 	filter_mode exclude source_list 198.50.100.1,198.50.100.2 \
> 	dst 192.0.2.2 dst_port 4789 src_vni 2
> 
> Where the first command is enabled by this set, but the next two will be
> the subject of future work.
> 
> From netlink perspective, the existing PF_BRIDGE/RTM_*MDB messages will
> be extended to the VXLAN driver. This means that a few new attributes
> will be added (e.g., 'MDBE_ATTR_SRC_VNI') and that the handlers for
> these messages will need to move to net/core/rtnetlink.c. The rtnetlink
> code will call into the appropriate driver based on the ifindex
> specified in the ancillary header.
> 
> iproute2 patches can be found here [9].
> 
> [1] https://datatracker.ietf.org/doc/html/rfc3376
> [2] https://www.rfc-editor.org/rfc/rfc3810
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6af52ae2ed14a6bc756d5606b29097dfd76740b8
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68d4fd30c83b1b208e08c954cd45e6474b148c87
> [5] https://datatracker.ietf.org/doc/html/rfc7432
> [6] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2
> [7] https://datatracker.ietf.org/doc/html/rfc9251
> [8] https://datatracker.ietf.org/doc/html/rfc9251#section-9.1
> [9] https://github.com/idosch/iproute2/commits/submit/mdb_rfc_v1
> 
> Ido Schimmel (19):
>   bridge: mcast: Centralize netlink attribute parsing
>   bridge: mcast: Remove redundant checks
>   bridge: mcast: Use MDB configuration structure where possible
>   bridge: mcast: Propagate MDB configuration structure further
>   bridge: mcast: Use MDB group key from configuration structure
>   bridge: mcast: Remove br_mdb_parse()
>   bridge: mcast: Move checks out of critical section
>   bridge: mcast: Remove redundant function arguments
>   bridge: mcast: Do not derive entry type from its filter mode
>   bridge: mcast: Split (*, G) and (S, G) addition into different
>     functions
>   bridge: mcast: Place netlink policy before validation functions
>   bridge: mcast: Add a centralized error path
>   bridge: mcast: Expose br_multicast_new_group_src()
>   bridge: mcast: Add a flag for user installed source entries
>   bridge: mcast: Avoid arming group timer when (S, G) corresponds to a
>     source
>   bridge: mcast: Add support for (*, G) with a source list and filter
>     mode
>   bridge: mcast: Allow user space to add (*, G) with a source list and
>     filter mode
>   bridge: mcast: Allow user space to specify MDB entry routing protocol
>   bridge: mcast: Support replacement of MDB port group entries
> 
>  include/uapi/linux/if_bridge.h |  21 +
>  net/bridge/br_mdb.c            | 821 ++++++++++++++++++++++++---------
>  net/bridge/br_multicast.c      |   5 +-
>  net/bridge/br_private.h        |  21 +
>  4 files changed, 649 insertions(+), 219 deletions(-)
> 

Good work, the set looks good to me. It's a natural extension to allow user-space to
manipulate such mcast groups. As we discussed privately it's only missing the selftests. :)

Cheers,
 Nik




[Index of Archives]     [Netdev]     [AoE Tools]     [Linux Wireless]     [Kernel Newbies]     [Security]     [Linux for Hams]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux RAID]     [Linux Admin]     [Samba]     [Video 4 Linux]

  Powered by Linux