Re: [PATCH net-next v2] bond: add mac filter option for balance-xor

Jonathan Toppins <jtoppins@xxxxxxxxxx> · Fri, 13 May 2022 11:42:24 -0400

Hi Nik, thanks for the review. Responses below.

On 5/5/22 08:14, Nikolay Aleksandrov wrote:
On 04/05/2022 21:47, Jonathan Toppins wrote:
Implement a MAC filter that prevents duplicate frame delivery when
handling BUM traffic. This attempts to partially replicate OvS SLB
Bonding[1] like functionality without requiring significant change
in the Linux bridging code.

A typical network setup for this feature would be:

             .--------------------------------------------.
             |         .--------------------.             |
             |         |                    |             |
        .-------------------.               |             |
        |    | Bond 0  |    |               |             |
        | .--'---. .---'--. |               |             |
   .----|-| eth0 |-| eth1 |-|----.    .-----+----.   .----+------.
   |    | '------' '------' |    |    | Switch 1 |   | Switch 2  |
   |    '---,---------------'    |    |          +---+           |
   |       /                     |    '----+-----'   '----+------'
   |  .---'---.    .------.      |         |              |
   |  |  br0  |----| VM 1 |      |      ~~~~~~~~~~~~~~~~~~~~~
   |  '-------'    '------'      |     (                     )
   |      |        .------.      |     ( Rest of Network     )
   |      '--------| VM # |      |     (_____________________)
   |               '------'      |
   |  Host 1                     |
   '-----------------------------'

Where 'VM1' and 'VM#' are hosts connected to a Linux bridge, br0, with
bond0 and its associated links, eth0 & eth1, provide ingress/egress. One
can assume bond0, br1, and hosts VM1 to VM# are all contained in a
single box, as depicted. Interfaces eth0 and eth1 provide redundant
connections to the data center with the requirement to use all bandwidth
when the system is functioning normally. Switch 1 and Switch 2 are
physical switches that do not implement any advanced L2 management
features such as MLAG, Cisco's VPC, or LACP.

Combining this feature with vlan+srcmac hash policy allows a user to
create an access network without the need to use expensive switches that
support features like Cisco's VCP.

[1] https://docs.openvswitch.org/en/latest/topics/bonding/#slb-bonding

Co-developed-by: Long Xin <lxin@xxxxxxxxxx>
Signed-off-by: Long Xin <lxin@xxxxxxxxxx>
Signed-off-by: Jonathan Toppins <jtoppins@xxxxxxxxxx>
---

Notes:
     v2:
      * dropped needless abstraction functions and put code in module init
      * renamed variable "rc" to "ret" to stay consistent with most of the
        code
      * fixed parameter setting management, when arp-monitor is turned on
        this feature will be turned off similar to how miimon and arp-monitor
        interact
      * renamed bond_xor_recv to bond_mac_filter_recv for a little more
        clarity
      * it appears the implied default return code for any bonding recv probe
        must be `RX_HANDLER_ANOTHER`. Changed the default return code of
        bond_mac_filter_recv to use this return value to not break skb
        processing when the skb dev is switched to the bond dev:
          `skb->dev = bond->dev`

  Documentation/networking/bonding.rst  |  19 +++
  drivers/net/bonding/Makefile          |   2 +-
  drivers/net/bonding/bond_mac_filter.c | 201 ++++++++++++++++++++++++++
  drivers/net/bonding/bond_mac_filter.h |  39 +++++
  drivers/net/bonding/bond_main.c       |  27 ++++
  drivers/net/bonding/bond_netlink.c    |  13 ++
  drivers/net/bonding/bond_options.c    |  86 ++++++++++-
  drivers/net/bonding/bonding_priv.h    |   1 +
  include/net/bond_options.h            |   1 +
  include/net/bonding.h                 |   3 +
  include/uapi/linux/if_link.h          |   1 +
  11 files changed, 390 insertions(+), 3 deletions(-)
  create mode 100644 drivers/net/bonding/bond_mac_filter.c
  create mode 100644 drivers/net/bonding/bond_mac_filter.h


Hi Jonathan,
I must mention that this is easily solvable with two very simple ebpf programs, one on egress
to track source macs and one on ingress to filter them, it can also easily be solved by a
user-space agent that adds macs for filtering in many different ways, after all these VMs
run on the host and you don't need bond-specific knowledge to do this. Also you have no visibility
into what is currently being filtered, so it will be difficult to debug. With the above solutions
you already have that. I don't think the bond should be doing any learning or filtering, this is
deviating a lot from its purpose and adds unnecessary complexity.
That being said, if you decide to continue with the set, comments are below...

This is an excellent observation, it does appear this could likely be 
done with eBPF. However, the delivery of such a solution to a user would 
be the difficult part. There appears to be no standard way for attaching 
a program to an interface, it still seems customary to write your own 
custom loader. Where would the user run this loader? In Debian likely in 
a post up hook with ifupdown, in Fedora one would have to write a 
locally custom dispatcher script (assuming Network Manager) that only 
ran the loader for a given interface. In short I do not see a reasonably 
appropriate way to deploy an eBPF program to users with the current 
infrastructure. Also, I am not aware of the bpf syscall supporting 
signed program loading. Signing kernel modules seems popular with some 
distros to identify limits of support and authentication of an 
unmodified system. I suspect similar bpf support might be needed to 
identify support and authentication for deployed programs.

[...]

diff --git a/drivers/net/bonding/bond_mac_filter.c b/drivers/net/bonding/bond_mac_filter.c
new file mode 100644
index 000000000000..e86b2b475df3
--- /dev/null
+++ b/drivers/net/bonding/bond_mac_filter.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Filter received frames based on MAC addresses "behind" the bond.
+ */
+
+#include "bonding_priv.h"
+
+static const struct rhashtable_params bond_rht_params = {
+	.head_offset         = offsetof(struct bond_mac_cache_entry, rhnode),
+	.key_offset          = offsetof(struct bond_mac_cache_entry, key),
+	.key_len             = sizeof(struct mac_addr),
+	.automatic_shrinking = true,
+};
+
+static inline unsigned long hold_time(const struct bonding *bond)

no inlines in .c files, let the compiler do its job

+{
+	return msecs_to_jiffies(5000);
+}
+
+static bool has_expired(const struct bonding *bond,
+			struct bond_mac_cache_entry *mac)
+{
+	return time_before_eq(mac->used + hold_time(bond), jiffies);
+}
+
+static void mac_delete_rcu(struct callback_head *head)
+{
+	kmem_cache_free(bond_mac_cache,
+			container_of(head, struct bond_mac_cache_entry, rcu));
+}
+
+static int mac_delete(struct bonding *bond,
+		      struct bond_mac_cache_entry *entry)
+{
+	int ret;
+
+	ret = rhashtable_remove_fast(bond->mac_filter_tbl,
+				     &entry->rhnode,
+				     bond->mac_filter_tbl->p);
+	set_bit(BOND_MAC_DEAD, &entry->flags);

you don't need the atomic bitops, these flags are all modified and checked
under the entry lock

I need to keep the atomic set_bit if I remove the [use-after-free] 
idiomatic issue later in the file.


+	call_rcu(&entry->rcu, mac_delete_rcu);

all of these entries are queued to be freed, what happens if we unload the bonding
driver before that?

[...]


+
+	rhashtable_walk_enter(bond->mac_filter_tbl, &iter);
+	rhashtable_walk_start(&iter);
+	while ((entry = rhashtable_walk_next(&iter)) != NULL) {
+		if (IS_ERR(entry))
+			continue;
+
+		spin_lock_irqsave(&entry->lock, flags);
+		if (has_expired(bond, entry))
+			mac_delete(bond, entry);
+		spin_unlock_irqrestore(&entry->lock, flags);

deleting entries while holding their own lock is not very idiomatic

[use-after-free] To fix this I made has_expired take the lock, making 
has_expired atomic. Now there is no need to have the critical section 
above and mac_delete can be outside the critical section. This also 
removed the use-after-free bug that would appear if the code were not 
using RCU and cache malloc.


+	bond->mac_filter_tbl = kzalloc(sizeof(*bond->mac_filter_tbl),
+				       GFP_KERNEL);
+	if (!bond->mac_filter_tbl)
+		return -ENOMEM;
+
+	ret = rhashtable_init(bond->mac_filter_tbl, &bond_rht_params);
+	if (ret)
+		kfree(bond->mac_filter_tbl);

on error this is freed, but the pointer is stale and on bond destruction
will be accessed and potentially freed again

set to NULL.

[...]

+static int mac_create(struct bonding *bond, const u8 *addr)
+{
+	struct bond_mac_cache_entry *entry;
+	int ret;
+
+	entry = kmem_cache_alloc(bond_mac_cache, GFP_ATOMIC);
+	if (!entry)
+		return -ENOMEM;
+	spin_lock_init(&entry->lock);
+	memcpy(&entry->key, addr, sizeof(entry->key));
+	entry->used = jiffies;

you must zero the old fields, otherwise you can find stale values from old
structs which were freed

good point, have done.

[...]

diff --git a/drivers/net/bonding/bond_mac_filter.h b/drivers/net/bonding/bond_mac_filter.h
new file mode 100644
index 000000000000..7c968d41b456
--- /dev/null
+++ b/drivers/net/bonding/bond_mac_filter.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * Filter received frames based on MAC addresses "behind" the bond.
+ */
+
+#ifndef _BOND_MAC_FILTER_H
+#define _BOND_MAC_FILTER_H
+#include <net/bonding.h>
+#include <linux/spinlock.h>
+#include <linux/rhashtable.h>
+
+enum {
+	BOND_MAC_DEAD,
+	BOND_MAC_LOCKED,
+	BOND_MAC_STATIC,

What are BOND_MAC_LOCKED or STATIC ? I didn't see them used anywhere.

Stale, was going to use them to allow the user to manually add entries 
but never got around to it. Removed.

[...]

diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 64f7db2627ce..d295903a525b 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c

[...]

@@ -1035,6 +1075,44 @@ static int bond_option_use_carrier_set(struct bonding *bond,
  	return 0;
  }
  
+static int bond_option_mac_filter_set(struct bonding *bond,
+				      const struct bond_opt_value *newval)
+{
+	int rc = 0;
+	u8 prev = bond->params.mac_filter;

reverse xmas tree

+
+	if (newval->value && bond->params.arp_interval) {

what happens if we set arp_interval after enabling this, the table will be
freed while the bond is up and is using it, also the queued work is still enabled

This is a good observation. To simplify the option setting I moved the 
init/destroy of the hash table to bond_open/close respectively. This 
allowed me to simply set the value of mac_filter. The only catch is in 
bond_option_arp_interval_set() if mac_filter is set and the interface is 
up, the user will receive an -EBUSY. This was the minimal amount of 
configuration behavioral change I could think of.

Thanks,
-Jon