Re: [PATCH net-next v2] bond: add mac filter option for balance-xor

Nikolay Aleksandrov <razor@xxxxxxxxxxxxx> · Fri, 13 May 2022 20:35:26 +0300

On 13/05/2022 19:59, Jay Vosburgh wrote:
> Nikolay Aleksandrov <razor@xxxxxxxxxxxxx> wrote:
> 
>> On 13/05/2022 18:42, Jonathan Toppins wrote:
>>> Hi Nik, thanks for the review. Responses below.
>>>
>>> On 5/5/22 08:14, Nikolay Aleksandrov wrote:
>>>> On 04/05/2022 21:47, Jonathan Toppins wrote:
>>>>> Implement a MAC filter that prevents duplicate frame delivery when
>>>>> handling BUM traffic. This attempts to partially replicate OvS SLB
>>>>> Bonding[1] like functionality without requiring significant change
>>>>> in the Linux bridging code.
>>>>>
>>>>> A typical network setup for this feature would be:
>>>>>
>>>>>              .--------------------------------------------.
>>>>>              |         .--------------------.             |
>>>>>              |         |                    |             |
>>>>>         .-------------------.               |             |
>>>>>         |    | Bond 0  |    |               |             |
>>>>>         | .--'---. .---'--. |               |             |
>>>>>    .----|-| eth0 |-| eth1 |-|----.    .-----+----.   .----+------.
>>>>>    |    | '------' '------' |    |    | Switch 1 |   | Switch 2  |
>>>>>    |    '---,---------------'    |    |          +---+           |
>>>>>    |       /                     |    '----+-----'   '----+------'
>>>>>    |  .---'---.    .------.      |         |              |
>>>>>    |  |  br0  |----| VM 1 |      |      ~~~~~~~~~~~~~~~~~~~~~
>>>>>    |  '-------'    '------'      |     (                     )
>>>>>    |      |        .------.      |     ( Rest of Network     )
>>>>>    |      '--------| VM # |      |     (_____________________)
>>>>>    |               '------'      |
>>>>>    |  Host 1                     |
>>>>>    '-----------------------------'
>>>>>
>>>>> Where 'VM1' and 'VM#' are hosts connected to a Linux bridge, br0, with
>>>>> bond0 and its associated links, eth0 & eth1, provide ingress/egress. One
>>>>> can assume bond0, br1, and hosts VM1 to VM# are all contained in a
>>>>> single box, as depicted. Interfaces eth0 and eth1 provide redundant
>>>>> connections to the data center with the requirement to use all bandwidth
>>>>> when the system is functioning normally. Switch 1 and Switch 2 are
>>>>> physical switches that do not implement any advanced L2 management
>>>>> features such as MLAG, Cisco's VPC, or LACP.
>>>>>
>>>>> Combining this feature with vlan+srcmac hash policy allows a user to
>>>>> create an access network without the need to use expensive switches that
>>>>> support features like Cisco's VCP.
>>>>>
>>>>> [1] https://docs.openvswitch.org/en/latest/topics/bonding/#slb-bonding
>>>>>
>>>>> Co-developed-by: Long Xin <lxin@xxxxxxxxxx>
>>>>> Signed-off-by: Long Xin <lxin@xxxxxxxxxx>
>>>>> Signed-off-by: Jonathan Toppins <jtoppins@xxxxxxxxxx>
>>>>> ---
>>>>>
>>>>> Notes:
>>>>>      v2:
>>>>>       * dropped needless abstraction functions and put code in module init
>>>>>       * renamed variable "rc" to "ret" to stay consistent with most of the
>>>>>         code
>>>>>       * fixed parameter setting management, when arp-monitor is turned on
>>>>>         this feature will be turned off similar to how miimon and arp-monitor
>>>>>         interact
>>>>>       * renamed bond_xor_recv to bond_mac_filter_recv for a little more
>>>>>         clarity
>>>>>       * it appears the implied default return code for any bonding recv probe
>>>>>         must be `RX_HANDLER_ANOTHER`. Changed the default return code of
>>>>>         bond_mac_filter_recv to use this return value to not break skb
>>>>>         processing when the skb dev is switched to the bond dev:
>>>>>           `skb->dev = bond->dev`
>>>>>
>>>>>   Documentation/networking/bonding.rst  |  19 +++
>>>>>   drivers/net/bonding/Makefile          |   2 +-
>>>>>   drivers/net/bonding/bond_mac_filter.c | 201 ++++++++++++++++++++++++++
>>>>>   drivers/net/bonding/bond_mac_filter.h |  39 +++++
>>>>>   drivers/net/bonding/bond_main.c       |  27 ++++
>>>>>   drivers/net/bonding/bond_netlink.c    |  13 ++
>>>>>   drivers/net/bonding/bond_options.c    |  86 ++++++++++-
>>>>>   drivers/net/bonding/bonding_priv.h    |   1 +
>>>>>   include/net/bond_options.h            |   1 +
>>>>>   include/net/bonding.h                 |   3 +
>>>>>   include/uapi/linux/if_link.h          |   1 +
>>>>>   11 files changed, 390 insertions(+), 3 deletions(-)
>>>>>   create mode 100644 drivers/net/bonding/bond_mac_filter.c
>>>>>   create mode 100644 drivers/net/bonding/bond_mac_filter.h
>>>>>
>>>>
>>>> Hi Jonathan,
>>>> I must mention that this is easily solvable with two very simple ebpf programs, one on egress
>>>> to track source macs and one on ingress to filter them, it can also easily be solved by a
>>>> user-space agent that adds macs for filtering in many different ways, after all these VMs
>>>> run on the host and you don't need bond-specific knowledge to do this. Also you have no visibility
>>>> into what is currently being filtered, so it will be difficult to debug. With the above solutions
>>>> you already have that. I don't think the bond should be doing any learning or filtering, this is
>>>> deviating a lot from its purpose and adds unnecessary complexity.
>>>> That being said, if you decide to continue with the set, comments are below...
>>>
>>> This is an excellent observation, it does appear this could likely be done with eBPF. However, the delivery of such a solution to a user would be the difficult part. There appears to be no standard way for attaching a program to an interface, it still seems customary to write your own custom loader. Where would the user run this loader? In Debian likely in a post up hook with ifupdown, in Fedora one would have to write a locally custom dispatcher script (assuming Network Manager) that only ran the loader for a given interface. In short I do not see a reasonably appropriate way to deploy an eBPF program to users with the current infrastructure. Also, I am not aware of the bpf syscall supporting signed program loading. Signing kernel modules seems popular with some distros to identify limits of support and authentication of an unmodified system. I suspect similar bpf support might be needed to identify support and authentication for deployed programs.
>>>
>>
>> A great deal of the distributions (almost all major ones) out there
>> already use eBPF for various tasks, so I can't see how any of these
>> arguments apply nowadays. There are standard ways to load eBPF programs
>> that have been around for quite some time and most of the different
>> software needed to achieve that is already packaged for all major
>> distributions (and has been for a long time). Anyway getting into the
>> details of "how" the user would load the program is not really
>> pertinent to the discussion, that doesn't warrant adding so much new
>> complexity in the bonding driver which will have to be maintained
>> forever.
> 
> 	While I agree in principle that various bonding things could and
> perhaps should be done with eBPF, I think the "how" is pertinent; you
> (Nik) say that there are "standard ways" that are "packaged for all
> major distributions"; what exactly are those?
> 

All distributions have iproute2 packages (tc can be used to load the programs).
Many of them already have bpftool as well.

>> [...] Honestly, I don't like the idea of adding learning to the
>> bonding at all, I think it's the wrong place for it, especially when
>> the solution can easily be achieved with already available means.  It
>> might not even be eBPF, you can do it with a user-space agent that uses
>> nftables or some other filtering mechanism, I'm sure you can think of
>> many other ways to solve it which don't require this new
>> infrastructure. All of these ways to solve it have many advantages over
>> this (e.g. visibility into the current entries being filtered, control
>> over them and so on).
> 
> 	What user-space agent would that be?  You're proposing that
> someone would need to create the infrastructure for this themselves,
> correct?
> 

I meant it as a new solution, not an existing one. The problem could easily
be solved either with a new user-space agent that adds those filtering entries dynamically,
or through an eBPF program loaded on the interface. Point was that it doesn't require
anything bonding-specific to solve it or even anything new to be added to the kernel which
will have to be maintained forever. Also, as I already mentioned, there is no visibility or
control (except on/off) over the entries which you'll get "for free" with any of the other
solutions.

> 	I'm not really a huge fan of adding random functionality to
> bonding that is a copy of something easily available elsewhere.  This is
> basically a feature copy from OVS, and my concern is more along the
> lines of this ending up like the alb mode, which is overly complex and
> usually not the best choice.  I'll caveat that by saying that I've not
> studied the implementation here in detail, and will look for the v3 for
> review.
> 
> 	-J
> 
>> That's my opinion of course, it'd be nice to get feedback from others as well.
>>
>> Cheers,
>> Nik
>>
>>> [...]
>>>
>>>>> diff --git a/drivers/net/bonding/bond_mac_filter.c b/drivers/net/bonding/bond_mac_filter.c
>>>>> new file mode 100644
>>>>> index 000000000000..e86b2b475df3
>>>>> --- /dev/null
>>>>> +++ b/drivers/net/bonding/bond_mac_filter.c
>>>>> @@ -0,0 +1,201 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>>> +/*
>>>>> + * Filter received frames based on MAC addresses "behind" the bond.
>>>>> + */
>>>>> +
>>>>> +#include "bonding_priv.h"
>>>>> +
>>>>> +static const struct rhashtable_params bond_rht_params = {
>>>>> +    .head_offset         = offsetof(struct bond_mac_cache_entry, rhnode),
>>>>> +    .key_offset          = offsetof(struct bond_mac_cache_entry, key),
>>>>> +    .key_len             = sizeof(struct mac_addr),
>>>>> +    .automatic_shrinking = true,
>>>>> +};
>>>>> +
>>>>> +static inline unsigned long hold_time(const struct bonding *bond)
>>>>
>>>> no inlines in .c files, let the compiler do its job
>>>>
>>>>> +{
>>>>> +    return msecs_to_jiffies(5000);
>>>>> +}
>>>>> +
>>>>> +static bool has_expired(const struct bonding *bond,
>>>>> +            struct bond_mac_cache_entry *mac)
>>>>> +{
>>>>> +    return time_before_eq(mac->used + hold_time(bond), jiffies);
>>>>> +}
>>>>> +
>>>>> +static void mac_delete_rcu(struct callback_head *head)
>>>>> +{
>>>>> +    kmem_cache_free(bond_mac_cache,
>>>>> +            container_of(head, struct bond_mac_cache_entry, rcu));
>>>>> +}
>>>>> +
>>>>> +static int mac_delete(struct bonding *bond,
>>>>> +              struct bond_mac_cache_entry *entry)
>>>>> +{
>>>>> +    int ret;
>>>>> +
>>>>> +    ret = rhashtable_remove_fast(bond->mac_filter_tbl,
>>>>> +                     &entry->rhnode,
>>>>> +                     bond->mac_filter_tbl->p);
>>>>> +    set_bit(BOND_MAC_DEAD, &entry->flags);
>>>>
>>>> you don't need the atomic bitops, these flags are all modified and checked
>>>> under the entry lock
>>>
>>> I need to keep the atomic set_bit if I remove the [use-after-free] idiomatic issue later in the file.
>>>
>>>>
>>>>> +    call_rcu(&entry->rcu, mac_delete_rcu);
>>>>
>>>> all of these entries are queued to be freed, what happens if we unload the bonding
>>>> driver before that?
>>>
>>> [...]
>>>
>>>>
>>>>> +
>>>>> +    rhashtable_walk_enter(bond->mac_filter_tbl, &iter);
>>>>> +    rhashtable_walk_start(&iter);
>>>>> +    while ((entry = rhashtable_walk_next(&iter)) != NULL) {
>>>>> +        if (IS_ERR(entry))
>>>>> +            continue;
>>>>> +
>>>>> +        spin_lock_irqsave(&entry->lock, flags);
>>>>> +        if (has_expired(bond, entry))
>>>>> +            mac_delete(bond, entry);
>>>>> +        spin_unlock_irqrestore(&entry->lock, flags);
>>>>
>>>> deleting entries while holding their own lock is not very idiomatic
>>>
>>> [use-after-free] To fix this I made has_expired take the lock, making has_expired atomic. Now there is no need to have the critical section above and mac_delete can be outside the critical section. This also removed the use-after-free bug that would appear if the code were not using RCU and cache malloc.
>>>
>>>>
>>>>> +    bond->mac_filter_tbl = kzalloc(sizeof(*bond->mac_filter_tbl),
>>>>> +                       GFP_KERNEL);
>>>>> +    if (!bond->mac_filter_tbl)
>>>>> +        return -ENOMEM;
>>>>> +
>>>>> +    ret = rhashtable_init(bond->mac_filter_tbl, &bond_rht_params);
>>>>> +    if (ret)
>>>>> +        kfree(bond->mac_filter_tbl);
>>>>
>>>> on error this is freed, but the pointer is stale and on bond destruction
>>>> will be accessed and potentially freed again
>>>
>>> set to NULL.
>>>
>>> [...]
>>>
>>>>> +static int mac_create(struct bonding *bond, const u8 *addr)
>>>>> +{
>>>>> +    struct bond_mac_cache_entry *entry;
>>>>> +    int ret;
>>>>> +
>>>>> +    entry = kmem_cache_alloc(bond_mac_cache, GFP_ATOMIC);
>>>>> +    if (!entry)
>>>>> +        return -ENOMEM;
>>>>> +    spin_lock_init(&entry->lock);
>>>>> +    memcpy(&entry->key, addr, sizeof(entry->key));
>>>>> +    entry->used = jiffies;
>>>>
>>>> you must zero the old fields, otherwise you can find stale values from old
>>>> structs which were freed
>>>
>>> good point, have done.
>>>
>>> [...]
>>>
>>>>> diff --git a/drivers/net/bonding/bond_mac_filter.h b/drivers/net/bonding/bond_mac_filter.h
>>>>> new file mode 100644
>>>>> index 000000000000..7c968d41b456
>>>>> --- /dev/null
>>>>> +++ b/drivers/net/bonding/bond_mac_filter.h
>>>>> @@ -0,0 +1,39 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0-only
>>>>> + *
>>>>> + * Filter received frames based on MAC addresses "behind" the bond.
>>>>> + */
>>>>> +
>>>>> +#ifndef _BOND_MAC_FILTER_H
>>>>> +#define _BOND_MAC_FILTER_H
>>>>> +#include <net/bonding.h>
>>>>> +#include <linux/spinlock.h>
>>>>> +#include <linux/rhashtable.h>
>>>>> +
>>>>> +enum {
>>>>> +    BOND_MAC_DEAD,
>>>>> +    BOND_MAC_LOCKED,
>>>>> +    BOND_MAC_STATIC,
>>>>
>>>> What are BOND_MAC_LOCKED or STATIC ? I didn't see them used anywhere.
>>>
>>> Stale, was going to use them to allow the user to manually add entries but never got around to it. Removed.
>>>
>>> [...]
>>>
>>>>> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
>>>>> index 64f7db2627ce..d295903a525b 100644
>>>>> --- a/drivers/net/bonding/bond_options.c
>>>>> +++ b/drivers/net/bonding/bond_options.c
>>>
>>> [...]
>>>
>>>>> @@ -1035,6 +1075,44 @@ static int bond_option_use_carrier_set(struct bonding *bond,
>>>>>       return 0;
>>>>>   }
>>>>>   +static int bond_option_mac_filter_set(struct bonding *bond,
>>>>> +                      const struct bond_opt_value *newval)
>>>>> +{
>>>>> +    int rc = 0;
>>>>> +    u8 prev = bond->params.mac_filter;
>>>>
>>>> reverse xmas tree
>>>>
>>>>> +
>>>>> +    if (newval->value && bond->params.arp_interval) {
>>>>
>>>> what happens if we set arp_interval after enabling this, the table will be
>>>> freed while the bond is up and is using it, also the queued work is still enabled
>>>
>>> This is a good observation. To simplify the option setting I moved the init/destroy of the hash table to bond_open/close respectively. This allowed me to simply set the value of mac_filter. The only catch is in bond_option_arp_interval_set() if mac_filter is set and the interface is up, the user will receive an -EBUSY. This was the minimal amount of configuration behavioral change I could think of.
>>>
>>> Thanks,
>>> -Jon
>>>
>>
> 
> ---
> 	-Jay Vosburgh, jay.vosburgh@xxxxxxxxxxxxx