Re: [PATCH bpf-next 2/5] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs

On Mon, Oct 17, 2022 at 1:10 PM Yonghong Song <yhs@xxxxxxxx> wrote:
>
>
>
> On 10/17/22 11:25 AM, Yosry Ahmed wrote:
> > On Mon, Oct 17, 2022 at 11:02 AM <sdf@xxxxxxxxxx> wrote:
> >>
> >> On 10/13, Yonghong Song wrote:
> >>> Similar to sk/inode/task storage, implement similar cgroup local storage.
> >>
> >>> There already exists a local storage implementation for cgroup-attached
> >>> bpf programs.  See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> >>> bpf_get_local_storage(). But there are use cases where non-cgroup-attached
> >>> bpf progs want to access cgroup local storage data. For example,
> >>> tc egress prog has access to sk and cgroup. It is possible to use
> >>> sk local storage to emulate cgroup local storage by storing data in
> >>> socket.
> >>> But this is wasteful, as there could be many sockets belonging to a
> >>> particular cgroup. Alternatively, a separate map can be created with
> >>> cgroup id as the key. But this will introduce additional overhead to
> >>> manipulate the new map.
> >>> A cgroup local storage, similar to existing sk/inode/task storage,
> >>> should help for this use case.
> >>
> >>> The life-cycle of the storage is managed with the life-cycle of the
> >>> cgroup struct, i.e., the storage is destroyed along with the owning
> >>> cgroup, with a callback to bpf_cgroup_storage_free() when the cgroup
> >>> itself is deleted.
> >>
> >>> The userspace map operations can be done by using a cgroup fd as a key
> >>> passed to the lookup, update and delete operations.
> >>
> >>
> >> [..]
> >>
> >>> Since the map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for the old
> >>> cgroup local storage support, the new map name
> >>> BPF_MAP_TYPE_CGROUP_LOCAL_STORAGE is used for cgroup storage available
> >>> to non-cgroup-attached bpf programs. The two helpers are named
> >>> bpf_cgroup_local_storage_get() and bpf_cgroup_local_storage_delete().
> >>
> >> Have you considered doing something similar to 7d9c3427894f ("bpf: Make
> >> cgroup storages shared between programs on the same cgroup") where
> >> the map changes its behavior depending on the key size (see key_size checks
> >> in cgroup_storage_map_alloc)? Looks like sizeof(int) for an fd can still
> >> be used, so we could, in theory, reuse the name...
> >>
> >> Pros:
> >> - no need for a new map name
> >>
> >> Cons:
> >> - existing BPF_MAP_TYPE_CGROUP_STORAGE is already messy; might not be a
> >>     good idea to add more stuff to it?
> >>
> >> But, for the very least, should we also extend
> >> Documentation/bpf/map_cgroup_storage.rst to cover the new map? We've
> >> tried to keep some of the important details in there..
> >
> > This might be a long shot, but is it possible to switch completely to
> > this new generic cgroup storage, and for programs that attach to
> > cgroups we can still do lookups/allocations during attachment like we
> > do today? IOW, maintain the current API for cgroup progs but switch it
> > to use this new map type instead.
>
> Right, cgroup attach/detach should not be impacted by this patch.
>
> >
> > It feels like this map type is more generic and can be a superset of
> > the existing cgroup storage, but I feel like I am missing something.
>
> One difference is that the old cgroup local storage allocates memory at
> map creation time, while the new one allocates memory at runtime when
> the get/update helper is called.
>

IIUC the old cgroup local storage allocates memory when a program is
attached. We can have the same behavior with the new map type, right?
When a program is attached to a cgroup, allocate the memory, otherwise
it is allocated at run time. Does this make sense?

> >
> >>
> >>> Signed-off-by: Yonghong Song <yhs@xxxxxx>
> >>> ---
> >>>    include/linux/bpf.h             |   3 +
> >>>    include/linux/bpf_types.h       |   1 +
> >>>    include/linux/cgroup-defs.h     |   4 +
> >>>    include/uapi/linux/bpf.h        |  39 +++++
> >>>    kernel/bpf/Makefile             |   2 +-
> >>>    kernel/bpf/bpf_cgroup_storage.c | 280 ++++++++++++++++++++++++++++++++
> >>>    kernel/bpf/helpers.c            |   6 +
> >>>    kernel/bpf/syscall.c            |   3 +-
> >>>    kernel/bpf/verifier.c           |  14 +-
> >>>    kernel/cgroup/cgroup.c          |   4 +
> >>>    kernel/trace/bpf_trace.c        |   4 +
> >>>    scripts/bpf_doc.py              |   2 +
> >>>    tools/include/uapi/linux/bpf.h  |  39 +++++
> >>>    13 files changed, 398 insertions(+), 3 deletions(-)
> >>>    create mode 100644 kernel/bpf/bpf_cgroup_storage.c
> >>
> [...]
> >>> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> >>> index 341c94f208f4..b02693f51978 100644
> >>> --- a/kernel/bpf/Makefile
> >>> +++ b/kernel/bpf/Makefile
> >>> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
> >>>    obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> >>>    endif
> >>>    ifeq ($(CONFIG_CGROUPS),y)
> >>> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
> >>> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgroup_storage.o
> >>>    endif
> >>>    obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> >>>    ifeq ($(CONFIG_INET),y)
> >>> diff --git a/kernel/bpf/bpf_cgroup_storage.c
> >>> b/kernel/bpf/bpf_cgroup_storage.c
> >>> new file mode 100644
> >>> index 000000000000..9974784822da
> >>> --- /dev/null
> >>> +++ b/kernel/bpf/bpf_cgroup_storage.c
> >>> @@ -0,0 +1,280 @@
> >>> +// SPDX-License-Identifier: GPL-2.0
> >>> +/*
> >>> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> >>> + */
> >>> +
> >>> +#include <linux/types.h>
> >>> +#include <linux/bpf.h>
> >>> +#include <linux/bpf_local_storage.h>
> >>> +#include <uapi/linux/btf.h>
> >>> +#include <linux/btf_ids.h>
> >>> +
> >>> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache);
> >>> +
> >>> +static DEFINE_PER_CPU(int, bpf_cgroup_storage_busy);
> >>> +
> >>> +static void bpf_cgroup_storage_lock(void)
> >>> +{
> >>> +     migrate_disable();
> >>> +     this_cpu_inc(bpf_cgroup_storage_busy);
> >>> +}
> >>> +
> >>> +static void bpf_cgroup_storage_unlock(void)
> >>> +{
> >>> +     this_cpu_dec(bpf_cgroup_storage_busy);
> >>> +     migrate_enable();
> >>> +}
> >>> +
> >>> +static bool bpf_cgroup_storage_trylock(void)
> >>> +{
> >>> +     migrate_disable();
> >>> +     if (unlikely(this_cpu_inc_return(bpf_cgroup_storage_busy) != 1)) {
> >>> +             this_cpu_dec(bpf_cgroup_storage_busy);
> >>> +             migrate_enable();
> >>> +             return false;
> >>> +     }
> >>> +     return true;
> >>> +}
> >>
> >> Task storage has lock/unlock/trylock; inode storage doesn't; why does
> >> cgroup need it as well?
>
> I think so. The new cgroup local storage might be used in fentry/fexit
> programs, which could cause recursion.
>
> >>
> >>> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner)
> >>> +{
> >>> +     struct cgroup *cg = owner;
> >>> +
> >>> +     return &cg->bpf_cgroup_storage;
> >>> +}
> >>> +
> >>> +void bpf_local_cgroup_storage_free(struct cgroup *cgroup)
> >>> +{
> >>> +     struct bpf_local_storage *local_storage;
> >>> +     struct bpf_local_storage_elem *selem;
> >>> +     bool free_cgroup_storage = false;
> >>> +     struct hlist_node *n;
> >>> +     unsigned long flags;
> >>> +
> >>> +     rcu_read_lock();
> >>> +     local_storage = rcu_dereference(cgroup->bpf_cgroup_storage);
> >>> +     if (!local_storage) {
> >>> +             rcu_read_unlock();
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     /* Neither the bpf_prog nor the bpf-map's syscall
> >>> +      * could be modifying the local_storage->list now.
> >>> +      * Thus, no elem can be added-to or deleted-from the
> >>> +      * local_storage->list by the bpf_prog or by the bpf-map's syscall.
> >>> +      *
> >>> +      * It is racing with bpf_local_storage_map_free() alone
> >>> +      * when unlinking elem from the local_storage->list and
> >>> +      * the map's bucket->list.
> >>> +      */
> >>> +     bpf_cgroup_storage_lock();
> >>> +     raw_spin_lock_irqsave(&local_storage->lock, flags);
> >>> +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> >>> +             bpf_selem_unlink_map(selem);
> >>> +             free_cgroup_storage =
> >>> +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> >>> +     }
> >>> +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> >>> +     bpf_cgroup_storage_unlock();
> >>> +     rcu_read_unlock();
> >>> +
> >>> +     /* free_cgroup_storage should always be true as long as
> >>> +      * local_storage->list was non-empty.
> >>> +      */
> >>> +     if (free_cgroup_storage)
> >>> +             kfree_rcu(local_storage, rcu);
> >>> +}
> >>
> >>> +static struct bpf_local_storage_data *
> >>> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool
> >>> cacheit_lockit)
> >>> +{
> >>> +     struct bpf_local_storage *cgroup_storage;
> >>> +     struct bpf_local_storage_map *smap;
> >>> +
> >>> +     cgroup_storage = rcu_dereference_check(cgroup->bpf_cgroup_storage,
> >>> +                                            bpf_rcu_lock_held());
> >>> +     if (!cgroup_storage)
> >>> +             return NULL;
> >>> +
> >>> +     smap = (struct bpf_local_storage_map *)map;
> >>> +     return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
> >>> +}
> >>> +
> >>> +static void *bpf_cgroup_storage_lookup_elem(struct bpf_map *map, void
> >>> *key)
> >>> +{
> >>> +     struct bpf_local_storage_data *sdata;
> >>> +     struct cgroup *cgroup;
> >>> +     int fd;
> >>> +
> >>> +     fd = *(int *)key;
> >>> +     cgroup = cgroup_get_from_fd(fd);
> >>> +     if (IS_ERR(cgroup))
> >>> +             return ERR_CAST(cgroup);
> >>> +
> >>> +     bpf_cgroup_storage_lock();
> >>> +     sdata = cgroup_storage_lookup(cgroup, map, true);
> >>> +     bpf_cgroup_storage_unlock();
> >>> +     cgroup_put(cgroup);
> >>> +     return sdata ? sdata->data : NULL;
> >>> +}
> >>
> >> A lot of the above (free/lookup) seems to be copy-pasted from the task
> >> storage;
> >> any point in trying to generalize the common parts?
>
> That is true. Let me think about this.
>
> >>
> >>> +static int bpf_cgroup_storage_update_elem(struct bpf_map *map, void *key,
> >>> +                                       void *value, u64 map_flags)
> >>> +{
> >>> +     struct bpf_local_storage_data *sdata;
> >>> +     struct cgroup *cgroup;
> >>> +     int err, fd;
> >>> +
> >>> +     fd = *(int *)key;
> >>> +     cgroup = cgroup_get_from_fd(fd);
> >>> +     if (IS_ERR(cgroup))
> >>> +             return PTR_ERR(cgroup);
> >>> +
> >>> +     bpf_cgroup_storage_lock();
> >>> +     sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map
> >>> *)map,
> >>> +                                      value, map_flags, GFP_ATOMIC);
> >>> +     bpf_cgroup_storage_unlock();
> >>> +     err = PTR_ERR_OR_ZERO(sdata);
> >>> +     cgroup_put(cgroup);
> >>> +     return err;
> >>> +}
> >>> +
> [...]
> >>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> >>> index 8ad2c267ff47..2fa2c950c7fb 100644
> >>> --- a/kernel/cgroup/cgroup.c
> >>> +++ b/kernel/cgroup/cgroup.c
> >>> @@ -985,6 +985,10 @@ void put_css_set_locked(struct css_set *cset)
> >>>                put_css_set_locked(cset->dom_cset);
> >>>        }
> >>
> >>> +#ifdef CONFIG_BPF_SYSCALL
> >>> +     bpf_local_cgroup_storage_free(cset->dfl_cgrp);
> >>> +#endif
> >>> +
> >
> > I am confused about this freeing site. It seems like this path is for
> > freeing css_set's of task_structs, not for freeing the cgroup itself.
> > Wouldn't we want to free the local storage when we free the cgroup
> > itself? Somewhere like css_free_rwork_fn()? or did I completely miss
> > the point here?
>
> Thanks for the suggestion. To be honest, I am not sure whether this
> location is correct. I will look at css_free_rwork_fn(), which might
> be a better place.
>
> >
> >>>        kfree_rcu(cset, rcu_head);
> >>>    }
> >>
> [...]
