On Tue, Oct 25, 2022 at 9:11 PM Yonghong Song <yhs@xxxxxxxx> wrote:
>
>
>
> On 10/25/22 3:29 PM, Yosry Ahmed wrote:
> > On Tue, Oct 25, 2022 at 2:54 PM Yonghong Song <yhs@xxxxxx> wrote:
> >>
> >> Similar to sk/inode/task storage, implement cgroup local storage.
> >>
> >> There already exists a local storage implementation for cgroup-attached
> >> bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
> >> bpf_get_local_storage(). But there are use cases where non-cgroup-attached
> >> bpf progs want to access cgroup local storage data. For example, a tc
> >> egress prog has access to the sk and the cgroup. It is possible to use
> >> sk local storage to emulate cgroup local storage by storing data in the
> >> socket, but this is wasteful, as there could be many sockets belonging
> >> to a particular cgroup. Alternatively, a separate map can be created
> >> with the cgroup id as the key, but this introduces additional overhead
> >> to manipulate the new map. A cgroup local storage, similar to the
> >> existing sk/inode/task storage, should help for this use case.
> >>
> >> The life-cycle of the storage is tied to the life-cycle of the cgroup
> >> struct, i.e. the storage is destroyed along with the owning cgroup by a
> >> call to bpf_cgrp_storage_free() when the cgroup itself is deleted.
> >>
> >> The userspace map operations can be done by using a cgroup fd as the key
> >> passed to the lookup, update and delete operations.
> >>
> >> Typically, the following code is used to get the current cgroup:
> >>     struct task_struct *task = bpf_get_current_task_btf();
> >>     ... task->cgroups->dfl_cgrp ...
> >> and in the task_struct definition:
> >>     struct task_struct {
> >>         ....
> >>         struct css_set __rcu *cgroups;
> >>         ....
> >>     }
> >> In a sleepable program, accessing task->cgroups is not protected by
> >> rcu_read_lock. So the current implementation only supports non-sleepable
> >> programs; supporting sleepable programs will be the next step, together
> >> with adding rcu_read_lock protection for rcu-tagged structures.
> >>
> >> Since the map name BPF_MAP_TYPE_CGROUP_STORAGE has already been used for
> >> the old cgroup local storage support, the new map name
> >> BPF_MAP_TYPE_CGRP_STORAGE is used for the cgroup storage available to
> >> non-cgroup-attached bpf programs. The old cgroup storage supports the
> >> bpf_get_local_storage() helper to get the cgroup data; the new cgroup
> >> storage helper bpf_cgrp_storage_get() provides similar functionality.
> >> While the old cgroup storage pre-allocates storage memory, the new
> >> mechanism can also pre-allocate via a user space bpf_map_update_elem()
> >> call to avoid a potential run-time memory allocation failure. Therefore,
> >> the new cgroup storage provides all the functionality of the old one.
> >> So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is aliased to
> >> BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate that the old cgroup
> >> storage can be deprecated, since the new one provides the same
> >> functionality.
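For readers following along: based on the description above, the BPF-program-side
usage would look roughly like the sketch below. This is an untested illustration
only; the map, value and program names are made up, and the map definition just
mirrors the style used by the existing sk/task local storage selftests.

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  struct cgrp_val {
          __u64 cnt;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, struct cgrp_val);
  } cgrp_map SEC(".maps");

  /* non-sleepable tracing prog: count events per cgroup of the current task */
  SEC("tp_btf/sys_enter")
  int BPF_PROG(count_sys_enter)
  {
          struct task_struct *task = bpf_get_current_task_btf();
          struct cgrp_val *v;

          /* dfl_cgrp is the task's cgroup on the default (v2) hierarchy */
          v = bpf_cgrp_storage_get(&cgrp_map, task->cgroups->dfl_cgrp, 0,
                                   BPF_LOCAL_STORAGE_GET_F_CREATE);
          if (v)
                  __sync_fetch_and_add(&v->cnt, 1);
          return 0;
  }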
> >>
> >> Acked-by: David Vernet <void@xxxxxxxxxxxxx>
> >> Signed-off-by: Yonghong Song <yhs@xxxxxx>
> >> ---
> >>  include/linux/bpf.h            |   7 +
> >>  include/linux/bpf_types.h      |   1 +
> >>  include/linux/cgroup-defs.h    |   4 +
> >>  include/uapi/linux/bpf.h       |  50 ++++++-
> >>  kernel/bpf/Makefile            |   2 +-
> >>  kernel/bpf/bpf_cgrp_storage.c  | 247 +++++++++++++++++++++++++++++++++
> >>  kernel/bpf/helpers.c           |   6 +
> >>  kernel/bpf/syscall.c           |   3 +-
> >>  kernel/bpf/verifier.c          |  13 +-
> >>  kernel/cgroup/cgroup.c         |   1 +
> >>  kernel/trace/bpf_trace.c       |   4 +
> >>  scripts/bpf_doc.py             |   2 +
> >>  tools/include/uapi/linux/bpf.h |  50 ++++++-
> >>  13 files changed, 385 insertions(+), 5 deletions(-)
> >>  create mode 100644 kernel/bpf/bpf_cgrp_storage.c
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index 9e7d46d16032..0fa3b4f6e777 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -2045,6 +2045,7 @@ struct bpf_link *bpf_link_by_id(u32 id);
> >>
> >>  const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id);
> >>  void bpf_task_storage_free(struct task_struct *task);
> >> +void bpf_cgrp_storage_free(struct cgroup *cgroup);
> >>  bool bpf_prog_has_kfunc_call(const struct bpf_prog *prog);
> >>  const struct btf_func_model *
> >>  bpf_jit_find_kfunc_model(const struct bpf_prog *prog,
> >> @@ -2299,6 +2300,10 @@ static inline bool has_current_bpf_ctx(void)
> >>  static inline void bpf_prog_inc_misses_counter(struct bpf_prog *prog)
> >>  {
> >>  }
> >> +
> >> +static inline void bpf_cgrp_storage_free(struct cgroup *cgroup)
> >> +{
> >> +}
> >>  #endif /* CONFIG_BPF_SYSCALL */
> >>
> >>  void __bpf_free_used_btfs(struct bpf_prog_aux *aux,
> >> @@ -2537,6 +2542,8 @@ extern const struct bpf_func_proto bpf_copy_from_user_task_proto;
> >>  extern const struct bpf_func_proto bpf_set_retval_proto;
> >>  extern const struct bpf_func_proto bpf_get_retval_proto;
> >>  extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
> >> +extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
> >> +extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
> >>
> >>  const struct bpf_func_proto *tracing_prog_func_proto(
> >>          enum bpf_func_id func_id, const struct bpf_prog *prog);
> >> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> >> index 2c6a4f2562a7..d4ee3ccd3753 100644
> >> --- a/include/linux/bpf_types.h
> >> +++ b/include/linux/bpf_types.h
> >> @@ -86,6 +86,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_PROG_ARRAY, prog_array_map_ops)
> >>  BPF_MAP_TYPE(BPF_MAP_TYPE_PERF_EVENT_ARRAY, perf_event_array_map_ops)
> >>  #ifdef CONFIG_CGROUPS
> >>  BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
> >> +BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
> >>  #endif
> >>  #ifdef CONFIG_CGROUP_BPF
> >>  BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
> >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> >> index 8f481d1b159a..c466fdc3a32a 100644
> >> --- a/include/linux/cgroup-defs.h
> >> +++ b/include/linux/cgroup-defs.h
> >> @@ -504,6 +504,10 @@ struct cgroup {
> >>          /* Used to store internal freezer state */
> >>          struct cgroup_freezer_state freezer;
> >>
> >> +#ifdef CONFIG_BPF_SYSCALL
> >> +        struct bpf_local_storage __rcu *bpf_cgrp_storage;
> >> +#endif
> >> +
> >>          /* All ancestors including self */
> >>          struct cgroup *ancestors[];
> >>  };
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index 17f61338f8f8..94659f6b3395 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -922,7 +922,14 @@ enum bpf_map_type {
> >>          BPF_MAP_TYPE_CPUMAP,
> >>          BPF_MAP_TYPE_XSKMAP,
> >>          BPF_MAP_TYPE_SOCKHASH,
> >> -        BPF_MAP_TYPE_CGROUP_STORAGE,
> >> +        BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED,
> >> +        /* BPF_MAP_TYPE_CGROUP_STORAGE is available to bpf programs attaching
> >> +         * to a cgroup. The newer BPF_MAP_TYPE_CGRP_STORAGE is available to
> >> +         * both cgroup-attached and other progs and supports all functionality
> >> +         * provided by BPF_MAP_TYPE_CGROUP_STORAGE. So mark
> >> +         * BPF_MAP_TYPE_CGROUP_STORAGE deprecated.
> >> +         */
> >> +        BPF_MAP_TYPE_CGROUP_STORAGE = BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED,
> >>          BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
> >>          BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
> >>          BPF_MAP_TYPE_QUEUE,
> >> @@ -935,6 +942,7 @@ enum bpf_map_type {
> >>          BPF_MAP_TYPE_TASK_STORAGE,
> >>          BPF_MAP_TYPE_BLOOM_FILTER,
> >>          BPF_MAP_TYPE_USER_RINGBUF,
> >> +        BPF_MAP_TYPE_CGRP_STORAGE,
> >>  };
> >>
> >>  /* Note that tracing related programs such as
> >> @@ -5435,6 +5443,44 @@ union bpf_attr {
> >>  *              **-E2BIG** if user-space has tried to publish a sample which is
> >>  *              larger than the size of the ring buffer, or which cannot fit
> >>  *              within a struct bpf_dynptr.
> >> + *
> >> + * void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> >> + *      Description
> >> + *              Get a bpf_local_storage from the *cgroup*.
> >> + *
> >> + *              Logically, it could be thought of as getting the value from
> >> + *              a *map* with *cgroup* as the **key**. From this
> >> + *              perspective, the usage is not much different from
> >> + *              **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> >> + *              helper enforces the key must be a cgroup struct and the map must also
> >> + *              be a **BPF_MAP_TYPE_CGRP_STORAGE**.
> >> + *
> >> + *              In reality, the local-storage value is embedded directly inside of the
> >> + *              *cgroup* object itself, rather than being located in the
> >> + *              **BPF_MAP_TYPE_CGRP_STORAGE** map. When the local-storage value is
> >> + *              queried for some *map* on a *cgroup* object, the kernel will perform an
> >> + *              O(n) iteration over all of the live local-storage values for that
> >> + *              *cgroup* object until the local-storage value for the *map* is found.
> >> + *
> >> + *              An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
> >> + *              used such that a new bpf_local_storage will be
> >> + *              created if one does not exist. *value* can be used
> >> + *              together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
> >> + *              the initial value of a bpf_local_storage. If *value* is
> >> + *              **NULL**, the new bpf_local_storage will be zero initialized.
> >> + *      Return
> >> + *              A bpf_local_storage pointer is returned on success.
> >> + *
> >> + *              **NULL** if not found or there was an error in adding
> >> + *              a new bpf_local_storage.
> >> + *
> >> + * long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
> >> + *      Description
> >> + *              Delete a bpf_local_storage from a *cgroup*.
> >> + *      Return
> >> + *              0 on success.
> >> + *
> >> + *              **-ENOENT** if the bpf_local_storage cannot be found.
> >> + */
> >>  #define ___BPF_FUNC_MAPPER(FN, ctx...)                  \
> >>          FN(unspec, 0, ##ctx)                            \
> >> @@ -5647,6 +5693,8 @@ union bpf_attr {
> >>          FN(tcp_raw_check_syncookie_ipv6, 207, ##ctx)    \
> >>          FN(ktime_get_tai_ns, 208, ##ctx)                \
> >>          FN(user_ringbuf_drain, 209, ##ctx)              \
> >> +        FN(cgrp_storage_get, 210, ##ctx)                \
> >> +        FN(cgrp_storage_delete, 211, ##ctx)             \
> >>  /* */
> >>
> >>  /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
> >> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> >> index 341c94f208f4..3a12e6b400a2 100644
> >> --- a/kernel/bpf/Makefile
> >> +++ b/kernel/bpf/Makefile
> >> @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
> >>  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> >>  endif
> >>  ifeq ($(CONFIG_CGROUPS),y)
> >> -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
> >> +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgrp_storage.o
> >>  endif
> >>  obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> >>  ifeq ($(CONFIG_INET),y)
> >> diff --git a/kernel/bpf/bpf_cgrp_storage.c b/kernel/bpf/bpf_cgrp_storage.c
> >> new file mode 100644
> >> index 000000000000..309403800f82
> >> --- /dev/null
> >> +++ b/kernel/bpf/bpf_cgrp_storage.c
> >> @@ -0,0 +1,247 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> >> + */
> >> +
> >> +#include <linux/types.h>
> >> +#include <linux/bpf.h>
> >> +#include <linux/bpf_local_storage.h>
> >> +#include <uapi/linux/btf.h>
> >> +#include <linux/btf_ids.h>
> >> +
> >> +DEFINE_BPF_STORAGE_CACHE(cgroup_cache);
> >> +
> >> +static DEFINE_PER_CPU(int, bpf_cgrp_storage_busy);
> >> +
> >> +static void bpf_cgrp_storage_lock(void)
> >> +{
> >> +        migrate_disable();
> >> +        this_cpu_inc(bpf_cgrp_storage_busy);
> >> +}
> >> +
> >> +static void bpf_cgrp_storage_unlock(void)
> >> +{
> >> +        this_cpu_dec(bpf_cgrp_storage_busy);
> >> +        migrate_enable();
> >> +}
> >> +
> >> +static bool bpf_cgrp_storage_trylock(void)
> >> +{
> >> +        migrate_disable();
> >> +        if (unlikely(this_cpu_inc_return(bpf_cgrp_storage_busy) != 1)) {
> >> +                this_cpu_dec(bpf_cgrp_storage_busy);
> >> +                migrate_enable();
> >> +                return false;
> >> +        }
> >> +        return true;
> >> +}
> >> +
> >> +static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner)
> >> +{
> >> +        struct cgroup *cg = owner;
> >> +
> >> +        return &cg->bpf_cgrp_storage;
> >> +}
> >> +
> >> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
> >> +{
> >> +        struct bpf_local_storage *local_storage;
> >> +        bool free_cgroup_storage = false;
> >> +        unsigned long flags;
> >> +
> >> +        rcu_read_lock();
> >> +        local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
> >> +        if (!local_storage) {
> >> +                rcu_read_unlock();
> >> +                return;
> >> +        }
> >> +
> >> +        bpf_cgrp_storage_lock();
> >> +        raw_spin_lock_irqsave(&local_storage->lock, flags);
> >> +        free_cgroup_storage = bpf_local_storage_unlink_nolock(local_storage);
> >> +        raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> >> +        bpf_cgrp_storage_unlock();
> >> +        rcu_read_unlock();
> >> +
> >> +        if (free_cgroup_storage)
> >> +                kfree_rcu(local_storage, rcu);
> >> +}
> >> +
> >> +static struct bpf_local_storage_data *
> >> +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
> >> +{
> >> +        struct bpf_local_storage *cgroup_storage;
> >> +        struct bpf_local_storage_map *smap;
> >> +
> >> +        cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
> >> +                                               bpf_rcu_lock_held());
> >> +        if (!cgroup_storage)
> >> +                return NULL;
> >> +
> >> +        smap = (struct bpf_local_storage_map *)map;
> >> +        return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
> >> +}
> >> +
> >> +static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
> >> +{
> >> +        struct bpf_local_storage_data *sdata;
> >> +        struct cgroup *cgroup;
> >> +        int fd;
> >> +
> >> +        fd = *(int *)key;
> >> +        cgroup = cgroup_get_from_fd(fd);
> >
> > Sorry I didn't notice this before, but is there a reason why only
> > cgroup v2 is supported here?
> >
> > Can we also support cgroup v1 by using cgroup_v1v2_get_from_fd()
> > instead, similar to cgroup_iter? Or is there something else in the
> > implementation that is cgroup v2 specific?
> I can do that, but cgroup_v1v2_get_from_fd() is not in bpf-next now.
> I guess we can either wait for it, if it can be merged into bpf-next
> soon, or do it as a follow-up.

It's in Linus's tree; I honestly don't know how often Linus's tree gets
merged into bpf-next.

>
> >
> >> +        if (IS_ERR(cgroup))
> >> +                return ERR_CAST(cgroup);
> >> +
> >> +        bpf_cgrp_storage_lock();
> >> +        sdata = cgroup_storage_lookup(cgroup, map, true);
> >> +        bpf_cgrp_storage_unlock();
> >> +        cgroup_put(cgroup);
> >> +        return sdata ? sdata->data : NULL;
> >> +}
> >> +
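Also, just to make the fd-as-key interface in bpf_cgrp_storage_lookup_elem()
above concrete: the userspace side would be exercised roughly as in the
untested sketch below. The cgroup path, value struct and map-fd plumbing are
illustrative only; libbpf's bpf_map_update_elem()/bpf_map_lookup_elem()
wrappers are assumed.

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <linux/types.h>
  #include <bpf/bpf.h>

  /* must match the value type of the BPF_MAP_TYPE_CGRP_STORAGE map */
  struct cgrp_val {
          __u64 cnt;
  };

  static int dump_cgrp_counter(int map_fd, const char *cgrp_path)
  {
          struct cgrp_val zero = {}, val = {};
          int cg_fd, err;

          /* a cgroup directory fd serves as the key for lookup/update/delete */
          cg_fd = open(cgrp_path, O_RDONLY);
          if (cg_fd < 0)
                  return -1;

          /* optionally pre-create the storage from userspace so the bpf prog
           * does not hit a run-time allocation, as the commit message notes;
           * the update fails harmlessly if the storage already exists
           */
          bpf_map_update_elem(map_fd, &cg_fd, &zero, BPF_NOEXIST);

          err = bpf_map_lookup_elem(map_fd, &cg_fd, &val);
          if (!err)
                  printf("%s: cnt=%llu\n", cgrp_path,
                         (unsigned long long)val.cnt);

          close(cg_fd);
          return err;
  }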