On Fri, Oct 21, 2022 at 10:33:41AM -0700, Yonghong Song wrote:

[...]

> > >   /* Note that tracing related programs such as
> > > @@ -5435,6 +5443,42 @@ union bpf_attr {
> > >  *              **-E2BIG** if user-space has tried to publish a sample which is
> > >  *              larger than the size of the ring buffer, or which cannot fit
> > >  *              within a struct bpf_dynptr.
> > > + *
> > > + * void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> > > + *     Description
> > > + *             Get a bpf_local_storage from the *cgroup*.
> > > + *
> > > + *             Logically, it could be thought of as getting the value from
> > > + *             a *map* with *cgroup* as the **key**. From this
> > > + *             perspective, the usage is not much different from
> > > + *             **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> > > + *             helper enforces the key must be a cgroup struct and the map must also
> > > + *             be a **BPF_MAP_TYPE_CGRP_STORAGE**.
> > > + *
> > > + *             Underneath, the value is stored locally at *cgroup* instead of
> > > + *             the *map*. The *map* is used as the bpf-local-storage
> > > + *             "type". The bpf-local-storage "type" (i.e. the *map*) is
> > > + *             searched against all bpf_local_storage residing at *cgroup*.
> >
> > IMO this paragraph is a bit hard to parse. Please correct me if I'm
> > wrong, but I think what it's trying to convey is that when an instance
> > of cgroup bpf-local-storage is accessed by a program in e.g.
> > bpf_cgrp_storage_get(), all of the cgroup bpf_local_storage entries are
> > iterated over in the struct cgroup object until this program's local
> > storage instance is found. Is that right? If so, perhaps something like
> > this would be more clear:
>
> yes. your above interpretation is correct.
>
> >
> > In reality, the local-storage value is embedded directly inside of the
> > *cgroup* object itself, rather than being located in the
> > **BPF_MAP_TYPE_CGRP_STORAGE** map. When the local-storage value is
> > queried for some *map* on a *cgroup* object, the kernel will perform an
> > O(n) iteration over all of the live local-storage values for that
> > *cgroup* object until the local-storage value for the *map* is found.
>
> Sounds okay. I can change the explanation like the above. Thanks!
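
As an aside, the "map lookup keyed by the cgroup" framing is probably
easiest to see from the program side. Purely for illustration, usage of the
new map type ends up looking roughly like the sketch below; this is modeled
on the existing task-local-storage selftests, and the map name, program
name, and attach point are made up here rather than taken from this patch:

  /* Illustrative sketch only -- loosely follows the task-storage selftest style. */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  struct {
          __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, long);
  } cgrp_fork_cnt SEC(".maps"); /* hypothetical map name */

  SEC("tp_btf/sched_process_fork")
  int BPF_PROG(count_fork, struct task_struct *parent, struct task_struct *child)
  {
          long *cnt;

          /* Conceptually a lookup keyed by the parent's cgroup; the value
           * itself lives on the struct cgroup, not in the map.
           */
          cnt = bpf_cgrp_storage_get(&cgrp_fork_cnt, parent->cgroups->dfl_cgrp,
                                     0, BPF_LOCAL_STORAGE_GET_F_CREATE);
          if (cnt)
                  __sync_fetch_and_add(cnt, 1);
          return 0;
  }

From the program's point of view it really does read like a map lookup
keyed by the cgroup, which is what the reworded description above is trying
to capture.
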
> > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > > index 341c94f208f4..3a12e6b400a2 100644
> > > --- a/kernel/bpf/Makefile
> > > +++ b/kernel/bpf/Makefile
> > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
> > >  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> > >  endif
> > >  ifeq ($(CONFIG_CGROUPS),y)
> >
> > I assume that you double checked that it's valid to compile the helper
> > with CONFIG_CGROUPS && !CONFIG_CGROUP_BPF, but I must admit that even if
> > that's the case, I'm not following why we would want the map to be
> > compiled with a different kconfig option than the helper that provides
> > access to it. If there's a precedent for doing this then I suppose it's
> > fine, but it does seem wrong and/or at least wasteful to compile these
> > helpers in if CONFIG_CGROUPS is defined but CONFIG_CGROUP_BPF is not.
>
> The following is my understanding.
> CONFIG_CGROUP_BPF guards kernel/bpf/cgroup.c which contains implementation
> mostly for cgroup-attached program types, helpers, etc.

Then why are we using it to guard
BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)?

> A lot of other cgroup-related implementation like cgroup_iter, some
> cgroup related helper (not related to cgroup-attached program types), etc.
> are guarded with CONFIG_CGROUPS and CONFIG_BPF_SYSCALL.
>
> Note that it is totally possible CONFIG_CGROUP_BPF is 'n' while
> CONFIG_CGROUPS and CONFIG_BPF_SYSCALL are 'y'.
>
> So for cgroup local storage implemented in this patch set,
> using CONFIG_CGROUPS and CONFIG_BPF_SYSCALL seems okay.

I agree that it's fine to use CONFIG_CGROUPS here. What I'm not
understanding is why we're using CONFIG_CGROUP_BPF to guard defining
BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops), and then in
the Makefile we're using CONFIG_CGROUPS to add bpf_cgrp_storage.o. In other
words, I think there's a mismatch between:

--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
 #ifdef CONFIG_CGROUP_BPF
     ^^ why this instead of CONFIG_CGROUPS for BPF_MAP_TYPE_CGRP_STORAGE?
 BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)

and

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 341c94f208f4..3a12e6b400a2 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
 ifeq ($(CONFIG_CGROUPS),y)
-obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
+obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgrp_storage.o
 endif
 obj-$(CONFIG_CGROUP_BPF) += cgroup.o
 ifeq ($(CONFIG_INET),y)

> > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
> > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgrp_storage.o
> > >  endif
> > >  obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> > >  ifeq ($(CONFIG_INET),y)

[...]
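
Coming back to the guard question above, to be explicit about what I'd
expect instead: the bpf_types.h entry for the new map type should sit under
the same option that builds bpf_cgrp_storage.o, i.e. something along these
lines (illustrative only, not a tested diff):

  /* include/linux/bpf_types.h */
  #ifdef CONFIG_CGROUPS
  BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
  #endif

so that the map definition, the map implementation, and the helpers are all
gated consistently.
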
> > > +         * could be modifying the local_storage->list now.
> > > +         * Thus, no elem can be added-to or deleted-from the
> > > +         * local_storage->list by the bpf_prog or by the bpf-map's syscall.
> > > +         *
> > > +         * It is racing with bpf_local_storage_map_free() alone
> > > +         * when unlinking elem from the local_storage->list and
> > > +         * the map's bucket->list.
> > > +         */
> > > +        bpf_cgrp_storage_lock();
> > > +        raw_spin_lock_irqsave(&local_storage->lock, flags);
> > > +        hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > > +                bpf_selem_unlink_map(selem);
> > > +                free_cgroup_storage =
> > > +                        bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> >
> > This still requires a comment explaining why it's OK to overwrite
> > free_cgroup_storage with a previous value from calling
> > bpf_selem_unlink_storage_nolock(). Even if that is safe, this looks like
> > a pretty weird programming pattern, and IMO doing this feels more
> > intentional and future-proof:
> >
> > if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false))
> >         free_cgroup_storage = true;
>
> We have a comment a few lines below.
>
> /* free_cgroup_storage should always be true as long as
>  * local_storage->list was non-empty.
>  */
> if (free_cgroup_storage)
>         kfree_rcu(local_storage, rcu);

IMO that comment doesn't provide much useful information -- it states an
assumption, but doesn't give a reason for it.

> I will add more explanation in the above code like
>
> bpf_selem_unlink_map(selem);
> /* If the local_storage list only has one element, the
>  * bpf_selem_unlink_storage_nolock() will return true.
>  * Otherwise, it will return false. The current loop iteration
>  * intends to remove all local storage. So the last iteration
>  * of the loop will set the free_cgroup_storage to true.
>  */
> free_cgroup_storage =
>         bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);

Thanks, this is the type of comment I was looking for.

Also, I realize this was copy-pasted from a number of other possible
locations in the codebase which are doing the same thing, but I still think
this pattern is an odd and brittle way to do this. We're relying on an
abstracted implementation detail of bpf_selem_unlink_storage_nolock() for
correctness, which IMO is a signal that bpf_selem_unlink_storage_nolock()
should probably be the one invoking kfree_rcu() on behalf of callers in the
first place. It looks like all of the callers end up calling kfree_rcu() on
the struct bpf_local_storage * if bpf_selem_unlink_storage_nolock() returns
true, so can we just move the responsibility of freeing the local storage
object down into bpf_selem_unlink_storage_nolock() where it's unlinked?

IMO this can be done in a separate patch set, if we decide it's worth doing
at all.
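
To sketch what I'm suggesting (illustrative only, not a tested patch -- the
real function also has to pick the right RCU flavor via use_trace_rcu and
handle the uncharging and owner unlinking, which I'm eliding here):

  static void bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
                                              struct bpf_local_storage_elem *selem,
                                              bool uncharge_mem, bool use_trace_rcu)
  {
          /* ... existing uncharge / owner-unlink logic ... */

          hlist_del_init_rcu(&selem->snode);

          /* If that was the last element, nothing can reach local_storage
           * anymore (the owner's pointer was cleared in the elided logic
           * above), so free it here rather than returning a flag that every
           * caller has to remember to act on.
           */
          if (hlist_empty(&local_storage->list))
                  kfree_rcu(local_storage, rcu);
  }

Callers like bpf_cgrp_storage_free() could then drop their
free_*_storage locals entirely.
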
> >
> > > +        }
> > > +        raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > > +        bpf_cgrp_storage_unlock();
> > > +        rcu_read_unlock();
> > > +
> > > +        /* free_cgroup_storage should always be true as long as
> > > +         * local_storage->list was non-empty.
> > > +         */
> > > +        if (free_cgroup_storage)
> > > +                kfree_rcu(local_storage, rcu);
> > > +}
> > > +
> > > +static struct bpf_local_storage_data *
> > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
> > > +{
> > > +        struct bpf_local_storage *cgroup_storage;
> > > +        struct bpf_local_storage_map *smap;
> > > +
> > > +        cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
> > > +                                               bpf_rcu_lock_held());
> > > +        if (!cgroup_storage)
> > > +                return NULL;
> > > +
> > > +        smap = (struct bpf_local_storage_map *)map;
> > > +        return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
> > > +}
> > > +
> > > +static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
> > > +{
> > > +        struct bpf_local_storage_data *sdata;
> > > +        struct cgroup *cgroup;
> > > +        int fd;
> > > +
> > > +        fd = *(int *)key;
> > > +        cgroup = cgroup_get_from_fd(fd);
> > > +        if (IS_ERR(cgroup))
> > > +                return ERR_CAST(cgroup);
> > > +
> > > +        bpf_cgrp_storage_lock();
> > > +        sdata = cgroup_storage_lookup(cgroup, map, true);
> > > +        bpf_cgrp_storage_unlock();
> > > +        cgroup_put(cgroup);
> > > +        return sdata ? sdata->data : NULL;
> > > +}
> >
> > Stanislav pointed out in the v1 revision that there's a lot of very
> > similar logic in task storage, and I think you'd mentioned that you were
> > going to think about generalizing some of that. Have you had a chance to
> > consider?
>
> It is hard to have a common function for
> lookup_elem/update_elem/delete_elem(). They are quite different as each
> heavily involves task/cgroup-specific functions.

Yes agreed, each implementation is acquiring its own references, finding
the backing element in whatever way it was implemented, etc.

> but map_alloc and map_free could have common helpers.

Agreed, and many of the static functions that are invoked on those paths,
such as bpf_cgrp_storage_free(), bpf_cgrp_storage_lock(), etc., possibly as
well.

In general this feels like something we could pretty easily simplify using
something like a structure with callbacks to implement the pieces of logic
that are specific to each local storage type, such as getting the struct
bpf_local_storage __rcu * pointer from some context (e.g.
cgroup_storage_ptr()). It doesn't necessarily need to block this change,
but IMO we should clean this up soon because a lot of this is nearly a 100%
copy-paste of other local storage implementations. I've put a rough sketch
of what I'm picturing in a P.S. below.

Thanks,
David
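
P.S. Here's the rough shape I have in mind -- the struct and member names
are made up purely for illustration, not a proposal for the exact API:

  /* Illustrative only. One of these per local storage flavor (sk, task,
   * inode, cgrp); the generic code in bpf_local_storage.c would call
   * through it instead of each flavor open-coding its own alloc/free/lookup
   * plumbing.
   */
  struct bpf_local_storage_owner_ops {
          /* Return the owner's storage slot, e.g. &cgrp->bpf_cgrp_storage,
           * which is what cgroup_storage_ptr() does today.
           */
          struct bpf_local_storage __rcu **(*storage_ptr)(void *owner);

          /* Resolve a syscall-side key (an fd, a pid, ...) to an owner
           * object with a reference held, and release that reference.
           */
          void *(*owner_get)(void *key);
          void (*owner_put)(void *owner);

          /* Per-flavor recursion protection, e.g. what
           * bpf_cgrp_storage_lock()/unlock() do for cgroup storage.
           */
          void (*lock)(void);
          void (*unlock)(void);
  };
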