On Fri, Oct 21, 2022 at 10:33:41AM -0700, Yonghong Song wrote:

[...]

> > >   /* Note that tracing related programs such as
> > > @@ -5435,6 +5443,42 @@ union bpf_attr {
> > >  *              **-E2BIG** if user-space has tried to publish a sample which is
> > >  *              larger than the size of the ring buffer, or which cannot fit
> > >  *              within a struct bpf_dynptr.
> > > + *
> > > + * void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
> > > + *     Description
> > > + *             Get a bpf_local_storage from the *cgroup*.
> > > + *
> > > + *             Logically, it could be thought of as getting the value from
> > > + *             a *map* with *cgroup* as the **key**. From this
> > > + *             perspective, the usage is not much different from
> > > + *             **bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
> > > + *             helper enforces the key must be a cgroup struct and the map must also
> > > + *             be a **BPF_MAP_TYPE_CGRP_STORAGE**.
> > > + *
> > > + *             Underneath, the value is stored locally at *cgroup* instead of
> > > + *             the *map*. The *map* is used as the bpf-local-storage
> > > + *             "type". The bpf-local-storage "type" (i.e. the *map*) is
> > > + *             searched against all bpf_local_storage residing at *cgroup*.
> >
> > IMO this paragraph is a bit hard to parse. Please correct me if I'm
> > wrong, but I think what it's trying to convey is that when an instance
> > of cgroup bpf-local-storage is accessed by a program in e.g.
> > bpf_cgrp_storage_get(), all of the cgroup bpf_local_storage entries are
> > iterated over in the struct cgroup object until this program's local
> > storage instance is found. Is that right? If so, perhaps something like
> > this would be more clear:
>
> yes. your above interpretation is correct.
>
> >
> > In reality, the local-storage value is embedded directly inside of the
> > *cgroup* object itself, rather than being located in the
> > **BPF_MAP_TYPE_CGRP_STORAGE** map. When the local-storage value is
> > queried for some *map* on a *cgroup* object, the kernel will perform an
> > O(n) iteration over all of the live local-storage values for that
> > *cgroup* object until the local-storage value for the *map* is found.
>
> Sounds okay. I can change the explanation like the above. Thanks!
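
As an aside, the "map lookup keyed by the cgroup" framing is probably
easiest to see from the program side. Purely for illustration, usage of the
new map type ends up looking roughly like the sketch below; this is modeled
on the existing task-local-storage selftests, and the map name, program
name, and attach point are made up here rather than taken from this patch:

  /* Illustrative sketch only -- loosely follows the task-storage selftest style. */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  struct {
          __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, long);
  } cgrp_fork_cnt SEC(".maps"); /* hypothetical map name */

  SEC("tp_btf/sched_process_fork")
  int BPF_PROG(count_fork, struct task_struct *parent, struct task_struct *child)
  {
          long *cnt;

          /* Conceptually a lookup keyed by the parent's cgroup; the value
           * itself lives on the struct cgroup, not in the map.
           */
          cnt = bpf_cgrp_storage_get(&cgrp_fork_cnt, parent->cgroups->dfl_cgrp,
                                     0, BPF_LOCAL_STORAGE_GET_F_CREATE);
          if (cnt)
                  __sync_fetch_and_add(cnt, 1);
          return 0;
  }

From the program's point of view it really does read like a map lookup
keyed by the cgroup, which is what the reworded description above is trying
to capture.
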
> > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > > index 341c94f208f4..3a12e6b400a2 100644
> > > --- a/kernel/bpf/Makefile
> > > +++ b/kernel/bpf/Makefile
> > > @@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
> > >  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
> > >  endif
> > >  ifeq ($(CONFIG_CGROUPS),y)
> >
> > I assume that you double checked that it's valid to compile the helper
> > with CONFIG_CGROUPS && !CONFIG_CGROUP_BPF, but I must admit that even if
> > that's the case, I'm not following why we would want the map to be
> > compiled with a different kconfig option than the helper that provides
> > access to it. If there's a precedent for doing this then I suppose it's
> > fine, but it does seem wrong and/or at least wasteful to compile these
> > helpers in if CONFIG_CGROUPS is defined but CONFIG_CGROUP_BPF is not.
>
> The following is my understanding.
> CONFIG_CGROUP_BPF guards kernel/bpf/cgroup.c which contains implementation
> mostly for cgroup-attached program types, helpers, etc.

Then why are we using it to guard
BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)?

> A lot of other cgroup-related implementation like cgroup_iter, some
> cgroup related helper (not related to cgroup-attached program types), etc.
> are guarded with CONFIG_CGROUPS and CONFIG_BPF_SYSCALL.
>
> Note that it is totally possible CONFIG_CGROUP_BPF is 'n' while
> CONFIG_CGROUPS and CONFIG_BPF_SYSCALL are 'y'.
>
> So for cgroup local storage implemented in this patch set,
> using CONFIG_CGROUPS and CONFIG_BPF_SYSCALL seems okay.

I agree that it's fine to use CONFIG_CGROUPS here. What I'm not
understanding is why we're using CONFIG_CGROUP_BPF to guard defining
BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops), and then in
the Makefile we're using CONFIG_CGROUPS to add bpf_cgrp_storage.o. In other
words, I think there's a mismatch between:

--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
 #ifdef CONFIG_CGROUP_BPF
     ^^ why this instead of CONFIG_CGROUPS for BPF_MAP_TYPE_CGRP_STORAGE?
 BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)

and

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 341c94f208f4..3a12e6b400a2 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
 ifeq ($(CONFIG_CGROUPS),y)
-obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
+obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgrp_storage.o
 endif
 obj-$(CONFIG_CGROUP_BPF) += cgroup.o
 ifeq ($(CONFIG_INET),y)

> > > -obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
> > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgrp_storage.o
> > >  endif
> > >  obj-$(CONFIG_CGROUP_BPF) += cgroup.o
> > >  ifeq ($(CONFIG_INET),y)

[...]
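
Coming back to the guard question above, to be explicit about what I'd
expect instead: the bpf_types.h entry for the new map type should sit under
the same option that builds bpf_cgrp_storage.o, i.e. something along these
lines (illustrative only, not a tested diff):

  /* include/linux/bpf_types.h */
  #ifdef CONFIG_CGROUPS
  BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
  #endif

so that the map definition, the map implementation, and the helpers are all
gated consistently.
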
> > > +         * could be modifying the local_storage->list now.
> > > +         * Thus, no elem can be added-to or deleted-from the
> > > +         * local_storage->list by the bpf_prog or by the bpf-map's syscall.
> > > +         *
> > > +         * It is racing with bpf_local_storage_map_free() alone
> > > +         * when unlinking elem from the local_storage->list and
> > > +         * the map's bucket->list.
> > > +         */
> > > +        bpf_cgrp_storage_lock();
> > > +        raw_spin_lock_irqsave(&local_storage->lock, flags);
> > > +        hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > > +                bpf_selem_unlink_map(selem);
> > > +                free_cgroup_storage =
> > > +                        bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> >
> > This still requires a comment explaining why it's OK to overwrite
> > free_cgroup_storage with a previous value from calling
> > bpf_selem_unlink_storage_nolock(). Even if that is safe, this looks like
> > a pretty weird programming pattern, and IMO doing this feels more
> > intentional and future-proof:
> >
> > if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false))
> >         free_cgroup_storage = true;
>
> We have a comment a few lines below.
>
> /* free_cgroup_storage should always be true as long as
>  * local_storage->list was non-empty.
>  */
> if (free_cgroup_storage)
>         kfree_rcu(local_storage, rcu);

IMO that comment doesn't provide much useful information -- it states an
assumption, but doesn't give a reason for it.

> I will add more explanation in the above code like
>
> bpf_selem_unlink_map(selem);
> /* If the local_storage list only has one element, the
>  * bpf_selem_unlink_storage_nolock() will return true.
>  * Otherwise, it will return false. The current loop iteration
>  * intends to remove all local storage. So the last iteration
>  * of the loop will set the free_cgroup_storage to true.
>  */
> free_cgroup_storage =
>         bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);

Thanks, this is the type of comment I was looking for.

Also, I realize this was copy-pasted from a number of other possible
locations in the codebase which are doing the same thing, but I still think
this pattern is an odd and brittle way to do this. We're relying on an
abstracted implementation detail of bpf_selem_unlink_storage_nolock() for
correctness, which IMO is a signal that bpf_selem_unlink_storage_nolock()
should probably be the one invoking kfree_rcu() on behalf of callers in the
first place. It looks like all of the callers end up calling kfree_rcu() on
the struct bpf_local_storage * if bpf_selem_unlink_storage_nolock() returns
true, so can we just move the responsibility of freeing the local storage
object down into bpf_selem_unlink_storage_nolock() where it's unlinked?

IMO this can be done in a separate patch set, if we decide it's worth doing
at all.
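
To sketch what I'm suggesting (illustrative only, not a tested patch -- the
real function also has to pick the right RCU flavor via use_trace_rcu and
handle the uncharging and owner unlinking, which I'm eliding here):

  static void bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
                                              struct bpf_local_storage_elem *selem,
                                              bool uncharge_mem, bool use_trace_rcu)
  {
          /* ... existing uncharge / owner-unlink logic ... */

          hlist_del_init_rcu(&selem->snode);

          /* If that was the last element, nothing can reach local_storage
           * anymore (the owner's pointer was cleared in the elided logic
           * above), so free it here rather than returning a flag that every
           * caller has to remember to act on.
           */
          if (hlist_empty(&local_storage->list))
                  kfree_rcu(local_storage, rcu);
  }

Callers like bpf_cgrp_storage_free() could then drop their
free_*_storage locals entirely.
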
> >
> > > +        }
> > > +        raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > > +        bpf_cgrp_storage_unlock();
> > > +        rcu_read_unlock();
> > > +
> > > +        /* free_cgroup_storage should always be true as long as
> > > +         * local_storage->list was non-empty.
> > > +         */
> > > +        if (free_cgroup_storage)
> > > +                kfree_rcu(local_storage, rcu);
> > > +}
> > > +
> > > +static struct bpf_local_storage_data *
> > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
> > > +{
> > > +        struct bpf_local_storage *cgroup_storage;
> > > +        struct bpf_local_storage_map *smap;
> > > +
> > > +        cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
> > > +                                               bpf_rcu_lock_held());
> > > +        if (!cgroup_storage)
> > > +                return NULL;
> > > +
> > > +        smap = (struct bpf_local_storage_map *)map;
> > > +        return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
> > > +}
> > > +
> > > +static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
> > > +{
> > > +        struct bpf_local_storage_data *sdata;
> > > +        struct cgroup *cgroup;
> > > +        int fd;
> > > +
> > > +        fd = *(int *)key;
> > > +        cgroup = cgroup_get_from_fd(fd);
> > > +        if (IS_ERR(cgroup))
> > > +                return ERR_CAST(cgroup);
> > > +
> > > +        bpf_cgrp_storage_lock();
> > > +        sdata = cgroup_storage_lookup(cgroup, map, true);
> > > +        bpf_cgrp_storage_unlock();
> > > +        cgroup_put(cgroup);
> > > +        return sdata ? sdata->data : NULL;
> > > +}
> >
> > Stanislav pointed out in the v1 revision that there's a lot of very
> > similar logic in task storage, and I think you'd mentioned that you were
> > going to think about generalizing some of that. Have you had a chance to
> > consider?
>
> It is hard to have a common function for
> lookup_elem/update_elem/delete_elem(). They are quite different as each
> heavily involves task/cgroup-specific functions.

Yes agreed, each implementation is acquiring its own references, finding
the backing element in whatever way it was implemented, etc.

> but map_alloc and map_free could have common helpers.

Agreed, and many of the static functions that are invoked on those paths,
such as bpf_cgrp_storage_free(), bpf_cgrp_storage_lock(), etc., possibly as
well.

In general this feels like something we could pretty easily simplify using
something like a structure with callbacks to implement the pieces of logic
that are specific to each local storage type, such as getting the struct
bpf_local_storage __rcu * pointer from some context (e.g.
cgroup_storage_ptr()). It doesn't necessarily need to block this change,
but IMO we should clean this up soon because a lot of this is nearly a 100%
copy-paste of other local storage implementations. I've put a rough sketch
of what I'm picturing in a P.S. below.

Thanks,
David
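
P.S. Here's the rough shape I have in mind -- the struct and member names
are made up purely for illustration, not a proposal for the exact API:

  /* Illustrative only. One of these per local storage flavor (sk, task,
   * inode, cgrp); the generic code in bpf_local_storage.c would call
   * through it instead of each flavor open-coding its own alloc/free/lookup
   * plumbing.
   */
  struct bpf_local_storage_owner_ops {
          /* Return the owner's storage slot, e.g. &cgrp->bpf_cgrp_storage,
           * which is what cgroup_storage_ptr() does today.
           */
          struct bpf_local_storage __rcu **(*storage_ptr)(void *owner);

          /* Resolve a syscall-side key (an fd, a pid, ...) to an owner
           * object with a reference held, and release that reference.
           */
          void *(*owner_get)(void *key);
          void (*owner_put)(void *owner);

          /* Per-flavor recursion protection, e.g. what
           * bpf_cgrp_storage_lock()/unlock() do for cgroup storage.
           */
          void (*lock)(void);
          void (*unlock)(void);
  };
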