Re: [PATCH bpf-next v2 2/6] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs


On 10/21/22 12:57 PM, David Vernet wrote:
On Fri, Oct 21, 2022 at 10:33:41AM -0700, Yonghong Song wrote:

[...]

   /* Note that tracing related programs such as
@@ -5435,6 +5443,42 @@ union bpf_attr {
    *		**-E2BIG** if user-space has tried to publish a sample which is
    *		larger than the size of the ring buffer, or which cannot fit
    *		within a struct bpf_dynptr.
+ *
+ * void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
+ *	Description
+ *		Get a bpf_local_storage from the *cgroup*.
+ *
+ *		Logically, it could be thought of as getting the value from
+ *		a *map* with *cgroup* as the **key**.  From this
+ *		perspective,  the usage is not much different from
+ *		**bpf_map_lookup_elem**\ (*map*, **&**\ *cgroup*) except this
+ *		helper enforces the key must be a cgroup struct and the map must also
+ *		be a **BPF_MAP_TYPE_CGRP_STORAGE**.
+ *
+ *		Underneath, the value is stored locally at *cgroup* instead of
+ *		the *map*.  The *map* is used as the bpf-local-storage
+ *		"type". The bpf-local-storage "type" (i.e. the *map*) is
+ *		searched against all bpf_local_storage residing at *cgroup*.

IMO this paragraph is a bit hard to parse. Please correct me if I'm
wrong, but I think what it's trying to convey is that when an instance
of cgroup bpf-local-storage is accessed by a program in e.g.
bpf_cgrp_storage_get(), all of the cgroup bpf_local_storage entries are
iterated over in the struct cgroup object until this program's local
storage instance is found. Is that right? If so, perhaps something like
this would be more clear:

Yes, your interpretation above is correct.


In reality, the local-storage value is embedded directly inside of the
*cgroup* object itself, rather than being located in the
**BPF_MAP_TYPE_CGRP_STORAGE** map. When the local-storage value is
queried for some *map* on a *cgroup* object, the kernel will perform an
O(n) iteration over all of the live local-storage values for that
*cgroup* object until the local-storage value for the *map* is found.

Sounds okay. I will change the explanation to something like the above.

Thanks!

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 341c94f208f4..3a12e6b400a2 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
   obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
   endif
   ifeq ($(CONFIG_CGROUPS),y)

I assume that you double checked that it's valid to compile the helper
with CONFIG_CGROUPS && !CONFIG_CGROUP_BPF, but I must admit that even if
that's the case, I'm not following why we would want the map to be
compiled with a different kconfig option than the helper that provides
access to it. If there's a precedent for doing this then I suppose it's
fine, but it does seem wrong and/or at least wasteful to compile these
helpers in if CONFIG_CGROUPS is defined but CONFIG_CGROUP_BPF is not.

The following is my understanding.
CONFIG_CGROUP_BPF guards kernel/bpf/cgroup.c, which mostly contains the
implementation for cgroup-attached program types, helpers, etc.

Then why are we using it to guard
BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)?

A lot of other cgroup-related implementations, like cgroup_iter and some
cgroup-related helpers (not related to cgroup-attached program types),
are guarded with CONFIG_CGROUPS and CONFIG_BPF_SYSCALL.

Note that it is totally possible CONFIG_CGROUP_BPF is 'n' while
CONFIG_CGROUPS and CONFIG_BPF_SYSCALL are 'y'.

So for cgroup local storage implemented in this patch set,
using CONFIG_CGROUPS and CONFIG_BPF_SYSCALL seems okay.

I agree that it's fine to use CONFIG_CGROUPS here. What I'm not
understanding is why we're using CONFIG_CGROUP_BPF to guard defining
BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops), and then
in the Makefile we're using CONFIG_CGROUPS to add bpf_cgrp_storage.o.

In other words, I think there's a mismatch between:

--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
  #ifdef CONFIG_CGROUP_BPF

^^ why this instead of CONFIG_CGROUPS for BPF_MAP_TYPE_CGRP_STORAGE?

  BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
  BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
  #endif
  BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
  BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)

and

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 341c94f208f4..3a12e6b400a2 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -25,7 +25,7 @@ ifeq ($(CONFIG_PERF_EVENTS),y)
  obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
  endif
  ifeq ($(CONFIG_CGROUPS),y)
-obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
+obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgrp_storage.o
  endif
  obj-$(CONFIG_CGROUP_BPF) += cgroup.o
  ifeq ($(CONFIG_INET),y)

This makes sense. I will guard
  BPF_MAP_TYPE(BPF_MAP_TYPE_CGRP_STORAGE, cgrp_storage_map_ops)
with CONFIG_CGROUPS.


-obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
+obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o bpf_cgrp_storage.o
   endif
   obj-$(CONFIG_CGROUP_BPF) += cgroup.o
   ifeq ($(CONFIG_INET),y)

[...]

+	 * could be modifying the local_storage->list now.
+	 * Thus, no elem can be added-to or deleted-from the
+	 * local_storage->list by the bpf_prog or by the bpf-map's syscall.
+	 *
+	 * It is racing with bpf_local_storage_map_free() alone
+	 * when unlinking elem from the local_storage->list and
+	 * the map's bucket->list.
+	 */
+	bpf_cgrp_storage_lock();
+	raw_spin_lock_irqsave(&local_storage->lock, flags);
+	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
+		bpf_selem_unlink_map(selem);
+		free_cgroup_storage =
+			bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);

This still requires a comment explaining why it's OK to overwrite
free_cgroup_storage with a previous value from calling
bpf_selem_unlink_storage_nolock(). Even if that is safe, this looks like
a pretty weird programming pattern, and IMO doing this feels more
intentional and future-proof:

if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false))
	free_cgroup_storage = true;

We have a comment a few lines below.
   /* free_cgroup_storage should always be true as long as
    * local_storage->list was non-empty.
    */
   if (free_cgroup_storage)
	kfree_rcu(local_storage, rcu);

IMO that comment doesn't provide much useful information -- it states an
assumption, but doesn't give a reason for it.

I will add more explanation in the above code like

	bpf_selem_unlink_map(selem);
	/* If the local_storage list has only one element,
	 * bpf_selem_unlink_storage_nolock() will return true.
	 * Otherwise, it will return false. The current loop iteration
	 * intends to remove all local storage, so the last iteration
	 * of the loop will set free_cgroup_storage to true.
	 */
	free_cgroup_storage =
		bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);

Thanks, this is the type of comment I was looking for.

Also, I realize this was copy-pasted from a number of other possible
locations in the codebase which are doing the same thing, but I still
think this pattern is an odd and brittle way to do this. We're relying
on an abstracted implementation detail of
bpf_selem_unlink_storage_nolock() for correctness, which IMO is a signal
that bpf_selem_unlink_storage_nolock() should probably be the one
invoking kfree_rcu() on behalf of callers in the first place.  It looks
like all of the callers end up calling kfree_rcu() on the struct
bpf_local_storage * if bpf_selem_unlink_storage_nolock() returns true,
so can we just move the responsibility of freeing the local storage
object down into bpf_selem_unlink_storage_nolock() where it's unlinked?

We probably cannot do this. bpf_selem_unlink_storage_nolock()
is inside the rcu_read_lock() region. We do kfree_rcu() outside
the rcu_read_lock() region.



IMO this can be done in a separate patch set, if we decide it's worth
doing at all.


+	}
+	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
+	bpf_cgrp_storage_unlock();
+	rcu_read_unlock();
+
+	/* free_cgroup_storage should always be true as long as
+	 * local_storage->list was non-empty.
+	 */
+	if (free_cgroup_storage)
+		kfree_rcu(local_storage, rcu);
+}
+
+static struct bpf_local_storage_data *
+cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
+{
+	struct bpf_local_storage *cgroup_storage;
+	struct bpf_local_storage_map *smap;
+
+	cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
+					       bpf_rcu_lock_held());
+	if (!cgroup_storage)
+		return NULL;
+
+	smap = (struct bpf_local_storage_map *)map;
+	return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
+}
+
+static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_local_storage_data *sdata;
+	struct cgroup *cgroup;
+	int fd;
+
+	fd = *(int *)key;
+	cgroup = cgroup_get_from_fd(fd);
+	if (IS_ERR(cgroup))
+		return ERR_CAST(cgroup);
+
+	bpf_cgrp_storage_lock();
+	sdata = cgroup_storage_lookup(cgroup, map, true);
+	bpf_cgrp_storage_unlock();
+	cgroup_put(cgroup);
+	return sdata ? sdata->data : NULL;
+}

Stanislav pointed out in the v1 revision that there's a lot of very
similar logic in task storage, and I think you'd mentioned that you were
going to think about generalizing some of that. Have you had a chance to
consider?

It is hard to have a common function for
lookup_elem/update_elem/delete_elem(); they are quite different, as each
heavily involves task/cgroup-specific functions.

Yes agreed, each implementation is acquiring their own references, and
finding the backing element in whatever way it was implemented, etc.

but map_alloc and map_free could have common helpers.

Agreed, and many of the static functions that are invoked on those paths,
such as bpf_cgrp_storage_free(), bpf_cgrp_storage_lock(), etc., possibly
could be as well. In general this feels like something we could pretty
easily simplify using something like a structure with callbacks to
implement the pieces of logic that are specific to each local storage
type, such as getting the struct bpf_local_storage __rcu * pointer from
some context (e.g. cgroup_storage_ptr()). It doesn't necessarily need to
block this change, but IMO we should clean this up soon because a lot of
this is nearly a 100% copy-paste of other local storage implementations.

Further refactoring is possible. Martin is working on simplifying the
locking mechanism; we can wait for that to be done before refactoring.


Thanks,
David


