Re: [PATCH v2 bpf-next 3/5] bpf: Make cgroup storages shared across attaches on the same cgroup

Martin KaFai Lau <kafai@xxxxxx> · Tue, 14 Jul 2020 16:53:44 -0700

On Thu, Jul 09, 2020 at 05:54:49PM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <zhuyifei@xxxxxxxxxx>
> 
> This change comes in several parts:
> 
> One, the restriction that the CGROUP_STORAGE map can only be used
> by one program is removed. This results in the removal of the field
> 'aux' in struct bpf_cgroup_storage_map, and removal of relevant
> code associated with the field, and removal of now-noop functions
> bpf_free_cgroup_storage and bpf_cgroup_storage_release.
> 
> Second, because there could be multiple attach types to the same
> cgroup, the attach type is completely ignored on comparison in
> the map key. Newly added keys have it zeroed. The only value in
> the key that still matters is the cgroup inode.
> 
> Third, because the storages are now shared, the storages cannot
> be unconditionally freed on program detach. There could be two
> ways to solve this issue:
> * A. Reference count the usage of the storages, and free when the
>      last program is detached.
> * B. Free only when the storage is impossible to be referred to
>      again, i.e. when either the cgroup_bpf it is attached to, or
>      the map itself, is freed.
> Option A has the side effect that, when the user detach and
> reattach a program, whether the program gets a fresh storage
> depends on whether there is another program attached using that
> storage. This could trigger races if the user is multi-threaded,
> and since nondeterminism in data races is evil, go with option B.
> 
> The both the map and the cgroup_bpf now tracks their associated
> storages, and the storage unlink and free are removed from
> cgroup_bpf_detach and added to cgroup_bpf_release and
> cgroup_storage_map_free. Storages are now always individually
> unlinked so the function bpf_cgroup_storages_unlink is now unused
> and removed.
> 
> Fourth, on attach, we reuse the old storage if the key already
> exists in the map. Because the rbtree traversal holds the spinlock
> of the map, during which we can't allocate a new storage if we
> don't find an old storage, instead we preallocate the storage
> unconditionally, and free the preallocated storage if we find an
> old storage in the map. This results in a change of semantics in
> bpf_cgroup_storage{,s}_link, and rename cgroup_storage_insert to
> cgroup_storage_lookup_insert that does both lookup and conditionally
> insert or free.
> 
> Signed-off-by: YiFei Zhu <zhuyifei@xxxxxxxxxx>

[ ... ]

> @@ -101,22 +93,23 @@ static void cgroup_bpf_release(struct work_struct *work)
>  	struct cgroup *p, *cgrp = container_of(work, struct cgroup,
>  					       bpf.release_work);
>  	struct bpf_prog_array *old_array;
> +	struct list_head *storages = &cgrp->bpf.storages;
> +	struct bpf_cgroup_storage *storage, *stmp;
> +
>  	unsigned int type;
>  
>  	mutex_lock(&cgroup_mutex);
>  
>  	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
>  		struct list_head *progs = &cgrp->bpf.progs[type];
> -		struct bpf_prog_list *pl, *tmp;
> +		struct bpf_prog_list *pl, *pltmp;
>  
> -		list_for_each_entry_safe(pl, tmp, progs, node) {
> +		list_for_each_entry_safe(pl, pltmp, progs, node) {
>  			list_del(&pl->node);
>  			if (pl->prog)
>  				bpf_prog_put(pl->prog);
>  			if (pl->link)
>  				bpf_cgroup_link_auto_detach(pl->link);
> -			bpf_cgroup_storages_unlink(pl->storage);
> -			bpf_cgroup_storages_free(pl->storage);
>  			kfree(pl);
>  			static_branch_dec(&cgroup_bpf_enabled_key);
>  		}
> @@ -126,6 +119,11 @@ static void cgroup_bpf_release(struct work_struct *work)
>  		bpf_prog_array_free(old_array);
>  	}
>  
> +	list_for_each_entry_safe(storage, stmp, storages, list_cg) {
> +		bpf_cgroup_storage_unlink(storage);
> +		bpf_cgroup_storage_free(storage);
cgroup_storage_map_free() also does the unlink and free.
It is not clear to me what prevents cgroup_bpf_release()
and cgroup_storage_map_free() from doing the unlink and free at the same time.

A short comment here would be useful if this is fine.
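
To make the concern concrete, here is a minimal userspace sketch, not
kernel code and not what the patch does.  Both teardown paths call the
same helper, and only the caller that actually wins the unlink is allowed
to free.  All names (teardown_model, model_unlink, model_teardown) are
hypothetical, and the C11 atomic exchange is only a stand-in for whatever
lock would serialize the two paths in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for one cgroup storage. */
struct teardown_model {
	atomic_bool linked;	/* still reachable from the map and the cgroup */
	int frees;		/* how many times the "free" step ran */
};

/* Clear the linked flag; true only for the caller that cleared it. */
static bool model_unlink(struct teardown_model *s)
{
	return atomic_exchange(&s->linked, false);
}

/*
 * Shared body of both teardown paths (the cgroup side and the map side
 * in this model): free only if this caller did the unlink.
 */
static void model_teardown(struct teardown_model *s)
{
	if (model_unlink(s))
		s->frees++;	/* stands in for bpf_cgroup_storage_free() */
}
```

If both paths run model_teardown() on the same object, the free step runs
once.  Without such a guard, or a lock held across both list walks, the
second path could unlink and free an object that is already gone.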

> +	}
> +
>  	mutex_unlock(&cgroup_mutex);
>  
>  	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
> @@ -290,6 +288,8 @@ int cgroup_bpf_inherit(struct cgroup *cgrp)
>  	for (i = 0; i < NR; i++)
>  		INIT_LIST_HEAD(&cgrp->bpf.progs[i]);
>  
> +	INIT_LIST_HEAD(&cgrp->bpf.storages);
> +
>  	for (i = 0; i < NR; i++)
>  		if (compute_effective_progs(cgrp, i, &arrays[i]))
>  			goto cleanup;
> @@ -422,7 +422,6 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
>  	struct list_head *progs = &cgrp->bpf.progs[type];
>  	struct bpf_prog *old_prog = NULL;
>  	struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {};
> -	struct bpf_cgroup_storage *old_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {};
>  	struct bpf_prog_list *pl;
>  	int err;
>  
> @@ -458,10 +457,10 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
>  	if (bpf_cgroup_storages_alloc(storage, prog ? : link->link.prog))
>  		return -ENOMEM;
>  
> +	bpf_cgroup_storages_link(storage, cgrp);
Note that here, after the new change in bpf_cgroup_storage_link(),
the storage could be an old/existing storage that is
being used by other bpf progs.

> +
>  	if (pl) {
>  		old_prog = pl->prog;
> -		bpf_cgroup_storages_unlink(pl->storage);
> -		bpf_cgroup_storages_assign(old_storage, pl->storage);
>  	} else {
>  		pl = kmalloc(sizeof(*pl), GFP_KERNEL);
>  		if (!pl) {
Pasting the context cut out by git here:
>			bpf_cgroup_storages_free(storage);
It doesn't seem right to free here if the storage is "old".

>			return -ENOMEM;
>		}
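
A minimal userspace sketch of the hazard, under stated assumptions: a
singly linked list stands in for the rbtree, there is no locking, and
every name (map_model, storage_model, model_link, model_undo_link) is
hypothetical.  The link step reports whether the returned storage was
freshly inserted, so that an error path can avoid freeing a storage that
other progs are already using.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of one storage, keyed by cgroup inode id only. */
struct storage_model {
	uint64_t cgroup_inode_id;
	struct storage_model *next;
};

struct map_model {
	struct storage_model *head;
};

static struct storage_model *model_alloc(uint64_t id)
{
	struct storage_model *s = calloc(1, sizeof(*s));

	if (s)
		s->cgroup_inode_id = id;
	return s;
}

/*
 * Link a preallocated storage.  If the key already exists, the
 * preallocated copy is freed and the existing storage is returned
 * with *fresh set to false.
 */
static struct storage_model *model_link(struct map_model *map,
					struct storage_model *prealloc,
					bool *fresh)
{
	struct storage_model *cur;

	for (cur = map->head; cur; cur = cur->next) {
		if (cur->cgroup_inode_id == prealloc->cgroup_inode_id) {
			free(prealloc);
			*fresh = false;
			return cur;
		}
	}
	prealloc->next = map->head;
	map->head = prealloc;
	*fresh = true;
	return prealloc;
}

/* Error path: only a storage created by this attach may be torn down. */
static bool model_undo_link(struct map_model *map, struct storage_model *s,
			    bool fresh)
{
	struct storage_model **pp;

	if (!fresh)
		return false;	/* shared with other progs, leave it alone */
	for (pp = &map->head; *pp; pp = &(*pp)->next) {
		if (*pp == s) {
			*pp = s->next;
			free(s);
			return true;
		}
	}
	return false;
}
```

In this model a second attach with the same key gets the first storage
back with fresh == false, and model_undo_link() refuses to free it.
Whether a fresh storage should stay published after a failed attach is a
separate question.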

> @@ -480,12 +479,10 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
>  	if (err)
>  		goto cleanup;
>  
> -	bpf_cgroup_storages_free(old_storage);
>  	if (old_prog)
>  		bpf_prog_put(old_prog);
>  	else
>  		static_branch_inc(&cgroup_bpf_enabled_key);
> -	bpf_cgroup_storages_link(pl->storage, cgrp, type);
Another side effect is that the "new" storage is still published to
the map even if the attach has failed.  I think this may be ok.

>  	return 0;
>  
>  cleanup:
> @@ -493,9 +490,6 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
>  		pl->prog = old_prog;
>  		pl->link = NULL;
>  	}
> -	bpf_cgroup_storages_free(pl->storage);
> -	bpf_cgroup_storages_assign(pl->storage, old_storage);
> -	bpf_cgroup_storages_link(pl->storage, cgrp, type);
>  	if (!old_prog) {
>  		list_del(&pl->node);
>  		kfree(pl);
> @@ -679,8 +673,6 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,
>  
>  	/* now can actually delete it from this cgroup list */
>  	list_del(&pl->node);
> -	bpf_cgroup_storages_unlink(pl->storage);
> -	bpf_cgroup_storages_free(pl->storage);
>  	kfree(pl);
>  	if (list_empty(progs))
>  		/* last program was detached, reset flags to zero */

[ ... ]

> diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
> index 51bd5a8cb01b..3baac07bc65c 100644
> --- a/kernel/bpf/local_storage.c
> +++ b/kernel/bpf/local_storage.c
> @@ -20,7 +20,6 @@ struct bpf_cgroup_storage_map {
>  	struct bpf_map map;
>  
>  	spinlock_t lock;
> -	struct bpf_prog_aux *aux;
>  	struct rb_root root;
>  	struct list_head list;
>  };
> @@ -38,10 +37,6 @@ static int bpf_cgroup_storage_key_cmp(
>  		return -1;
>  	else if (key1->cgroup_inode_id > key2->cgroup_inode_id)
>  		return 1;
> -	else if (key1->attach_type < key2->attach_type)
> -		return -1;
> -	else if (key1->attach_type > key2->attach_type)
> -		return 1;
>  	return 0;
>  }
>  
> @@ -81,8 +76,9 @@ static struct bpf_cgroup_storage *cgroup_storage_lookup(
>  	return NULL;
>  }
>  
> -static int cgroup_storage_insert(struct bpf_cgroup_storage_map *map,
> -				 struct bpf_cgroup_storage *storage)
> +static struct bpf_cgroup_storage *
> +cgroup_storage_lookup_insert(struct bpf_cgroup_storage_map *map,
> +			     struct bpf_cgroup_storage *storage)
>  {
>  	struct rb_root *root = &map->root;
>  	struct rb_node **new = &(root->rb_node), *parent = NULL;
> @@ -101,14 +97,15 @@ static int cgroup_storage_insert(struct bpf_cgroup_storage_map *map,
>  			new = &((*new)->rb_right);
>  			break;
>  		default:
> -			return -EEXIST;
> +			bpf_cgroup_storage_free(storage);
> +			return this;
>  		}
>  	}
>  
>  	rb_link_node(&storage->node, parent, new);
>  	rb_insert_color(&storage->node, root);
>  
> -	return 0;
> +	return NULL;
>  }
>  
>  static void *cgroup_storage_lookup_elem(struct bpf_map *_map, void *_key)
> @@ -131,10 +128,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key,
>  	struct bpf_cgroup_storage *storage;
>  	struct bpf_storage_buffer *new;
>  
> -	if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST | BPF_NOEXIST)))
> -		return -EINVAL;
> -
> -	if (unlikely(flags & BPF_NOEXIST))
> +	if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST)))
>  		return -EINVAL;
>  
>  	if (unlikely((flags & BPF_F_LOCK) &&
> @@ -250,16 +244,15 @@ static int cgroup_storage_get_next_key(struct bpf_map *_map, void *_key,
>  		if (!storage)
>  			goto enoent;
>  
> -		storage = list_next_entry(storage, list);
> +		storage = list_next_entry(storage, list_map);
>  		if (!storage)
>  			goto enoent;
>  	} else {
>  		storage = list_first_entry(&map->list,
> -					 struct bpf_cgroup_storage, list);
> +					 struct bpf_cgroup_storage, list_map);
>  	}
>  
>  	spin_unlock_bh(&map->lock);
> -	next->attach_type = storage->key.attach_type;
The map dump (e.g. bpftool map dump) will also show attach_type zero
in the key now.  Please also mention that in the commit message.
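
For the commit message, the user-visible effect can be illustrated with a
small userspace model of the comparison as it reads after this patch.
key_model mirrors the two fields of struct bpf_cgroup_storage_key; the
model names and the inode ids used below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the two fields of struct bpf_cgroup_storage_key. */
struct key_model {
	uint64_t cgroup_inode_id;
	uint32_t attach_type;
};

/* Comparison after this patch: attach_type is no longer consulted. */
static int key_model_cmp(const struct key_model *key1,
			 const struct key_model *key2)
{
	if (key1->cgroup_inode_id < key2->cgroup_inode_id)
		return -1;
	else if (key1->cgroup_inode_id > key2->cgroup_inode_id)
		return 1;
	return 0;
}
```

Two keys with the same cgroup_inode_id but different attach_type values
now compare equal, so they resolve to the same storage.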

>  	next->cgroup_inode_id = storage->key.cgroup_inode_id;
>  	return 0;
>  
> @@ -318,6 +311,13 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
>  static void cgroup_storage_map_free(struct bpf_map *_map)
>  {
>  	struct bpf_cgroup_storage_map *map = map_to_storage(_map);
> +	struct list_head *storages = &map->list;
> +	struct bpf_cgroup_storage *storage, *stmp;
> +
> +	list_for_each_entry_safe(storage, stmp, storages, list_map) {
> +		bpf_cgroup_storage_unlink(storage);
> +		bpf_cgroup_storage_free(storage);
> +	}
>  
>  	WARN_ON(!RB_EMPTY_ROOT(&map->root));
>  	WARN_ON(!list_empty(&map->list));

For the high level consideration:

In general, the idea is to allow a bpf-prog to store something locally
at a cgroup.  What to store at the cgroup is defined by the "value" of
the "bpf_cgroup_storage_map".  i.e. The map helps to define
what cgroup-storage a bpf-prog wants to store (and the map also
keeps track of what cgroups have this storage).

This patch allows a cgroup-storage to be shared among different bpf-progs,
which is a step in the right direction and makes bpf_cgroup_storage_map
behave more like other bpf-maps do.  However, each bpf-prog is still only
allowed to use one "bpf_cgroup_storage_map" (excluding the difference in the
SHARED/PERCPU bpf_cgroup_storage_type).
i.e. each bpf-prog can only access one type of cgroup-storage.
e.g. prog-A stores storage-A.  If prog-B wants to store storage-B and
also read storage-A, that is not possible, if I read it correctly.

I think this patch is a fine extension to the existing
bpf_cgroup_storage_map and a good step forward in making bpf_cgroup_storage_map
sharable like other bpf maps.  That said, have you looked at bpf_sk_storage.c?
It also defines a local storage, for a sk, but there a bpf prog can define
multiple storages to be stored in a sk.  It does something similar to this
patch (e.g. a link to the storage, another link to the map, the lifetime
of the storage is tied to the map and the sk, etc.).  KP Singh is
generalizing it such that a bpf-prog can store data in potentially any
kernel object other than sk [1].  His use case is to store data in an inode.
I think it can be used for cgroups also.  The only thing missing there
is the "PERCPU" type.  It is not there because there was no such need for sk,
but it should be quite doable.

[1]: https://patchwork.ozlabs.org/project/netdev/patch/20200709101239.3829793-2-kpsingh@xxxxxxxxxxxx/