Re: [PATCH v2 bpf-next 3/5] bpf: Make cgroup storages shared across attaches on the same cgroup

> > +     list_for_each_entry_safe(storage, stmp, storages, list_cg) {
> > +             bpf_cgroup_storage_unlink(storage);
> > +             bpf_cgroup_storage_free(storage);
> cgroup_storage_map_free() is also doing unlink and free.
> It is not clear to me what prevent cgroup_bpf_release()
> and cgroup_storage_map_free() from doing unlink and free at the same time.
>
> A few words comment here would be useful if it is fine.

Good catch. This race can indeed happen. Considering that this section is
guarded by cgroup_mutex, and that cgroup_storage_map_free is called in a
sleepable context (either the workqueue or map_create), I think the most
straightforward way is to also hold cgroup_mutex in
cgroup_storage_map_free. Will fix in v3.
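
Concretely, I am thinking of something along these lines (just a sketch
of the v3 direction; the list head and member names follow my reading of
this patch and may still change):

	/* in cgroup_storage_map_free(): serialize against
	 * cgroup_bpf_release(), which unlinks and frees the same
	 * storages under cgroup_mutex
	 */
	mutex_lock(&cgroup_mutex);

	list_for_each_entry_safe(storage, stmp, &map->list, list_map) {
		bpf_cgroup_storage_unlink(storage);
		bpf_cgroup_storage_free(storage);
	}

	mutex_unlock(&cgroup_mutex);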

> > @@ -458,10 +457,10 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
> >       if (bpf_cgroup_storages_alloc(storage, prog ? : link->link.prog))
> >               return -ENOMEM;
> >
> > +     bpf_cgroup_storages_link(storage, cgrp);
> here. After the new change in bpf_cgroup_storage_link(),
> the storage could be an old/existing storage that is
> being used by other bpf progs.

Right, I will move storages_alloc to below this kmalloc and goto
cleanup if storages_alloc fails.
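
Roughly like this (a sketch of the intended ordering only, not the v3
diff; "cleanup" is the existing error-unwind label in
__cgroup_bpf_attach, and the storages_alloc signature may change):

	if (pl) {
		old_prog = pl->prog;
	} else {
		pl = kmalloc(sizeof(*pl), GFP_KERNEL);
		if (!pl)
			return -ENOMEM;
		list_add_tail(&pl->node, progs);
	}

	/* Allocate, or look up an existing and possibly shared, storage
	 * only after pl exists, so a failure here unwinds through the
	 * normal cleanup path instead of freeing a storage that other
	 * progs may still be using.
	 */
	if (bpf_cgroup_storages_alloc(storage, prog ? : link->link.prog)) {
		err = -ENOMEM;
		goto cleanup;
	}

	bpf_cgroup_storages_link(storage, cgrp);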

> > -     bpf_cgroup_storages_free(old_storage);
> >       if (old_prog)
> >               bpf_prog_put(old_prog);
> >       else
> >               static_branch_inc(&cgroup_bpf_enabled_key);
> > -     bpf_cgroup_storages_link(pl->storage, cgrp, type);
> Another side effect is, the "new" storage is still published to
> the map even the attach has failed.  I think this may be ok.

Correct. To really fix this we would need to track exactly which
storages are new and which are old, and I don't think doing so has any
significant benefit given that the lifetime of the storages is still
bounded. If userspace receives the error, it will probably either exit
or retry. Exiting causes the storages to be freed along with the map,
and a successful retry needs the storage to be published anyway. That
is the reasoning for thinking it is okay.


> > -     next->attach_type = storage->key.attach_type;
> The map dump (e.g. bpftool map dump) will also show attach_type zero
> in the key now.  Please also mention that in the commit message.

Will fix in v3.

> This patch allows a cgroup-storage to be shared among different bpf-progs
> which is in the right direction that makes bpf_cgroup_storage_map behaves
> more like other bpf-maps do.  However, each bpf-prog can still only allow
> one "bpf_cgroup_storage_map" to be used (excluding the difference in the
> SHARED/PERCPU bpf_cgroup_storage_type).
> i.e. each bpf-prog can only access one type of cgroup-storage.
> e.g. prog-A stores storage-A.  If prog-B wants to store storage-B and
> also read storage-A, it is not possible if I read it correctly.

You are correct. Lookup is still bound by the per-cpu variable, and
because that variable is an array of storage pointers, one slot per
storage type, a prog still cannot use more than one storage of each
type.
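
For reference, the limitation comes from the single per-cpu slot per
type that is set up before the prog runs (quoting roughly from memory,
from include/linux/bpf.h and include/linux/bpf-cgroup.h):

	enum bpf_cgroup_storage_type {
		BPF_CGROUP_STORAGE_SHARED,
		BPF_CGROUP_STORAGE_PERCPU,
		__BPF_CGROUP_STORAGE_MAX
	};
	#define MAX_BPF_CGROUP_STORAGE_TYPE __BPF_CGROUP_STORAGE_MAX

	/* one pointer per storage type, set before the prog runs, so a
	 * prog can only see one storage of each type at a time
	 */
	DECLARE_PER_CPU(struct bpf_cgroup_storage*,
			bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]);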

> While I think this patch is a fine extension to the existing
> bpf_cgroup_storage_map and a good step forward to make bpf_cgroup_storage_map
> sharable like other bpf maps do.  Have you looked at bpf_sk_storage.c which
> also defines a local storage for a sk but a bpf prog can define multiple
> storages to be stored in a sk.  It is doing similar thing of this
> patch (e.g. a link to the storage, another link to the map, the life
> time of the storage is tied to the map and the sk...etc.).  KP Singh is
> generalizing it such that bpf-prog can store data in potentially any
> kernel object other than sk [1].  His use case is to store data in inode.
> I think it can be used for the cgroup also.  The only thing missing there
> is the "PERCPU" type.  It was not there because there is no such need for sk
> but should be something quite doable.

I think this is a good idea, but it would be a much bigger project. I
would prefer to fix the problem of cgroup_storage not being shareable
first, and when KP's patches land I'll take a look at how to reuse that
code. And, as you said, it would be more involved because the "PERCPU"
type would need to be added.

YiFei Zhu


