On Thu, Jul 23, 2020 at 02:40:56AM -0500, YiFei Zhu wrote: > From: YiFei Zhu <zhuyifei@xxxxxxxxxx> > > This change comes in several parts: > > One, the restriction that the CGROUP_STORAGE map can only be used > by one program is removed. This results in the removal of the field > 'aux' in struct bpf_cgroup_storage_map, and removal of relevant > code associated with the field, and removal of now-noop functions > bpf_free_cgroup_storage and bpf_cgroup_storage_release. > > Second, we permit a key of type u64 as the key to the map. > Providing such a key type indicates that the map should ignore > attach type when comparing map keys. However, for simplicity newly > linked storage will still have the attach type at link time in > its key struct. cgroup_storage_check_btf is adapted to accept > u64 as the type of the key. > > Third, because the storages are now shared, the storages cannot > be unconditionally freed on program detach. There could be two > ways to solve this issue: > * A. Reference count the usage of the storages, and free when the > last program is detached. > * B. Free only when the storage is impossible to be referred to > again, i.e. when either the cgroup_bpf it is attached to, or > the map itself, is freed. > Option A has the side effect that, when the user detach and > reattach a program, whether the program gets a fresh storage > depends on whether there is another program attached using that > storage. This could trigger races if the user is multi-threaded, > and since nondeterminism in data races is evil, go with option B. > > The both the map and the cgroup_bpf now tracks their associated > storages, and the storage unlink and free are removed from > cgroup_bpf_detach and added to cgroup_bpf_release and > cgroup_storage_map_free. The latter also new holds the cgroup_mutex > to prevent any races with the former. > > Fourth, on attach, we reuse the old storage if the key already > exists in the map, via cgroup_storage_lookup. If the storage > does not exist yet, we create a new one, and publish it at the > last step in the attach process. This does not create a race > condition because for the whole attach the cgroup_mutex is held. > We keep track of an array of new storages that was allocated > and if the process fails only the new storages would get freed. > [ ... ] > diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c > index 51bd5a8cb01b..b246ae07f87d 100644 > --- a/kernel/bpf/local_storage.c > +++ b/kernel/bpf/local_storage.c > @@ -9,6 +9,8 @@ > #include <linux/slab.h> > #include <uapi/linux/btf.h> > > +#include "../cgroup/cgroup-internal.h" > + > DEFINE_PER_CPU(struct bpf_cgroup_storage*, bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); > > #ifdef CONFIG_CGROUP_BPF > @@ -20,7 +22,6 @@ struct bpf_cgroup_storage_map { > struct bpf_map map; > > spinlock_t lock; > - struct bpf_prog_aux *aux; > struct rb_root root; > struct list_head list; > }; > @@ -30,24 +31,41 @@ static struct bpf_cgroup_storage_map *map_to_storage(struct bpf_map *map) > return container_of(map, struct bpf_cgroup_storage_map, map); > } > > -static int bpf_cgroup_storage_key_cmp( > - const struct bpf_cgroup_storage_key *key1, > - const struct bpf_cgroup_storage_key *key2) > +static bool attach_type_isolated(const struct bpf_map *map) > { > - if (key1->cgroup_inode_id < key2->cgroup_inode_id) > - return -1; > - else if (key1->cgroup_inode_id > key2->cgroup_inode_id) > - return 1; > - else if (key1->attach_type < key2->attach_type) > - return -1; > - else if (key1->attach_type > key2->attach_type) > - return 1; > + return map->key_size == sizeof(struct bpf_cgroup_storage_key); > +} > + > +static int bpf_cgroup_storage_key_cmp(const struct bpf_cgroup_storage_map *map, > + const void *_key1, const void *_key2) > +{ > + if (attach_type_isolated(&map->map)) { > + const struct bpf_cgroup_storage_key *key1 = _key1; > + const struct bpf_cgroup_storage_key *key2 = _key2; > + > + if (key1->cgroup_inode_id < key2->cgroup_inode_id) > + return -1; > + else if (key1->cgroup_inode_id > key2->cgroup_inode_id) > + return 1; > + else if (key1->attach_type < key2->attach_type) > + return -1; > + else if (key1->attach_type > key2->attach_type) > + return 1; > + } else { > + const __u64 *key1 = _key1; > + const __u64 *key2 = _key2; Nit. Instead of key1 and key2, a meaningful name will be easier to read and tell what it is comparing about later, like the key1->"cgroup_inode_id" in the "if" case above. May be: __u64 cgroup_inode_id1 = *(__u64 *)_key1; __u64 cgroup_inode_id2 = *(__u64 *)_key2; > + > + if (*key1 < *key2) > + return -1; > + else if (*key1 > *key2) > + return 1; > + } > return 0; > } > Others LGTM