Re: [PATCH v4 bpf-next 3/5] bpf: Make cgroup storages shared across attaches on the same cgroup

Martin KaFai Lau <kafai@xxxxxx> · Tue, 21 Jul 2020 15:41:58 -0700

On Tue, Jul 21, 2020 at 04:20:00PM -0500, YiFei Zhu wrote:
> On Tue, Jul 21, 2020 at 1:05 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> > I quickly checked zero is actually BPF_CGROUP_INET_INGRESS instead
> > of UNSPEC for attach_type.  It will be confusing on the syscall
> > side. e.g. map dumping with key.attach_type == BPF_CGROUP_INET_INGRESS
> > while the only cgroup program is actually attaching to BPF_CGROUP_INET_EGRESS.
> >
> > I don't have a clean way out for this.  Adding a non-zero UNSPEC
> > to "enum bpf_attach_type" seems wrong also.
> > One possible way out is to allow the bpf_cgroup_storage_map
> > to have a smaller key size (i.e. allow both sizeof(cgroup_inode_id)
> > and the existing sizeof(struct bpf_cgroup_storage_key).  That
> > will completely remove attach_type from the picture if the user
> > specified to use sizeof(cgroup_inode_id) as the key_size.
> > If sizeof(struct bpf_cgroup_storage_key) is specified, the attach_type
> > is still used in cmp().
> >
> > The bpf_cgroup_storage_key_cmp() need to do cmp accordingly.
> > Same changes is needed to lookup_elem, delete_elem, check_btf, map_alloc...etc.
> 
> ACK. Considering that the cgroup_inode_id is the first field of struct
> bpf_cgroup_storage_key, I can probably just use this fact and directly
> use the user-provided map key (after it's copied to kernel space) and
> cast the pointer to a __u64 - or are there any architectures where the
> first field is not at offset zero? If this is done then we can change
> all the kernel internal APIs to use __u64 cgroup_inode_id, ignoring
> the existence of the struct aside from the map creation (size checking
> and BTF checking).
Just to be clear.  The problem is the userspace is expecting
the whole key (cgroup_id, attach_type) to be meaningful and
we cannot stop supporting this existing key type now.

This patch essentially sets all attach_type to 0 to mean unused/UNSPEC/dont_care
but 0 is actually BPF_CGROUP_INET_INGRESS in bpf's uapi.  "map dump" will be an
issue as mentioned earlier.  Also, as a key of a map,
it is logical for the user to assume that different attach_type will have
different storage.  e.g. If user lookup by (cgroup_id, 1),  what does it mean
to quietly return the storage created by (cgroup_id, 0/2/3/4/5)?

I think from the map's perspective, it makes more sense to
have different "key-cmp" operation for different key type.  When the map
is created with key_size == (cgroup_id, attach_type),  the "cmp"
will stay as is to compare the attach_type also.  Then the
storage will be sharable among the same cgroup and the
same attach_type.

Then add support for the new key_size == (cgroup_id) which the "cmp" will
only compare the "cgroup_id" as this patch does so that the storage can be
shared among all attach_types of the same cgroup.

Does it make sense?