Re: [PATCH v4 bpf-next 3/5] bpf: Make cgroup storages shared across attaches on the same cgroup

Martin KaFai Lau <kafai@xxxxxx> · Tue, 21 Jul 2020 17:09:13 -0700

On Tue, Jul 21, 2020 at 03:56:36PM -0700, sdf@xxxxxxxxxx wrote:
> On 07/21, Martin KaFai Lau wrote:
> > On Tue, Jul 21, 2020 at 04:20:00PM -0500, YiFei Zhu wrote:
> > > On Tue, Jul 21, 2020 at 1:05 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> > > > I quickly checked zero is actually BPF_CGROUP_INET_INGRESS instead
> > > > of UNSPEC for attach_type.  It will be confusing on the syscall
> > > > side. e.g. map dumping with key.attach_type == BPF_CGROUP_INET_INGRESS
> > > > while the only cgroup program is actually attaching to
> > BPF_CGROUP_INET_EGRESS.
> > > >
> > > > I don't have a clean way out for this.  Adding a non-zero UNSPEC
> > > > to "enum bpf_attach_type" seems wrong also.
> > > > One possible way out is to allow the bpf_cgroup_storage_map
> > > > to have a smaller key size (i.e. allow both sizeof(cgroup_inode_id)
> > > > and the existing sizeof(struct bpf_cgroup_storage_key).  That
> > > > will completely remove attach_type from the picture if the user
> > > > specified to use sizeof(cgroup_inode_id) as the key_size.
> > > > If sizeof(struct bpf_cgroup_storage_key) is specified, the attach_type
> > > > is still used in cmp().
> > > >
> > > > The bpf_cgroup_storage_key_cmp() need to do cmp accordingly.
> > > > Same changes is needed to lookup_elem, delete_elem, check_btf,
> > map_alloc...etc.
> > >
> > > ACK. Considering that the cgroup_inode_id is the first field of struct
> > > bpf_cgroup_storage_key, I can probably just use this fact and directly
> > > use the user-provided map key (after it's copied to kernel space) and
> > > cast the pointer to a __u64 - or are there any architectures where the
> > > first field is not at offset zero? If this is done then we can change
> > > all the kernel internal APIs to use __u64 cgroup_inode_id, ignoring
> > > the existence of the struct aside from the map creation (size checking
> > > and BTF checking).
> > Just to be clear.  The problem is the userspace is expecting
> > the whole key (cgroup_id, attach_type) to be meaningful and
> > we cannot stop supporting this existing key type now.
> To step back a bit. I think in the commit message we mentioned that
> attach_type is essentially (mostly) meaningless right now.
> If I want to share cgroup storage between BPF_CGROUP_INET_INGRESS and
> BPF_CGROUP_INET_EGRESS, the verifier will refuse to load those programs.
> So doing lookup with different attach_type for the same storage
> shouldn't really happen.
I think we are talking about the existing map->aux check in
bpf_cgroup_storage_assign()?  Right, the map is dedicated for
only one bpf prog which usually has only one expected_attach_type.
Thus, the attach_type of the key is not really useful since it will
always be the same.  However,  it does not mean the exposed
attach_type (of the key) is meaningless or the value is invalid.
The exposed value is valid.

I am trying to say attach_type is still part of the key
and exposed to userspace (e.g. in map dump) but it is sort of
invalid after this change because I am not sure what that "0"
means now.

Also, as part of the key, it is intuitive for user to think the storage is
unique per (cgroup, attach_type).  This uniqueness is always true now because
of the map->aux logic and guaranteed by the verifier.  With this patch, this
"key" intuition is gone where the kernel quietly ignore the attach_type.

> 
> Except. There is one use-case where it does make sense. If you take
> the same loaded program and attach it to both ingress and egress, each
> one will get a separate storage copy. And only in this case
> attach_type really has any meaning.
>
> But since more and more attach types started to require
> expected_attach_type, it seems that the case where the same
> prog is attached to two different hooks should be almost
> non-existent. That's why we've decided to go with
> the idea of completely ignoring it.
> 
> So the question is: is it really worth it trying to preserve
> that obscure use-case (and have more complex implementation) or we can
> safely assume that nobody is doing that sort of thing in the wild (and
> can simplify the internal logic a bit)?
I think it is a reasonable expectation for map-dump to show a
meaningful attach_type if it is indeed part of the key.