Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Mon, 11 Sep 2023 11:10:19 -0700

On Fri, Sep 8, 2023 at 7:39 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 9/9/2023 6:34 AM, Andrii Nakryiko wrote:
> > On Wed, Aug 30, 2023 at 11:29 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> Hi Andrii,
> >>
> >> On 8/26/2023 2:33 AM, Andrii Nakryiko wrote:
> >>> On Tue, Aug 22, 2023 at 6:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> SNIP
> >>>>>> Yes. bpf prog will use dynptr as the map key. The bpf program will use
> >>>>>> the same map helpers as hash map to operate on qp-trie and the verifier
> >>>>>> will be updated to allow using dynptr as map key for qp-trie.
> >>>>> And that's the problem I just mentioned.
> >>>>> PTR_TO_MAP_KEY is special. I don't think we should hack it to also
> >>>>> mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).
> >>>> Sorry for misunderstanding your reply. But before switch to the kfunc
> >>>> way, could you please point me to some code or function which shows the
> >>>> specialty of PTR_MAP_KEY ?
> >>>>
> >>>>
> >>> Search in kernel/bpf/verifier.c how PTR_TO_MAP_KEY is handled. The
> >>> logic assumes that there is associated struct bpf_map * pointer from
> >>> which we know fixed-sized key length.
> >> Thanks for the information. Will check that.
> >>> But getting back to the topic at hand. I vaguely remember discussion
> >>> we had, but it would be good if you could summarize it again here to
> >>> avoid talking past each other. What is the bpf_map_ops changes you
> >>> were thinking to do? How bpf_attr will look like? How BPF-side API for
> >>> lookup/delete/update will look like? And then let's go from there?
> >>> Thanks!
> >> Sorry for the late reply. I am a bit distracted by other work this week.
> >>
> >> For bpf_attr, a new field 'dynkey_size' is added to support
> >> BPF_MAP_{LOOKUP/UPDATE/DELETE}_ELEM and BPF_MAP_GET_NEXT_KEY on qp-trie
> >> as shown below:
> >>
> >> struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> >>         __u32           map_fd;
> >>         __aligned_u64   key;
> >>         union {
> >>                 __aligned_u64 value;
> >>                 __aligned_u64 next_key;
> >>         };
> >>         __u64           flags;
> >>         __u32           dynkey_size;    /* input/output for
> >>                                          * BPF_MAP_GET_NEXT_KEY. input
> >>                                          * only for other commands.
> >>                                          */
> > hm.. I wonder if it would be more elegant to add `key_size` and
> > `value_size`, and allow to specify it (optionally) even for maps that
> > have fixed-size keys and values. Return error if expected key/value
> > size doesn't match map definition. From libbpf side, libbpf can be
> > smart to not set it on older kernels (or if user didn't provide this
> > information). But for bpf_map__lookup_elem() and other higher-level
> > APIs, we should have all this information available.
>
> I am OK with the addition of key_size and value_size in bpf_attr and I
> will try to do that. After the addition, bpf syscall will also need to
> check these two size for fixed-size map, but I am a bit worried about
> the compatibility of libbpf and kernel for fixed-size map. There are
> three possible cases:
> 1) new libbpf and older kernel. key_size and value_size will be ignored
> by older kernel, because the definition of bpf_attr in older kernel is
> shorter. It will be OK.

Not ignored. Libbpf will have to zero out unknown fields. But yes, it
will be OK.

> 2) old libbpf and new kernel.  key_size and value_size will be zero for
> kernel, because libbpf doesn't pass these values. We can use 0 as an
> unspecified size, but for some map (e.g., bloom-filter) zero key_size is
> also valid, so do we need to introduce a feature bit to tell both
> key_size and value_size are valid ?

I wouldn't bother. Zero is "not provided". It will still be a valid
size for zero-sized key/value cases.

> 3) matched libbpf and kernel. Same problem as 2): use zero as an
> unspecified size or an extra flag is needed.

Again, I don't see a problem. Valid case (zero expected, zero
provided) will work correctly. The only case where we could have
caught an error would be expected non-zero size, but provided zero
size (which we treat as not specified size). Kernel won't reject it
outright, but that's backwards compatible behavior anyway, so I think
it's fine.

tl;dr I wouldn't bother with the new flag, it seems unnecessary.

>
> I totally missed these higher-level APIs with both key_sz and value_sz.
> It seems these high-level also need updates to support qp-trie (e.g.,
> bpf_map__get_next_key, because the size of current key and next key are
> different).
> >
> >

yep, we probably will have to add more generic bpf_map__next_key(map,
cur_key, cur_key_size, *next_key, *next_key_sz)

Keep in mind that for next_key operation the user should provide the
maximum next key size it expects, otherwise this kernel API will be
hard to use: very easy for the kernel to overrun a user-provided
buffer. So the next key size will be an in/out param. On input, it's
maximum expected size, on output -- actual filled out size.

> >> };
> >>
> >> And 4 new APIs are added in libbpf to support basic operations on qp-trie:
> >>
> >> LIBBPF_API int bpf_map_update_dynkey_elem(int fd, const void *key,
> >> unsigned int key_size, const void *value, __u64 flags);
> >> LIBBPF_API int bpf_map_lookup_dynkey_elem(int fd, const void *key,
> >> unsigned int key_size, void *value);
> >> LIBBPF_API int bpf_map_delete_dynkey_elem(int fd, const void *key,
> >> unsigned int key_size);
> >> LIBBPF_API int bpf_map_get_next_dynkey(int fd, const void *key, void
> >> *next_key, unsigned int *key_size);
> >>
> >> About 3 weeks again, I have used the lowest bit of key pointer in
> >> .map_lookup_elem/.map_update_elem/.map_delete_elem to distinguish
> >> between bpf_user_dynkey-typed key from syscall and bpf_dynptr_kern-typed
> >> key from bpf program. The definition of bpf_user_dynkey and its
> >> allocation method are shown below. bpf syscall uses it to allocate
> >> variable-sized key and passes it to qp-trie.
> >>
> >> /* Allocate bpf_user_dynkey and its data together */
> >> struct bpf_user_dynkey {
> >>         unsigned int size;
> >>         void *data;
> >> };
> >>
> >> static void *bpf_new_user_dynkey(unsigned int size)
> >> {
> >>         struct bpf_user_dynkey *dynkey;
> >>         size_t total;
> >>
> >>         total = round_up(sizeof(*dynkey) + size, 2);
> >>         dynkey = kvmalloc(total, GFP_USER | __GFP_NOWARN);
> >>         if (!dynkey)
> >>                 return ERR_PTR(-ENOMEM);
> >>
> >>         dynkey->size = size;
> >>         dynkey->data = &dynkey[1];
> >>         return (void *)((long)dynkey | BPF_USER_DYNKEY_MARK);
> >> }
> >>
> >>
> >> After Alexei suggested that bit hack is only OK for memory or
> >> performance reason, I'm planning to add 2 new callbacks in bpf_map_ops
> >> to support update/delete operations in bpf syscall as shown below, but I
> >> have tried it yet.
> >>
> >> /* map is generic key/value storage optionally accessible by eBPF
> >> programs */
> >> struct bpf_map_ops {
> >>         /* funcs callable from userspace (via syscall) */
> >>         /* ...... */
> >>         void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
> > a bit confused, did you mean to also have key_size as a third argument here?
>
> Ah, I am planning to pass bpf_user_dynkey to these newly-added APIs and
> bpf_user_dynke::size is the size of the key. Passing a plain pointer and
> key size is also fine.

Wait, you want to pass bpf_user_dynkey from user-space side? Why?
pointer + size as part of bpf_attr seems like a more straightforward
syscall-side interface, IMO.

> >
> >>         long (*map_update_elem_sys_only)(struct bpf_map *map, void *key,
> >> void *value, u64 flags);
> >>         long (*map_delete_elem_sys_only)(struct bpf_map *map, void *key);
> >>         /* ...... */
> >> };
> >>
> >>
> >>
> >>
> >>
> >>
> >>
>