Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?

Hou Tao <houtao@xxxxxxxxxxxxxxx> · Sat, 9 Sep 2023 10:39:11 +0800

Hi,

On 9/9/2023 6:34 AM, Andrii Nakryiko wrote:
> On Wed, Aug 30, 2023 at 11:29 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi Andrii,
>>
>> On 8/26/2023 2:33 AM, Andrii Nakryiko wrote:
>>> On Tue, Aug 22, 2023 at 6:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> SNIP
>>>>>> Yes. bpf prog will use dynptr as the map key. The bpf program will use
>>>>>> the same map helpers as hash map to operate on qp-trie and the verifier
>>>>>> will be updated to allow using dynptr as map key for qp-trie.
>>>>> And that's the problem I just mentioned.
>>>>> PTR_TO_MAP_KEY is special. I don't think we should hack it to also
>>>>> mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).
>>>> Sorry for misunderstanding your reply. But before switch to the kfunc
>>>> way, could you please point me to some code or function which shows the
>>>> specialty of PTR_MAP_KEY ?
>>>>
>>>>
>>> Search in kernel/bpf/verifier.c how PTR_TO_MAP_KEY is handled. The
>>> logic assumes that there is associated struct bpf_map * pointer from
>>> which we know fixed-sized key length.
>> Thanks for the information. Will check that.
>>> But getting back to the topic at hand. I vaguely remember discussion
>>> we had, but it would be good if you could summarize it again here to
>>> avoid talking past each other. What is the bpf_map_ops changes you
>>> were thinking to do? How bpf_attr will look like? How BPF-side API for
>>> lookup/delete/update will look like? And then let's go from there?
>>> Thanks!
>> Sorry for the late reply. I am a bit distracted by other work this week.
>>
>> For bpf_attr, a new field 'dynkey_size' is added to support
>> BPF_MAP_{LOOKUP/UPDATE/DELETE}_ELEM and BPF_MAP_GET_NEXT_KEY on qp-trie
>> as shown below:
>>
>> struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
>>         __u32           map_fd;
>>         __aligned_u64   key;
>>         union {
>>                 __aligned_u64 value;
>>                 __aligned_u64 next_key;
>>         };
>>         __u64           flags;
>>         __u32           dynkey_size;    /* input/output for
>>                                          * BPF_MAP_GET_NEXT_KEY. input
>>                                          * only for other commands.
>>                                          */
> hm.. I wonder if it would be more elegant to add `key_size` and
> `value_size`, and allow to specify it (optionally) even for maps that
> have fixed-size keys and values. Return error if expected key/value
> size doesn't match map definition. From libbpf side, libbpf can be
> smart to not set it on older kernels (or if user didn't provide this
> information). But for bpf_map__lookup_elem() and other higher-level
> APIs, we should have all this information available.

I am OK with the addition of key_size and value_size in bpf_attr and I
will try to do that. After the addition, bpf syscall will also need to
check these two size for fixed-size map, but I am a bit worried about
the compatibility of libbpf and kernel for fixed-size map. There are
three possible cases:
1) new libbpf and older kernel. key_size and value_size will be ignored
by older kernel, because the definition of bpf_attr in older kernel is
shorter. It will be OK.
2) old libbpf and new kernel.  key_size and value_size will be zero for
kernel, because libbpf doesn't pass these values. We can use 0 as an
unspecified size, but for some map (e.g., bloom-filter) zero key_size is
also valid, so do we need to introduce a feature bit to tell both
key_size and value_size are valid ?
3) matched libbpf and kernel. Same problem as 2): use zero as an
unspecified size or an extra flag is needed.

I totally missed these higher-level APIs with both key_sz and value_sz. 
It seems these high-level also need updates to support qp-trie (e.g.,
bpf_map__get_next_key, because the size of current key and next key are
different).
>
>
>> };
>>
>> And 4 new APIs are added in libbpf to support basic operations on qp-trie:
>>
>> LIBBPF_API int bpf_map_update_dynkey_elem(int fd, const void *key,
>> unsigned int key_size, const void *value, __u64 flags);
>> LIBBPF_API int bpf_map_lookup_dynkey_elem(int fd, const void *key,
>> unsigned int key_size, void *value);
>> LIBBPF_API int bpf_map_delete_dynkey_elem(int fd, const void *key,
>> unsigned int key_size);
>> LIBBPF_API int bpf_map_get_next_dynkey(int fd, const void *key, void
>> *next_key, unsigned int *key_size);
>>
>> About 3 weeks again, I have used the lowest bit of key pointer in
>> .map_lookup_elem/.map_update_elem/.map_delete_elem to distinguish
>> between bpf_user_dynkey-typed key from syscall and bpf_dynptr_kern-typed
>> key from bpf program. The definition of bpf_user_dynkey and its
>> allocation method are shown below. bpf syscall uses it to allocate
>> variable-sized key and passes it to qp-trie.
>>
>> /* Allocate bpf_user_dynkey and its data together */
>> struct bpf_user_dynkey {
>>         unsigned int size;
>>         void *data;
>> };
>>
>> static void *bpf_new_user_dynkey(unsigned int size)
>> {
>>         struct bpf_user_dynkey *dynkey;
>>         size_t total;
>>
>>         total = round_up(sizeof(*dynkey) + size, 2);
>>         dynkey = kvmalloc(total, GFP_USER | __GFP_NOWARN);
>>         if (!dynkey)
>>                 return ERR_PTR(-ENOMEM);
>>
>>         dynkey->size = size;
>>         dynkey->data = &dynkey[1];
>>         return (void *)((long)dynkey | BPF_USER_DYNKEY_MARK);
>> }
>>
>>
>> After Alexei suggested that bit hack is only OK for memory or
>> performance reason, I'm planning to add 2 new callbacks in bpf_map_ops
>> to support update/delete operations in bpf syscall as shown below, but I
>> have tried it yet.
>>
>> /* map is generic key/value storage optionally accessible by eBPF
>> programs */
>> struct bpf_map_ops {
>>         /* funcs callable from userspace (via syscall) */
>>         /* ...... */
>>         void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
> a bit confused, did you mean to also have key_size as a third argument here?

Ah, I am planning to pass bpf_user_dynkey to these newly-added APIs and
bpf_user_dynke::size is the size of the key. Passing a plain pointer and
key size is also fine.
>
>>         long (*map_update_elem_sys_only)(struct bpf_map *map, void *key,
>> void *value, u64 flags);
>>         long (*map_delete_elem_sys_only)(struct bpf_map *map, void *key);
>>         /* ...... */
>> };
>>
>>
>>
>>
>>
>>
>>