Re: APIs for qp-trie //Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?

Hou Tao <houtao@xxxxxxxxxxxxxxx> · Thu, 27 Jun 2024 22:15:04 +0800

Hi,

On 6/27/2024 11:34 AM, Alexei Starovoitov wrote:
> On Tue, Jun 25, 2024 at 9:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 6/26/2024 10:06 AM, Alexei Starovoitov wrote:
>>> On Mon, Jun 24, 2024 at 7:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> Sorry to resurrect the old thread to continue the discussion of APIs for
>>>> qp-trie.
>>>>

SNIP
>>>> (3) libbpf API
>>>>
>>>> Add bpf_map__get_next_sized_key() to high level APIs.
>>>>
>>>> LIBBPF_API int bpf_map__get_next_sized_key(const struct bpf_map *map,
>>>>                                            const void *cur_key,
>>>>                                            size_t cur_key_sz,
>>>>                                            void *next_key, size_t
>>>> *next_key_sz);
>>>>
>>>> Add
>>>> bpf_map_update_sized_elem()/bpf_map_lookup_sized_elem()/bpf_map_delete_sized_elem()/bpf_map_get_next_sized_key()
>>>> to low level APIs.
>>>> These APIs have already considered the case in which map has
>>>> variable-size value, so there will be no need to add other new APIs to
>>>> support such case.
>>>>
>>>> LIBBPF_API int bpf_map_update_sized_elem(int fd, const void *key, size_t
>>>> key_sz,
>>>>                                          const void *value, size_t value_sz,
>>>>                                          __u64 flags);
>>>> LIBBPF_API int bpf_map_lookup_sized_elem(int fd, const void *key, size_t
>>>> key_sz,
>>>>                                          void *value, size_t *value_sz,
>>>>                                          __u64 flags);
>>>> LIBBPF_API int bpf_map_delete_sized_elem(int fd, const void *key, size_t
>>>> key_sz,
>>>>                                          __u64 flags);
>>>> LIBBPF_API int bpf_map_get_next_sized_key(int fd,
>>>>                                           const void *key, size_t key_sz,
>>>>                                           void *next_key, size_t
>>>> *next_key_sz);
>>> I don't like this approach.
>>> It looks messy to me and solving one specific case where
>>> key/value is a blob of bytes.
>>> In other words it's taking api to pre-BTF days when everything
>>> was an opaque blob.
>> I see.
>>> I think we need a new object dynptr-like that is composable with other types.
>>> So that user can say that key is
>>> struct map_key {
>>>    long foo;
>>>    dynptr_like array;
>>>    int bar;
>>> };
>>>
>>> I'm not sure whether the existing bpf_dynptr fits exactly, but it's
>>> close enough.
>>> Such dynptr_like object should be able to be used as a string.
>>> And map should allow two such strings:
>>> struct map_key {
>>>    dynptr_like file_name;
>>>    dynptr_like dir;
>>> };
>>>
>>> and BTF for such map should see distinguish it as two strings
>>> and not as a single blob of bytes.
>>> The observability of bpf maps with bpftool should be able to print it.
>>>
>>> The use of such api will look the same from bpf prog and from user space.
>>> bpf prog can do:
>>>
>>>  struct map_key key;
>>>  bpf_dynptr_from_whatever(&key.file_name, ...);
>>>  bpf_dynptr_from_whatever(&key.dir, ...);
>>>  bpf_map_lookup_elem(map, &key);
>>>
>>> and similar from user space.
>>> bpf_dynptr_user will be a struct with size and a pointer.
>>> The existing sys_bpf commands will stay as-is.
>>> The user space will do:
>>>
>>> struct map_key {
>>>    bpf_dynptr_user file_name;
>>>    bpf_dynptr_user dir;
>>> } key;
>>>
>>> key.dir.size = 1000;
>>> key.dir.ptr = malloc(1000);
>>> ...
>>> bpf_map_lookup_elem( &key); // existing syscall cmd
>>>
>>> In this case sizeof(struct map_key) == sizeof(bpf_dynptr_user) * 2 == 32
>>>
>>> Both for bpf prog and for user space.
>> It seems the idea could be implemented through both hash-table and qp-trie.
> yes. I think so.
>
>> For hash-table, firstly we need to keep each offset of these dynptr_like
>> objects. During update operation, we need to calculate the hash for each
>> dynptr_like object and combine these hashes into a new hash. During
>> lookup, we need to compare each dynptr_like object alone to check
>> whether or not it is the same as the target element.
> yep.
> We already have btf_field/btf_record infra that describe map value.
> They can be used to describe map key just as well.
> The tricky part would be to make the whole hash of dynptr-s and compare
> quick enough without hurting common use case of traditional hash map.
>
>> For qp-trie, we also need to keep the offset for each dynptr_like
>> object. During update operation, we should marshal the passed key into a
>> plain blob and save the plain blob in qp-trie. During lookup, we don't
>> marshal the input key, instead we lookup up the qp-trie by using each
>> field in the map key step-wise.
> yes. exactly.
>
>> However for get_next_key operation, we
>> need to unmarshal the plain blob into a dynptr_like object.
> hmm. I haven't thought about get_next_key for qp-trie.
> A bit tricky indeed.
>
>> For the two hypothetical implementations above, I think the lookup
>> performance may be better than qp-trie and its memory usage will not be
>> bad, so I prefer to support dynptr_like object in hash map key first. WDYT ?
> I'd actually do it with qp-trie first.
> bpftrace is using strings as keys a lot and some scripts use "multi key" too
> (like string plus int)
> Currently bpftrace sizes up hash key_size to max and it's not efficient.
> I think qp-trie with multi-key support would work very well for that use case.
>
> iirc early version of qp-trie was limited to nul-terminated strings
> as a key, but I believe it's possible to tweak the algorithm to support
> binary key. Then what you describing above how lookup/update of a key
> will work nicely.
> I also suspect that lookup will be much faster in qp-trie compared
> to hash table, hence that's what I would implement first.

The first version of qp-trie had already supported to use arbitrary
bytes as key [1]. The lookup performance comparison between qp-trie and
hash-table varies according to the benchmark result in the early
patch-set [1]. For normal strings (e.g., strings in BTF or kallsyms),
hash-table performance better. I will try whether or not it is possible
to work out a hack version for both hash-table and qp-trie to compare
the lookup performance first.

[1]:
https://lore.kernel.org/bpf/20220726130005.3102470-1-houtao1@xxxxxxxxxx/