Hi, On 6/27/2024 11:34 AM, Alexei Starovoitov wrote: > On Tue, Jun 25, 2024 at 9:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote: >> Hi, >> >> On 6/26/2024 10:06 AM, Alexei Starovoitov wrote: >>> On Mon, Jun 24, 2024 at 7:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote: >>>> Hi, >>>> >>>> Sorry to resurrect the old thread to continue the discussion of APIs for >>>> qp-trie. >>>> SNIP >>>> (3) libbpf API >>>> >>>> Add bpf_map__get_next_sized_key() to high level APIs. >>>> >>>> LIBBPF_API int bpf_map__get_next_sized_key(const struct bpf_map *map, >>>> const void *cur_key, >>>> size_t cur_key_sz, >>>> void *next_key, size_t >>>> *next_key_sz); >>>> >>>> Add >>>> bpf_map_update_sized_elem()/bpf_map_lookup_sized_elem()/bpf_map_delete_sized_elem()/bpf_map_get_next_sized_key() >>>> to low level APIs. >>>> These APIs have already considered the case in which map has >>>> variable-size value, so there will be no need to add other new APIs to >>>> support such case. >>>> >>>> LIBBPF_API int bpf_map_update_sized_elem(int fd, const void *key, size_t >>>> key_sz, >>>> const void *value, size_t value_sz, >>>> __u64 flags); >>>> LIBBPF_API int bpf_map_lookup_sized_elem(int fd, const void *key, size_t >>>> key_sz, >>>> void *value, size_t *value_sz, >>>> __u64 flags); >>>> LIBBPF_API int bpf_map_delete_sized_elem(int fd, const void *key, size_t >>>> key_sz, >>>> __u64 flags); >>>> LIBBPF_API int bpf_map_get_next_sized_key(int fd, >>>> const void *key, size_t key_sz, >>>> void *next_key, size_t >>>> *next_key_sz); >>> I don't like this approach. >>> It looks messy to me and solving one specific case where >>> key/value is a blob of bytes. >>> In other words it's taking api to pre-BTF days when everything >>> was an opaque blob. >> I see. >>> I think we need a new object dynptr-like that is composable with other types. >>> So that user can say that key is >>> struct map_key { >>> long foo; >>> dynptr_like array; >>> int bar; >>> }; >>> >>> I'm not sure whether the existing bpf_dynptr fits exactly, but it's >>> close enough. >>> Such dynptr_like object should be able to be used as a string. >>> And map should allow two such strings: >>> struct map_key { >>> dynptr_like file_name; >>> dynptr_like dir; >>> }; >>> >>> and BTF for such map should see distinguish it as two strings >>> and not as a single blob of bytes. >>> The observability of bpf maps with bpftool should be able to print it. >>> >>> The use of such api will look the same from bpf prog and from user space. >>> bpf prog can do: >>> >>> struct map_key key; >>> bpf_dynptr_from_whatever(&key.file_name, ...); >>> bpf_dynptr_from_whatever(&key.dir, ...); >>> bpf_map_lookup_elem(map, &key); >>> >>> and similar from user space. >>> bpf_dynptr_user will be a struct with size and a pointer. >>> The existing sys_bpf commands will stay as-is. >>> The user space will do: >>> >>> struct map_key { >>> bpf_dynptr_user file_name; >>> bpf_dynptr_user dir; >>> } key; >>> >>> key.dir.size = 1000; >>> key.dir.ptr = malloc(1000); >>> ... >>> bpf_map_lookup_elem( &key); // existing syscall cmd >>> >>> In this case sizeof(struct map_key) == sizeof(bpf_dynptr_user) * 2 == 32 >>> >>> Both for bpf prog and for user space. >> It seems the idea could be implemented through both hash-table and qp-trie. > yes. I think so. > >> For hash-table, firstly we need to keep each offset of these dynptr_like >> objects. During update operation, we need to calculate the hash for each >> dynptr_like object and combine these hashes into a new hash. During >> lookup, we need to compare each dynptr_like object alone to check >> whether or not it is the same as the target element. > yep. > We already have btf_field/btf_record infra that describe map value. > They can be used to describe map key just as well. > The tricky part would be to make the whole hash of dynptr-s and compare > quick enough without hurting common use case of traditional hash map. > >> For qp-trie, we also need to keep the offset for each dynptr_like >> object. During update operation, we should marshal the passed key into a >> plain blob and save the plain blob in qp-trie. During lookup, we don't >> marshal the input key, instead we lookup up the qp-trie by using each >> field in the map key step-wise. > yes. exactly. > >> However for get_next_key operation, we >> need to unmarshal the plain blob into a dynptr_like object. > hmm. I haven't thought about get_next_key for qp-trie. > A bit tricky indeed. > >> For the two hypothetical implementations above, I think the lookup >> performance may be better than qp-trie and its memory usage will not be >> bad, so I prefer to support dynptr_like object in hash map key first. WDYT ? > I'd actually do it with qp-trie first. > bpftrace is using strings as keys a lot and some scripts use "multi key" too > (like string plus int) > Currently bpftrace sizes up hash key_size to max and it's not efficient. > I think qp-trie with multi-key support would work very well for that use case. > > iirc early version of qp-trie was limited to nul-terminated strings > as a key, but I believe it's possible to tweak the algorithm to support > binary key. Then what you describing above how lookup/update of a key > will work nicely. > I also suspect that lookup will be much faster in qp-trie compared > to hash table, hence that's what I would implement first. The first version of qp-trie had already supported to use arbitrary bytes as key [1]. The lookup performance comparison between qp-trie and hash-table varies according to the benchmark result in the early patch-set [1]. For normal strings (e.g., strings in BTF or kallsyms), hash-table performance better. I will try whether or not it is possible to work out a hack version for both hash-table and qp-trie to compare the lookup performance first. [1]: https://lore.kernel.org/bpf/20220726130005.3102470-1-houtao1@xxxxxxxxxx/