On Tue, Jun 25, 2024 at 9:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote: > > Hi, > > On 6/26/2024 10:06 AM, Alexei Starovoitov wrote: > > On Mon, Jun 24, 2024 at 7:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote: > >> Hi, > >> > >> Sorry to resurrect the old thread to continue the discussion of APIs for > >> qp-trie. > >> > >> On 8/26/2023 2:33 AM, Andrii Nakryiko wrote: > >>> On Tue, Aug 22, 2023 at 6:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote: > >>>> Hi, > >>>> > >> SNIP > >> > >>>> updated to allow using dynptr as map key for qp-trie. > >>>>> And that's the problem I just mentioned. > >>>>> PTR_TO_MAP_KEY is special. I don't think we should hack it to also > >>>>> mean ARG_PTR_TO_DYNPTR depending on the first argument (map type). > >>>> Sorry for misunderstanding your reply. But before switch to the kfuncl > >>>> way, could you please point me to some code or function which shows the > >>>> specialty of PTR_MAP_KEY ? > >>>> > >>>> > >>> Search in kernel/bpf/verifier.c how PTR_TO_MAP_KEY is handled. The > >>> logic assumes that there is associated struct bpf_map * pointer from > >>> which we know fixed-sized key length. > >>> > >>> But getting back to the topic at hand. I vaguely remember discussion > >>> we had, but it would be good if you could summarize it again here to > >>> avoid talking past each other. What is the bpf_map_ops changes you > >>> were thinking to do? How bpf_attr will look like? How BPF-side API for > >>> lookup/delete/update will look like? And then let's go from there? > >>> Thanks! > >>> > >>> . > >> The APIs for qp-trie are composed of the followings 5 parts: > >> > >> (1) map definition for qp-trie > >> > >> The key is bpf_dynptr and map_extra specifies the max length of key. > >> > >> struct { > >> __uint(type, BPF_MAP_TYPE_QP_TRIE); > >> __type(key, struct bpf_dynptr); > >> __type(value, unsigned int); > >> __uint(map_flags, BPF_F_NO_PREALLOC); > >> __uint(map_extra, 1024); > >> } qp_trie SEC(".maps"); > >> > >> (2) bpf_attr > >> > >> Add key_sz & next_key_sz into anonymous struct to support map with > >> variable-size key. We could add value_sz if the map with variable-size > >> value is supported in the future. > >> > >> struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ > >> __u32 map_fd; > >> __aligned_u64 key; > >> union { > >> __aligned_u64 value; > >> __aligned_u64 next_key; > >> }; > >> __u64 flags; > >> __u32 key_sz; > >> __u32 next_key_sz; > >> }; > >> > >> (3) libbpf API > >> > >> Add bpf_map__get_next_sized_key() to high level APIs. > >> > >> LIBBPF_API int bpf_map__get_next_sized_key(const struct bpf_map *map, > >> const void *cur_key, > >> size_t cur_key_sz, > >> void *next_key, size_t > >> *next_key_sz); > >> > >> Add > >> bpf_map_update_sized_elem()/bpf_map_lookup_sized_elem()/bpf_map_delete_sized_elem()/bpf_map_get_next_sized_key() > >> to low level APIs. > >> These APIs have already considered the case in which map has > >> variable-size value, so there will be no need to add other new APIs to > >> support such case. > >> > >> LIBBPF_API int bpf_map_update_sized_elem(int fd, const void *key, size_t > >> key_sz, > >> const void *value, size_t value_sz, > >> __u64 flags); > >> LIBBPF_API int bpf_map_lookup_sized_elem(int fd, const void *key, size_t > >> key_sz, > >> void *value, size_t *value_sz, > >> __u64 flags); > >> LIBBPF_API int bpf_map_delete_sized_elem(int fd, const void *key, size_t > >> key_sz, > >> __u64 flags); > >> LIBBPF_API int bpf_map_get_next_sized_key(int fd, > >> const void *key, size_t key_sz, > >> void *next_key, size_t > >> *next_key_sz); > > I don't like this approach. > > It looks messy to me and solving one specific case where > > key/value is a blob of bytes. > > In other words it's taking api to pre-BTF days when everything > > was an opaque blob. > > I see. > > I think we need a new object dynptr-like that is composable with other types. > > So that user can say that key is > > struct map_key { > > long foo; > > dynptr_like array; > > int bar; > > }; > > > > I'm not sure whether the existing bpf_dynptr fits exactly, but it's > > close enough. > > Such dynptr_like object should be able to be used as a string. > > And map should allow two such strings: > > struct map_key { > > dynptr_like file_name; > > dynptr_like dir; > > }; > > > > and BTF for such map should see distinguish it as two strings > > and not as a single blob of bytes. > > The observability of bpf maps with bpftool should be able to print it. > > > > The use of such api will look the same from bpf prog and from user space. > > bpf prog can do: > > > > struct map_key key; > > bpf_dynptr_from_whatever(&key.file_name, ...); > > bpf_dynptr_from_whatever(&key.dir, ...); > > bpf_map_lookup_elem(map, &key); > > > > and similar from user space. > > bpf_dynptr_user will be a struct with size and a pointer. > > The existing sys_bpf commands will stay as-is. > > The user space will do: > > > > struct map_key { > > bpf_dynptr_user file_name; > > bpf_dynptr_user dir; > > } key; > > > > key.dir.size = 1000; > > key.dir.ptr = malloc(1000); > > ... > > bpf_map_lookup_elem( &key); // existing syscall cmd > > > > In this case sizeof(struct map_key) == sizeof(bpf_dynptr_user) * 2 == 32 > > > > Both for bpf prog and for user space. > > It seems the idea could be implemented through both hash-table and qp-trie. yes. I think so. > For hash-table, firstly we need to keep each offset of these dynptr_like > objects. During update operation, we need to calculate the hash for each > dynptr_like object and combine these hashes into a new hash. During > lookup, we need to compare each dynptr_like object alone to check > whether or not it is the same as the target element. yep. We already have btf_field/btf_record infra that describe map value. They can be used to describe map key just as well. The tricky part would be to make the whole hash of dynptr-s and compare quick enough without hurting common use case of traditional hash map. > For qp-trie, we also need to keep the offset for each dynptr_like > object. During update operation, we should marshal the passed key into a > plain blob and save the plain blob in qp-trie. During lookup, we don't > marshal the input key, instead we lookup up the qp-trie by using each > field in the map key step-wise. yes. exactly. > However for get_next_key operation, we > need to unmarshal the plain blob into a dynptr_like object. hmm. I haven't thought about get_next_key for qp-trie. A bit tricky indeed. > For the two hypothetical implementations above, I think the lookup > performance may be better than qp-trie and its memory usage will not be > bad, so I prefer to support dynptr_like object in hash map key first. WDYT ? I'd actually do it with qp-trie first. bpftrace is using strings as keys a lot and some scripts use "multi key" too (like string plus int) Currently bpftrace sizes up hash key_size to max and it's not efficient. I think qp-trie with multi-key support would work very well for that use case. iirc early version of qp-trie was limited to nul-terminated strings as a key, but I believe it's possible to tweak the algorithm to support binary key. Then what you describing above how lookup/update of a key will work nicely. I also suspect that lookup will be much faster in qp-trie compared to hash table, hence that's what I would implement first.