Re: APIs for qp-trie //Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 26 Jun 2024 20:34:07 -0700

On Tue, Jun 25, 2024 at 9:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 6/26/2024 10:06 AM, Alexei Starovoitov wrote:
> > On Mon, Jun 24, 2024 at 7:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >> Hi,
> >>
> >> Sorry to resurrect the old thread to continue the discussion of APIs for
> >> qp-trie.
> >>
> >> On 8/26/2023 2:33 AM, Andrii Nakryiko wrote:
> >>> On Tue, Aug 22, 2023 at 6:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >>>> Hi,
> >>>>
> >> SNIP
> >>
> >>>> updated to allow using dynptr as map key for qp-trie.
> >>>>> And that's the problem I just mentioned.
> >>>>> PTR_TO_MAP_KEY is special. I don't think we should hack it to also
> >>>>> mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).
> >>>> Sorry for misunderstanding your reply. But before switch to the kfuncl
> >>>> way, could you please point me to some code or function which shows the
> >>>> specialty of PTR_MAP_KEY ?
> >>>>
> >>>>
> >>> Search in kernel/bpf/verifier.c how PTR_TO_MAP_KEY is handled. The
> >>> logic assumes that there is associated struct bpf_map * pointer from
> >>> which we know fixed-sized key length.
> >>>
> >>> But getting back to the topic at hand. I vaguely remember discussion
> >>> we had, but it would be good if you could summarize it again here to
> >>> avoid talking past each other. What is the bpf_map_ops changes you
> >>> were thinking to do? How bpf_attr will look like? How BPF-side API for
> >>> lookup/delete/update will look like? And then let's go from there?
> >>> Thanks!
> >>>
> >>> .
> >> The APIs for qp-trie are composed of the followings 5 parts:
> >>
> >> (1) map definition for qp-trie
> >>
> >> The key is bpf_dynptr and map_extra specifies the max length of key.
> >>
> >> struct {
> >>     __uint(type, BPF_MAP_TYPE_QP_TRIE);
> >>     __type(key, struct bpf_dynptr);
> >>     __type(value, unsigned int);
> >>     __uint(map_flags, BPF_F_NO_PREALLOC);
> >>     __uint(map_extra, 1024);
> >> } qp_trie SEC(".maps");
> >>
> >> (2) bpf_attr
> >>
> >> Add key_sz & next_key_sz into anonymous struct to support map with
> >> variable-size key. We could add value_sz if the map with variable-size
> >> value is supported in the future.
> >>
> >>         struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> >>                 __u32           map_fd;
> >>                 __aligned_u64   key;
> >>                 union {
> >>                         __aligned_u64 value;
> >>                         __aligned_u64 next_key;
> >>                 };
> >>                 __u64           flags;
> >>                 __u32           key_sz;
> >>                 __u32           next_key_sz;
> >>         };
> >>
> >> (3) libbpf API
> >>
> >> Add bpf_map__get_next_sized_key() to high level APIs.
> >>
> >> LIBBPF_API int bpf_map__get_next_sized_key(const struct bpf_map *map,
> >>                                            const void *cur_key,
> >>                                            size_t cur_key_sz,
> >>                                            void *next_key, size_t
> >> *next_key_sz);
> >>
> >> Add
> >> bpf_map_update_sized_elem()/bpf_map_lookup_sized_elem()/bpf_map_delete_sized_elem()/bpf_map_get_next_sized_key()
> >> to low level APIs.
> >> These APIs have already considered the case in which map has
> >> variable-size value, so there will be no need to add other new APIs to
> >> support such case.
> >>
> >> LIBBPF_API int bpf_map_update_sized_elem(int fd, const void *key, size_t
> >> key_sz,
> >>                                          const void *value, size_t value_sz,
> >>                                          __u64 flags);
> >> LIBBPF_API int bpf_map_lookup_sized_elem(int fd, const void *key, size_t
> >> key_sz,
> >>                                          void *value, size_t *value_sz,
> >>                                          __u64 flags);
> >> LIBBPF_API int bpf_map_delete_sized_elem(int fd, const void *key, size_t
> >> key_sz,
> >>                                          __u64 flags);
> >> LIBBPF_API int bpf_map_get_next_sized_key(int fd,
> >>                                           const void *key, size_t key_sz,
> >>                                           void *next_key, size_t
> >> *next_key_sz);
> > I don't like this approach.
> > It looks messy to me and solving one specific case where
> > key/value is a blob of bytes.
> > In other words it's taking api to pre-BTF days when everything
> > was an opaque blob.
>
> I see.
> > I think we need a new object dynptr-like that is composable with other types.
> > So that user can say that key is
> > struct map_key {
> >    long foo;
> >    dynptr_like array;
> >    int bar;
> > };
> >
> > I'm not sure whether the existing bpf_dynptr fits exactly, but it's
> > close enough.
> > Such dynptr_like object should be able to be used as a string.
> > And map should allow two such strings:
> > struct map_key {
> >    dynptr_like file_name;
> >    dynptr_like dir;
> > };
> >
> > and BTF for such map should see distinguish it as two strings
> > and not as a single blob of bytes.
> > The observability of bpf maps with bpftool should be able to print it.
> >
> > The use of such api will look the same from bpf prog and from user space.
> > bpf prog can do:
> >
> >  struct map_key key;
> >  bpf_dynptr_from_whatever(&key.file_name, ...);
> >  bpf_dynptr_from_whatever(&key.dir, ...);
> >  bpf_map_lookup_elem(map, &key);
> >
> > and similar from user space.
> > bpf_dynptr_user will be a struct with size and a pointer.
> > The existing sys_bpf commands will stay as-is.
> > The user space will do:
> >
> > struct map_key {
> >    bpf_dynptr_user file_name;
> >    bpf_dynptr_user dir;
> > } key;
> >
> > key.dir.size = 1000;
> > key.dir.ptr = malloc(1000);
> > ...
> > bpf_map_lookup_elem( &key); // existing syscall cmd
> >
> > In this case sizeof(struct map_key) == sizeof(bpf_dynptr_user) * 2 == 32
> >
> > Both for bpf prog and for user space.
>
> It seems the idea could be implemented through both hash-table and qp-trie.

yes. I think so.

> For hash-table, firstly we need to keep each offset of these dynptr_like
> objects. During update operation, we need to calculate the hash for each
> dynptr_like object and combine these hashes into a new hash. During
> lookup, we need to compare each dynptr_like object alone to check
> whether or not it is the same as the target element.

yep.
We already have btf_field/btf_record infra that describe map value.
They can be used to describe map key just as well.
The tricky part would be to make the whole hash of dynptr-s and compare
quick enough without hurting common use case of traditional hash map.

> For qp-trie, we also need to keep the offset for each dynptr_like
> object. During update operation, we should marshal the passed key into a
> plain blob and save the plain blob in qp-trie. During lookup, we don't
> marshal the input key, instead we lookup up the qp-trie by using each
> field in the map key step-wise.

yes. exactly.

> However for get_next_key operation, we
> need to unmarshal the plain blob into a dynptr_like object.

hmm. I haven't thought about get_next_key for qp-trie.
A bit tricky indeed.

> For the two hypothetical implementations above, I think the lookup
> performance may be better than qp-trie and its memory usage will not be
> bad, so I prefer to support dynptr_like object in hash map key first. WDYT ?

I'd actually do it with qp-trie first.
bpftrace is using strings as keys a lot and some scripts use "multi key" too
(like string plus int)
Currently bpftrace sizes up hash key_size to max and it's not efficient.
I think qp-trie with multi-key support would work very well for that use case.

iirc early version of qp-trie was limited to nul-terminated strings
as a key, but I believe it's possible to tweak the algorithm to support
binary key. Then what you describing above how lookup/update of a key
will work nicely.
I also suspect that lookup will be much faster in qp-trie compared
to hash table, hence that's what I would implement first.