Re: APIs for qp-trie //Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 26 Jun 2024 20:35:13 -0700

On Wed, Jun 26, 2024 at 5:02 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Tue, Jun 25, 2024 at 9:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On 6/26/2024 10:06 AM, Alexei Starovoitov wrote:
> > > On Mon, Jun 24, 2024 at 7:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > >> Hi,
> > >>
> > >> Sorry to resurrect the old thread to continue the discussion of APIs for
> > >> qp-trie.
> > >>
> > >> On 8/26/2023 2:33 AM, Andrii Nakryiko wrote:
> > >>> On Tue, Aug 22, 2023 at 6:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
> > >>>> Hi,
> > >>>>
> > >> SNIP
> > >>
> > >>>> updated to allow using dynptr as map key for qp-trie.
> > >>>>> And that's the problem I just mentioned.
> > >>>>> PTR_TO_MAP_KEY is special. I don't think we should hack it to also
> > >>>>> mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).
> > >>>> Sorry for misunderstanding your reply. But before switching to the kfunc
> > >>>> way, could you please point me to some code or function which shows the
> > >>>> specialty of PTR_TO_MAP_KEY ?
> > >>>>
> > >>>>
> > >>> Search in kernel/bpf/verifier.c how PTR_TO_MAP_KEY is handled. The
> > >>> logic assumes that there is associated struct bpf_map * pointer from
> > >>> which we know fixed-sized key length.
> > >>>
> > >>> But getting back to the topic at hand. I vaguely remember discussion
> > >>> we had, but it would be good if you could summarize it again here to
> > >>> avoid talking past each other. What are the bpf_map_ops changes you
> > >>> were thinking of making? What will bpf_attr look like? What will the
> > >>> BPF-side API for lookup/delete/update look like? And then let's go from there?
> > >>> Thanks!
> > >>>
> > >>> .
> > >> The APIs for qp-trie are composed of the following 5 parts:
> > >>
> > >> (1) map definition for qp-trie
> > >>
> > >> The key is bpf_dynptr and map_extra specifies the max length of key.
> > >>
> > >> struct {
> > >>     __uint(type, BPF_MAP_TYPE_QP_TRIE);
> > >>     __type(key, struct bpf_dynptr);
> > >>     __type(value, unsigned int);
> > >>     __uint(map_flags, BPF_F_NO_PREALLOC);
> > >>     __uint(map_extra, 1024);
> > >> } qp_trie SEC(".maps");
> > >>
> > >> (2) bpf_attr
> > >>
> > >> Add key_sz & next_key_sz into anonymous struct to support map with
> > >> variable-size key. We could add value_sz if the map with variable-size
> > >> value is supported in the future.
> > >>
> > >>         struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> > >>                 __u32           map_fd;
> > >>                 __aligned_u64   key;
> > >>                 union {
> > >>                         __aligned_u64 value;
> > >>                         __aligned_u64 next_key;
> > >>                 };
> > >>                 __u64           flags;
> > >>                 __u32           key_sz;
> > >>                 __u32           next_key_sz;
> > >>         };
> > >>
> > >> (3) libbpf API
> > >>
> > >> Add bpf_map__get_next_sized_key() to high level APIs.
> > >>
> > >> LIBBPF_API int bpf_map__get_next_sized_key(const struct bpf_map *map,
> > >>                                            const void *cur_key,
> > >>                                            size_t cur_key_sz,
> > >>                                            void *next_key,
> > >>                                            size_t *next_key_sz);
> > >>
> > >> Add
> > >> bpf_map_update_sized_elem()/bpf_map_lookup_sized_elem()/bpf_map_delete_sized_elem()/bpf_map_get_next_sized_key()
> > >> to low level APIs.
> > >> These APIs have already considered the case in which map has
> > >> variable-size value, so there will be no need to add other new APIs to
> > >> support such case.
> > >>
> > >> LIBBPF_API int bpf_map_update_sized_elem(int fd,
> > >>                                          const void *key, size_t key_sz,
> > >>                                          const void *value, size_t value_sz,
> > >>                                          __u64 flags);
> > >> LIBBPF_API int bpf_map_lookup_sized_elem(int fd,
> > >>                                          const void *key, size_t key_sz,
> > >>                                          void *value, size_t *value_sz,
> > >>                                          __u64 flags);
> > >> LIBBPF_API int bpf_map_delete_sized_elem(int fd,
> > >>                                          const void *key, size_t key_sz,
> > >>                                          __u64 flags);
> > >> LIBBPF_API int bpf_map_get_next_sized_key(int fd,
> > >>                                           const void *key, size_t key_sz,
> > >>                                           void *next_key,
> > >>                                           size_t *next_key_sz);
> > > I don't like this approach.
> > > > It looks messy to me and solves only one specific case where
> > > > key/value is a blob of bytes.
> > > > In other words it takes the API back to pre-BTF days when everything
> > > > was an opaque blob.
> >
> > I see.
> > > > I think we need a new dynptr-like object that is composable with other types.
> > > So that user can say that key is
> > > struct map_key {
> > >    long foo;
> > >    dynptr_like array;
> > >    int bar;
> > > };
> > >
> > > I'm not sure whether the existing bpf_dynptr fits exactly, but it's
> > > close enough.
> > > Such dynptr_like object should be able to be used as a string.
> > > And map should allow two such strings:
> > > struct map_key {
> > >    dynptr_like file_name;
> > >    dynptr_like dir;
> > > };
> > >
> > > > and BTF for such map should distinguish it as two strings
> > > and not as a single blob of bytes.
> > > The observability of bpf maps with bpftool should be able to print it.
> > >
> > > The use of such api will look the same from bpf prog and from user space.
> > > bpf prog can do:
> > >
> > >  struct map_key key;
> > >  bpf_dynptr_from_whatever(&key.file_name, ...);
> > >  bpf_dynptr_from_whatever(&key.dir, ...);
> > >  bpf_map_lookup_elem(map, &key);
> > >
> > > and similar from user space.
> > > bpf_dynptr_user will be a struct with size and a pointer.
> > > The existing sys_bpf commands will stay as-is.
> > > The user space will do:
> > >
> > > struct map_key {
> > >    bpf_dynptr_user file_name;
> > >    bpf_dynptr_user dir;
> > > } key;
> > >
> > > key.dir.size = 1000;
> > > key.dir.ptr = malloc(1000);
> > > ...
> > > bpf_map_lookup_elem( &key); // existing syscall cmd
> > >
> > > In this case sizeof(struct map_key) == sizeof(bpf_dynptr_user) * 2 == 32
> > >
> > > Both for bpf prog and for user space.
> >
> > It seems the idea could be implemented through both hash-table and qp-trie.
> >
> > For hash-table, firstly we need to keep each offset of these dynptr_like
> > objects. During update operation, we need to calculate the hash for each
> > dynptr_like object and combine these hashes into a new hash. During
> > lookup, we need to compare each dynptr_like object alone to check
> > whether or not it is the same as the target element.
> >
> > For qp-trie, we also need to keep the offset for each dynptr_like
> > object. During update operation, we should marshal the passed key into a
> > plain blob and save the plain blob in qp-trie. During lookup, we don't
> > marshal the input key, instead we look up the qp-trie by using each
> > field in the map key step-wise. However for get_next_key operation, we
> > need to unmarshal the plain blob into a dynptr_like object.
> >
> > For the two hypothetical implementations above, I think the lookup
> > performance of the hash-table may be better than that of qp-trie and its
> > memory usage will not be bad, so I prefer to support dynptr_like object in
> > hash map key first. WDYT ?
> >
>
> These nested variable-sized array fields are not really compatible
> with qp-trie (or any trie data structure) to begin with. I think this
> would be compatible only with hash-based implementations.

I don't see why not. qp-trie can support binary keys, and Hou's approach
sounds very doable.
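
For illustration only, here is a minimal user-space sketch of the marshal
step in Hou's qp-trie description: a key with two dynptr-like fields is
flattened into one plain blob that a trie could store as a binary key.
Nothing here is from a patch. struct dynptr_sketch, struct map_key_sketch,
put_field() and marshal_key() are hypothetical names, and the encoding
(a 4-byte big-endian length before each field) is just one assumed choice
that keeps ("ab", "c") distinct from ("a", "bc").

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the "size and a pointer" struct. */
struct dynptr_sketch {
	const void *ptr;
	uint32_t size;
};

/* Hypothetical key with two variable-sized fields. */
struct map_key_sketch {
	struct dynptr_sketch file_name;
	struct dynptr_sketch dir;
};

/*
 * Append one field to buf as <u32 big-endian length><bytes>.
 * Returns the new offset, or 0 if the field does not fit in buf.
 */
static size_t put_field(uint8_t *buf, size_t cap, size_t off,
			const struct dynptr_sketch *f)
{
	if (off > cap || cap - off < 4 || cap - off - 4 < f->size)
		return 0;

	buf[off + 0] = (uint8_t)(f->size >> 24);
	buf[off + 1] = (uint8_t)(f->size >> 16);
	buf[off + 2] = (uint8_t)(f->size >> 8);
	buf[off + 3] = (uint8_t)(f->size);
	if (f->size)
		memcpy(buf + off + 4, f->ptr, f->size);

	return off + 4 + f->size;
}

/*
 * Flatten the whole key into buf, field by field, in declaration order.
 * Returns the blob length, or 0 if buf is too small.
 */
static size_t marshal_key(const struct map_key_sketch *key,
			  uint8_t *buf, size_t cap)
{
	size_t off = put_field(buf, cap, 0, &key->file_name);

	if (!off)
		return 0;
	return put_field(buf, cap, off, &key->dir);
}
```

The length prefix is what keeps the blob unambiguous. Without it, two
different keys whose fields concatenate to the same bytes would collide.
A real implementation would also need the reverse (unmarshal) step for
get_next_key, as Hou notes.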