Re: [PATCH bpf-next v2 19/25] bpf: Introduce bpf_kptr_new

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 19 Oct 2022 18:11:30 -0700

On Wed, Oct 19, 2022 at 5:44 PM Kumar Kartikeya Dwivedi
<memxor@xxxxxxxxx> wrote:
>
> On Wed, Oct 19, 2022 at 10:01:21PM IST, Alexei Starovoitov wrote:
> > On Tue, Oct 18, 2022 at 10:58 PM Kumar Kartikeya Dwivedi
> > <memxor@xxxxxxxxx> wrote:
> > >
> > > On Wed, Oct 19, 2022 at 08:01:24AM IST, Alexei Starovoitov wrote:
> > > > On Thu, Oct 13, 2022 at 11:52:57AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > > +void *bpf_kptr_new_impl(u64 local_type_id__k, u64 flags, void *meta__ign)
> > > > > +{
> > > > > +   struct btf_struct_meta *meta = meta__ign;
> > > > > +   u64 size = local_type_id__k;
> > > > > +   void *p;
> > > > > +
> > > > > +   if (unlikely(flags || !bpf_global_ma_set))
> > > > > +           return NULL;
> > > >
> > > > Unused 'flags' looks weird in unstable api. Just drop it?
> > > > And keep it as:
> > > > void *bpf_kptr_new(u64 local_type_id__k, struct btf_struct_meta *meta__ign);
> > > >
> > > > and in bpf_experimental.h:
> > > >
> > > > extern void *bpf_kptr_new(__u64 local_type_id) __ksym;
> > > >
> > > > since __ign args are ignored during kfunc type match
> > > > the bpf progs can use it without #define.
> > > >
> > >
> > > It's ignored during check_kfunc_call, but libbpf doesn't ignore that. The
> > > prototypes will not be the same. I guess I'll have to teach it do that during
> > > type match, but IDK how you feel about that.
> >
> > libbpf does the full type match, really?
> > Could you point me to the code?
> >
>
> Not full type match, but the number of arguments must be same, so it won't allow
> having kfunc as:
>
> void *bpf_kptr_new(u64 local_type_id__k, struct btf_struct_meta *meta__ign);
>
> in the kernel and ksym declaration in the program as:
>
> extern void *bpf_kptr_new(__u64 local_type_id) __ksym;
>
> I get:
>
> libbpf: extern (func ksym) 'bpf_kptr_new_impl': func_proto [25] incompatible with kernel [60043]
>
> vlen of func_proto in kernel type is 2, for us it will be 1.

Ahh. Found it.
__bpf_core_types_are_compat() runs in both kernel and libbpf.
We could hack it for this case, but let's not complicate it.

> > > Otherwise unless you want people to manually pass something to the ignored
> > > argument, we have to hide it behind a macro.
> > >
> > > I actually like the macro on top, then I don't even pass the type ID but the
> > > type. But that's a personal preference, and I don't feel strongly about it.
> > >
> > > So in C one does malloc(sizeof(*p)), here we'll just write
> > > bpf_kptr_new(typeof(*p)). YMMV.
> >
> > bpf_kptr_new(typeof(*p)) is cleaner.
> >
>
> So if we're having a macro anyway, the thing above can be hidden behind it.

Indeed. Since #define is there let's not mess with
__bpf_core_types_are_compat().

> > > > > +   p = bpf_mem_alloc(&bpf_global_ma, size);
> > > > > +   if (!p)
> > > > > +           return NULL;
> > > > > +   if (meta)
> > > > > +           bpf_obj_init(meta->off_arr, p);
> > > >
> > > > I'm starting to dislike all that _arr and _tab suffixes in the verifier code base.
> > > > It reminds me of programming style where people tried to add types into
> > > > variable names. imo dropping _arr wouldn't be just fine.
> > >
> > > Ack, I'll do it in v3.
> > >
> > > Also, I'd like to invite people to please bikeshed a bit over the naming of the
> > > APIs, e.g. whether it should be bpf_kptr_drop vs bpf_kptr_delete.
> >
> > bpf_kptr_drop is more precise.
> > delete assumes instant free which is not the case here.
> >
> > How about
> > extern void *__bpf_obj_new(__u64 local_type_id) __ksym;
> > extern void bpf_obj_drop(void *obj) __ksym;
> > #define bpf_obj_new(t) \
> >  (t *)__bpf_obj_new(bpf_core_type_id_local(t));
> >
> > kptr means 'kernel pointer'.
> > Here we have program supplied object.
> > It feels 'obj' is better than 'kptr' in this context.
> >
>
> I agree, I'll rename it to bpf_obj_*.
>
> Also, that __bpf_obj_new doesn't work yet in clang [0] but I think we can rename
> it to that once clang is fixed.

argh. ok. let's keep bpf_obj_new_impl() for now.

>
>  [0]: https://reviews.llvm.org/D136041
>
> > > In the BPF list API, it's named bpf_list_del but it's actually distinct from how
> > > list_del in the kernel works. So it does make sense to give them a different
> > > name (like pop_front/pop_back and push_front/push_back)?
> > >
> > > Because even bpf_list_add takes bpf_list_head, in the kernel there's no
> > > distinction between node and head, so you can do list_add on a node as well, but
> > > it won't be possible with the kfunc (unless we overload the head argument to
> > > also work with nodes).
> > >
> > > Later we'll probably have to add bpf_list_node_add etc. that add before or after
> > > a node to make that work.
> > >
> > > The main question is whether it should closely resembly the linked list API in
> > > the kernel, or can it steer away considerably from that?
> >
> > If we do doubly linked list we should allow delete in
> > the middle with
> > bpf_list_del_any(head, node)
> >
> > and
> > bpf_list_pop_front/pop_back(head)
> >
> > bpf_list_add(node, head) would match kernel style,
> > but I think it's cleaner to have head as 1st arg.
> > In that sense new pop/push/_front/_back are cleaner.
> > And similar for rbtree.
> >
> > If we keep (node, head) and (rb_node, rb_root) order
> > we should keep kernel names.
>
> There's also tradeoffs in how various operations are done.
>
> Right now we have bpf_list_del and bpf_list_del_tail doing pop_front and
> pop_back.
>
> To replicate the same in kernel style API, you would do:
>
> struct foo *f = bpf_list_first_entry_or_null(head);
> if (!f) {}
> bpf_list_del(&f->node);
>
> Between those two calls, you might do bpf_list_last_entry -> bpf_list_del, the
> verifier is going to have a hard time proving the aliasing of the two nodes, so
> it will have to invalidate all pointers peeking into the list whenever something
> modifies it. You would still be able to access memory but cannot pass it to list
> ops anymore.

yeah. ugly.

>
> But I think we will pay this cost eventually anyway, people will add a peek
> operation and such analysis would have to be done then. So I think it's unlikely
> we can avoid this, and it might be better to make things more consistent and end
> up mirroring the kernel list APIs.

Let's keep the verifier simpler for now.
We can go with pop/push/_front/_back, since they are simpler
and later add kernel style accessor as well with swapped
node/head order.
One doesn't preclude the other.