On Mon, Jan 29, 2024 at 10:39 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > >> We can see how those limitations (calling sch_tree_lock() and returning a ptr) > >> can be addressed in bpf. This will also help other similar use cases. > >> > > > > For kptr, I wonder if we can support the following semantics in bpf if > > they make sense: > > I think they are useful but they are not fully supported now. > > Some thoughts below. > > > 1. Passing a referenced kptr into a bpf program, which will also need > > to be released, or exchanged into maps or allocated objects. > > "enqueue" should be the one considering here: > > struct Qdisc_ops { > /* ... */ > int (*enqueue)(struct sk_buff *skb, > struct Qdisc *sch, > struct sk_buff **to_free); > > }; > > The verifier only marks the skb as a trusted kptr but does not mark its > reg->ref_obj_id. Take a look at btf_ctx_access(). In particular: > > if (prog_args_trusted(prog)) > info->reg_type |= PTR_TRUSTED; > > The verifier does not know the skb ownership is passed into the ".enqueue" ops > and does not know the bpf prog needs to release it or store it in a map. > > The verifier tracks the reference state when a KF_ACQUIRE kfunc is called (just > an example, not saying we need to use KF_ACQUIRE kfunc). Take a look at > acquire_reference_state() which is the useful one here. > > Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier can > always acquire_reference_state() for the "struct sk_buff *skb" argument. > > Take a look at a recent RFC: > https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@xxxxxxxxx/ > which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC patch > is tagging the argument could be NULL by appending "__nullable" to the argument > name. The verifier will enforce that the bpf prog must check for NULL first. > > The similar idea can be used here but with a different tagging (for example, > "__must_release", admittedly not a good name). While the RFC patch is > in-progress, for now, may be hardcode for the ".enqueue" ops in > check_struct_ops_btf_id() and always acquire_reference_state() for the skb. This > part can be adjusted later once the RFC patch will be in shape. > Make sense. One more thing to consider here is that .enqueue is actually a reference acquiring and releasing function at the same time. Assuming ctx written to by a struct_ops program can be seen by the kernel, another new tag for the "to_free" argument will still be needed so that the verifier can recognize when writing skb to "to_free". > > Then one more thing is to track when the struct_ops bpf prog is actually reading > the value of the skb pointer. One thing is worth to mention here, e.g. a > struct_ops prog for enqueue: > > SEC("struct_ops") > int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch, > struct sk_buff **to_free) > { > return bpf_qdisc_drop(skb, sch, to_free); > } > > Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to an array > of __u64 as the only argument. The skb is actually in ctx[0], sch is in > ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. r1 = ctx[0]), > btf_ctx_access() marks the reg_type to PTR_TRUSTED. It needs to also initialize > the reg->ref_obj_id by the id obtained earlier from acquire_reference_state() > during check_struct_ops_btf_id() somehow. > > > > 2. Returning a kptr from a program and treating it as releasing the reference. > > e.g. for dequeue: > > struct Qdisc_ops { > /* ... */ > struct sk_buff * (*dequeue)(struct Qdisc *); > }; > > > Right now the verifier should complain on check_reference_leak() if the > struct_ops bpf prog is returning a referenced kptr. > > Unlike an argument, the return type of a function does not have a name to tag. > It is the first case that a struct_ops bpf_prog returning a pointer. One idea is > to assume it must be a trusted pointer (PTR_TRUSTED) and the verifier should > check it is indeed with PTR_TRUSTED flag. > > May be release_reference_state() can be called to assume the kernel will release > it as long as the return pointer type is PTR_TRUSTED and the type matches the > return type of the ops. Take a look at check_return_code().