Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Mon, 29 Jan 2024 22:39:43 -0800

On 1/26/24 5:17 PM, Amery Hung wrote:
On Thu, Jan 25, 2024 at 6:22 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:

On 1/23/24 9:22 PM, Amery Hung wrote:
I looked at the high level of the patchset. The major ops that it wants to be
programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in
patch 4 and patch 5).

This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for
".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the
uapi. It is no long an acceptable way to add new bpf extension.

Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented
in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops
can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.

Partially. If using struct_ops, I think we'll need another structure
like the following in bpf qdisc to be implemented with struct_ops bpf:

struct bpf_qdisc_ops {
      int (*enqueue) (struct sk_buff *skb)
      void (*dequeue) (void)
      void (*init) (void)
      void (*reset) (void)
};

Then, Qdisc_ops will wrap around them to handle things that cannot be
implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).

We can see how those limitations (calling sch_tree_lock() and returning a ptr)
can be addressed in bpf. This will also help other similar use cases.

For kptr, I wonder if we can support the following semantics in bpf if
they make sense:

I think they are useful but they are not fully supported now.

Some thoughts below.

1. Passing a referenced kptr into a bpf program, which will also need
to be released, or exchanged into maps or allocated objects.

"enqueue" should be the one considering here:

struct Qdisc_ops {
	/* ... */
	int                     (*enqueue)(struct sk_buff *skb,
					   struct Qdisc *sch,
					   struct sk_buff **to_free);

};

The verifier only marks the skb as a trusted kptr but does not mark its 
reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:

	if (prog_args_trusted(prog))
		info->reg_type |= PTR_TRUSTED;

The verifier does not know the skb ownership is passed into the ".enqueue" ops 
and does not know the bpf prog needs to release it or store it in a map.

The verifier tracks the reference state when a KF_ACQUIRE kfunc is called (just 
an example, not saying we need to use KF_ACQUIRE kfunc). Take a look at 
acquire_reference_state() which is the useful one here.

Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier can 
always acquire_reference_state() for the "struct sk_buff *skb" argument.

Take a look at a recent RFC: 
https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@xxxxxxxxx/
which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC patch 
is tagging the argument could be NULL by appending "__nullable" to the argument 
name. The verifier will enforce that the bpf prog must check for NULL first.

The similar idea can be used here but with a different tagging (for example, 
"__must_release", admittedly not a good name). While the RFC patch is 
in-progress, for now, may be hardcode for the ".enqueue" ops in 
check_struct_ops_btf_id() and always acquire_reference_state() for the skb. This 
part can be adjusted later once the RFC patch will be in shape.

Then one more thing is to track when the struct_ops bpf prog is actually reading 
the value of the skb pointer. One thing is worth to mention here, e.g. a 
struct_ops prog for enqueue:

SEC("struct_ops")
int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct sk_buff **to_free)
{
	return bpf_qdisc_drop(skb, sch, to_free);
}

Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to an array 
of __u64 as the only argument. The skb is actually in ctx[0], sch is in 
ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. r1 = ctx[0]), 
btf_ctx_access() marks the reg_type to PTR_TRUSTED. It needs to also initialize 
the reg->ref_obj_id by the id obtained earlier from acquire_reference_state() 
during check_struct_ops_btf_id() somehow.

2. Returning a kptr from a program and treating it as releasing the reference.

e.g. for dequeue:

struct Qdisc_ops {
	/* ... */
	struct sk_buff *        (*dequeue)(struct Qdisc *);
};

Right now the verifier should complain on check_reference_leak() if the 
struct_ops bpf prog is returning a referenced kptr.

Unlike an argument, the return type of a function does not have a name to tag. 
It is the first case that a struct_ops bpf_prog returning a pointer. One idea is 
to assume it must be a trusted pointer (PTR_TRUSTED) and the verifier should 
check it is indeed with PTR_TRUSTED flag.

May be release_reference_state() can be called to assume the kernel will release 
it as long as the return pointer type is PTR_TRUSTED and the type matches the 
return type of the ops. Take a look at check_return_code().