Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Thu, 25 Jan 2024 18:22:31 -0800

On 1/23/24 9:22 PM, Amery Hung wrote:
I looked at the high level of the patchset. The major ops that it wants to be
programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in
patch 4 and patch 5).

This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for
".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the
uapi. It is no long an acceptable way to add new bpf extension.

Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented
in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops
can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.

Partially. If using struct_ops, I think we'll need another structure
like the following in bpf qdisc to be implemented with struct_ops bpf:

struct bpf_qdisc_ops {
     int (*enqueue) (struct sk_buff *skb)
     void (*dequeue) (void)
     void (*init) (void)
     void (*reset) (void)
};

Then, Qdisc_ops will wrap around them to handle things that cannot be
implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).

We can see how those limitations (calling sch_tree_lock() and returning a ptr) 
can be addressed in bpf. This will also help other similar use cases.

Other than sch_tree_lock and returning a ptr from a bpf prog. What else do you 
see that blocks directly implementing the enqueue/dequeue/init/reset in the 
struct Qdisc_ops?

Have you thought above ".priv_size"? It is now fixed to sizeof(struct 
bpf_sched_data). It should be useful to allow the bpf prog to store its own data 
there?

If other ops (like ".dump", ".dump_stats"...) do not have good use case to be
programmable in bpf, it can stay with the kernel implementation for now and only
allows the userspace to load the a bpf Qdisc_ops with .equeue/dequeue/init/reset
implemented.

You mentioned in the cover letter that:
"Current struct_ops attachment model does not seem to support replacing only
functions of a specific instance of a module, but I might be wrong."

I assumed you meant allow bpf to replace only "some" ops of the Qdisc_ops? Yes,
it can be done through the struct_ops's ".init_member". Take a look at
bpf_tcp_ca_init_member. The kernel can assign the kernel implementation for
".dump" (for example) when loading the bpf Qdisc_ops.

I have no problem with partially replacing a struct, which like you
mentioned has been demonstrated by congestion control or sched_ext.
What I am not sure about is the ability to create multiple bpf qdiscs,
where each has different ".enqueue", ".dequeue", and so on. I like the
struct_ops approach and would love to try it if struct_ops support
this.

The need for allowing different ".enqueue/.dequeue/..." bpf 
(BPF_PROG_TYPE_QDISC) programs loaded into different qdisc instances is because 
there is only one ".id == bpf" Qdisc_ops native kernel implementation which is 
then because of the limitation you mentioned above?

Am I understanding your reason correctly on why it requires to load different 
bpf prog for different qdisc instances?

If the ".enqueue/.dequeue/..." in the "struct Qdisc_ops" can be directly 
implemented in bpf prog itself, it can just load another bpf struct_ops which 
has a different ".enqueue/.dequeue/..." implementation:

#> bpftool struct_ops register bpf_simple_fq_v1.bpf.o
#> bpftool struct_ops register bpf_simple_fq_v2.bpf.o
#> bpftool struct_ops register bpf_simple_fq_xyz.bpf.o

From reading the test bpf prog, I think the set is on a good footing. Instead 
of working around the limitation by wrapping the bpf prog in a predefined 
"struct Qdisc_ops sch_bpf_qdisc_ops", lets first understand what is missing in 
bpf and see how we could address them.