On 7/9/23 7:17 PM, Alexei Starovoitov wrote:
> On Fri, Jul 07, 2023 at 07:24:48PM +0200, Daniel Borkmann wrote:
>> +
>> +#define BPF_MPROG_KEEP 0
>> +#define BPF_MPROG_SWAP 1
>> +#define BPF_MPROG_FREE 2
>
> Please document how this is supposed to be used.
> Patch 2 is using BPF_MPROG_FREE in tcx_entry_needs_release(),
> whereas most of the code treats BPF_MPROG_SWAP and BPF_MPROG_FREE as equivalent.
> I can guess what it's for, but a comment would help.
Ok, sounds good, will add a comment for these codes.
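
Roughly along these lines, based on how the series uses them (just a
sketch, exact wording tbd):

  /* Codes handed back to the caller to signal what to do with the
   * entry after an update:
   *
   * BPF_MPROG_KEEP - keep using the active entry as-is, no swap
   *                  needed (e.g. in-place replacement of a single
   *                  fp->prog slot)
   * BPF_MPROG_SWAP - publish the updated inactive entry as the new
   *                  active one
   * BPF_MPROG_FREE - as SWAP, but no programs are left afterwards,
   *                  so the caller may also release the entry, cf.
   *                  tcx_entry_needs_release() in patch 2
   */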
>> +#define BPF_MPROG_MAX 64
>
> we've seen buggy user space attaching thousands of tc progs to the same netdev.
> I suspect the 64 limit will be hit much sooner than expected.
In that sense it's probably good that we'll bail out rather than drain
memory, as happened in the cls_bpf case you mention, iirc. As I mentioned,
given this is not uapi, we can adapt this further in the future if there
are cases where more than 64 progs need to be attached to a single device.
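
The attach path then simply rejects the request once the fixed-size array
is full, roughly like this (sketch; helper name and errno here are
illustrative, not verbatim from the patch):

  if (bpf_mprog_total(entry) >= BPF_MPROG_MAX) {
          ret = -ERANGE;
          goto out;
  }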
>> +#define bpf_mprog_foreach_tuple(entry, fp, cp, t)                        \
>> +        for (fp = &entry->fp_items[0], cp = &entry->parent->cp_items[0]; \
>> +             ({                                                          \
>> +                t.prog = READ_ONCE(fp->prog);                            \
>> +                t.link = cp->link;                                       \
>> +                t.prog;                                                  \
>> +             });                                                         \
>> +             fp++, cp++)
>> +
>> +#define bpf_mprog_foreach_prog(entry, fp, p)     \
>> +        for (fp = &entry->fp_items[0];           \
>> +             (p = READ_ONCE(fp->prog));          \
>> +             fp++)
>
> I have similar questions to Stanislav.
The READ_ONCE()/WRITE_ONCE() is for the replacement case where we don't
need to swap the whole array, so I annotated all accesses to fp->prog.
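
In other words, the in-place replacement case boils down to a single
annotated store, so a concurrent reader walking the array sees either the
old or the new prog, never a torn pointer (sketch):

  /* writer side, serialized by the caller's lock (e.g. rtnl for tcx): */
  WRITE_ONCE(fp->prog, nprog);

  /* reader side, as in bpf_mprog_foreach_prog() above: */
  p = READ_ONCE(fp->prog);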
> Looks like update/delete/query of bpf_prog should be protected by an external lock
> (like RTNL in case of tcx),
Correct, for tcx it's rtnl; other users also either have to piggyback on
existing locking or need their own.
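
So callers end up with the usual pattern, sketched here with a
hypothetical update helper:

  rtnl_lock();        /* serializes attach/detach/query for this dev */
  ret = mprog_update(dev, prog);        /* hypothetical helper */
  rtnl_unlock();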
> but what are the lifetime rules for 'entry'?
> Looking at sch_handle_ingress() in patch 2:
>
>   struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress);
>
> I suspect the assumption is that the bpf_mprog_entry object should be accessed
> within an RCU critical section. Since tc/tcx and XDP run in napi, we have RCU
> protection there.
Correct.
> In the future, for cgroups, bpf_prog_run_array_cg() will keep an explicit
> rcu_read_lock() before accessing bpf_mprog_entry, right?
> And bpf_mprog_commit() assumes that RCU protection.
Both yes.
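
So the lifetime rules boil down to the generic RCU publish/subscribe
pattern one would expect here (sketch; helper names are illustrative):

  /* reader, in napi context or under an explicit rcu_read_lock(): */
  rcu_read_lock();
  entry = rcu_dereference(dev->tcx_ingress);
  if (entry)
          run_entry_progs(entry);       /* hypothetical */
  rcu_read_unlock();

  /* writer, under the outer lock: publish the new entry first, then
   * wait out a grace period before the old one may be reused or freed:
   */
  rcu_assign_pointer(dev->tcx_ingress, new_entry);
  synchronize_rcu();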
> All fine, but we need to document that the mprog mechanism is not suitable for
> sleepable progs.
Ok, I'll add a comment.
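
Something along these lines at the top of the header (sketch):

  /* Note: bpf_mprog is RCU-based by design. A bpf_mprog_entry must only
   * be consumed under RCU protection, e.g. from napi context or inside
   * an explicit rcu_read_lock() section. This also means the mechanism
   * is not suitable for sleepable BPF programs.
   */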
>> +        if (flags & BPF_F_BEFORE) {
>> +                tidx = bpf_mprog_pos_before(entry, &rtuple);
>> +                if (tidx < -1 || (idx >= -1 && tidx != idx)) {
>> +                        ret = tidx < -1 ? tidx : -EDOM;
>> +                        goto out;
>> +                }
>> +                idx = tidx;
>> +        }
>> +        if (flags & BPF_F_AFTER) {
>> +                tidx = bpf_mprog_pos_after(entry, &rtuple);
>> +                if (tidx < 0 || (idx >= -1 && tidx != idx)) {
> tidx < 0 vs tidx < -1 for _after vs _before.
> Does it have to have this subtle difference?
> Can _after and _before have the same semantics for the return value?
Yes, they can. In 'after', tidx will never be -1, but I can adapt the
condition so it's the same for both to avoid confusion.
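
E.g. the 'after' branch could then mirror the 'before' one (sketch):

  if (flags & BPF_F_AFTER) {
          tidx = bpf_mprog_pos_after(entry, &rtuple);
          /* tidx can never be -1 here, so this is equivalent to the
           * tidx < 0 check, but matches the 'before' branch:
           */
          if (tidx < -1 || (idx >= -1 && tidx != idx)) {
                  ret = tidx < -1 ? tidx : -EDOM;
                  goto out;
          }
          idx = tidx;
  }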
>> +                        ret = tidx < 0 ? tidx : -EDOM;
>> +                        goto out;
>> +                }
>> +                idx = tidx;
>> +        }