[PATCH RFC bpf-next] bpf: add __ctx_ref kfunc argument suffix

Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx> · Tue, 3 Dec 2024 21:00:44 +0100

The use case is when user wants to use sk_buff pointers as kptrs against
program types that take in __sk_buff struct as context.

A pair of kfuncs for acquiring and releasing skb would look as follows:

__bpf_kfunc struct sk_buff *bpf_skb_acquire(struct sk_buff *skb__ctx_ref)
__bpf_kfunc void bpf_skb_release(struct sk_buff *skb__ctx_ref)

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx>
---
 kernel/bpf/verifier.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 60938b136365..b16a39d28f8a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11303,6 +11303,11 @@ static bool is_kfunc_arg_const_str(const struct btf *btf, const struct btf_param
 	return btf_param_match_suffix(btf, arg, "__str");
 }
 
+static bool is_kfunc_arg_ctx_ref(const struct btf *btf, const struct btf_param *arg)
+{
+	return btf_param_match_suffix(btf, arg, "__ctx_ref");
+}
+
 static bool is_kfunc_arg_scalar_with_name(const struct btf *btf,
 					  const struct btf_param *arg,
 					  const char *name)
@@ -11599,6 +11604,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 	if (meta->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx])
 		return KF_ARG_PTR_TO_CTX;
 
+	if (is_kfunc_arg_ctx_ref(meta->btf, &args[argno]))
+		return KF_ARG_PTR_TO_BTF_ID;
+
 	/* In this function, we verify the kfunc's BTF as per the argument type,
 	 * leaving the rest of the verification with respect to the register
 	 * type to our caller. When a set of conditions hold in the BTF type of
-- 
2.34.1

Then below example progs store the skb on TC tx and retrieve it later on
when TC rx hook is called. This is a ping-pong of packets between veth
pair interfaces. We start with Tx on interface with progs attached and
then receive the packets back. I have stripped the other logic so that we
focus only on kptr storage:

-------------------------------8<-------------------------------

struct sk_buff *bpf_skb_acquire(struct sk_buff *skb) __ksym;
void bpf_skb_release(struct sk_buff *skb) __ksym;
void *bpf_cast_to_kern_ctx(void *) __ksym;

u32 key = 0;
u32 get_key = 0;

struct map_value {
	struct sk_buff __kptr * skb;
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __type(key, int);
    __type(value, struct map_value);
    __uint(max_entries, 16);
} skb_map SEC(".maps");

SEC("tc") int bpf_tc_ingress(struct __sk_buff *skb)
{
	struct map_value *map_entry;
	struct sk_buff *tmp = NULL;

	map_entry = bpf_map_lookup_elem(&skb_map, &get_key);
	if (!map_entry) {
		bpf_printk("no entry at get key %d\n", get_key);
		return TC_ACT_SHOT;
	}

	tmp = bpf_kptr_xchg(&map_entry->skb, tmp);
	if (!tmp)
		return TC_ACT_UNSPEC;

	bpf_printk("retrieved skb %p from index %d\n", tmp, get_key);

	get_key++;
	if (get_key >= 16)
		get_key = 0;

	bpf_skb_release(tmp);

	return TC_ACT_OK;
}

SEC("tc") int bpf_tc_egress(struct __sk_buff *ctx)
{
	struct sk_buff *skbk = bpf_cast_to_kern_ctx(ctx);
	struct map_value *map_entry;
	struct map_value tmp_entry;
	struct sk_buff *tmp;

	tmp_entry.skb = NULL;
	bpf_map_update_elem(&skb_map, &key, &tmp_entry, BPF_ANY);
	map_entry = bpf_map_lookup_elem(&skb_map, &key);
	if (!map_entry)
		return TC_ACT_SHOT;

	skbk = bpf_skb_acquire(skbk);
	if (!skbk)
		return TC_ACT_SHOT;

	tmp = bpf_kptr_xchg(&map_entry->skb, skbk);
	if (tmp)
		bpf_skb_release(tmp);
	bpf_printk("stored skb %p at index %d\n", skbk, key);

	key++;
	if (key >= 16)
		key = 0;

	return TC_ACT_OK;
}
char LICENSE[] SEC("license") = "GPL";

------------------------------->8-------------------------------

> 
> > I tried to simplify the use case that customer has, but I am a bit worried
> > that it might only confuse people more :/ however, here it is:
> 
> No. not at all. I suspect the use case has some similarity to the
> net-timestamp patches (https://lore.kernel.org/bpf/20241028110535.82999-1-kerneljasonxing@xxxxxxxxx/)
> which uses a skb tskey to associate/co-relate different timestamp.

I was aware of this work and I will give this a deep look, thanks.

> 
> > On TC egress hook skb is stored in a map - reason for picking it over the
> > linked list or rbtree is that we want to be able to access skbs via some index,
> > say a hash. This is where we bump the skb's refcount via acquire kfunc.
> > 
> > During TC ingress hook on the same interface, the skb that was previously
> > stored in map is retrieved, current skb that resides in the context of
> > hook carries the timestamp via metadata. We then use the retrieved skb and
> > tstamp from metadata on skb_tstamp_tx() (another kfunc) and finally
> > decrement skb's refcount via release kfunc.
> > 
> > 
> > Anyways, since we are able to do similar operations on task_struct
> > (holding it in map via kptr), I don't see a reason why wouldn't we allow
> > ourselves to do it on sk_buffs, no?
> 
> skb holds other things like dev and dst, like someone may be trying to
> remove the netdevice and route...etc. Overall, yes, the skb refcnt will
> eventually be decremented when the map is freed like other kptr (e.g. task)
> do.
>