Re: [PATCH bpf-next v2 00/11] Enable BPF programs to declare arrays of kptr, bpf_rb_root, and bpf_list_head.

Kui-Feng Lee <sinquersw@xxxxxxxxx> · Fri, 19 Apr 2024 11:36:09 -0700

On 4/18/24 07:53, Alexei Starovoitov wrote:
On Wed, Apr 17, 2024 at 11:07 PM Kui-Feng Lee <sinquersw@xxxxxxxxx> wrote:



On 4/17/24 22:11, Alexei Starovoitov wrote:
On Wed, Apr 17, 2024 at 9:31 PM Kui-Feng Lee <sinquersw@xxxxxxxxx> wrote:



On 4/17/24 20:30, Alexei Starovoitov wrote:
On Fri, Apr 12, 2024 at 2:08 PM Kui-Feng Lee <thinker.li@xxxxxxxxx> wrote:

The arrays of kptr, bpf_rb_root, and bpf_list_head didn't work as
global variables. This was due to these types being initialized and
verified in a special manner in the kernel. This patchset allows BPF
programs to declare arrays of kptr, bpf_rb_root, and bpf_list_head in
the global namespace.

The main change is to add "nelems" to btf_fields. The value of
"nelems" represents the number of elements in the array if a btf_field
represents an array. Otherwise, "nelem" will be 1. The verifier
verifies these types based on the information provided by the
btf_field.

The value of "size" will be the size of the entire array if a
btf_field represents an array. Dividing "size" by "nelems" gives the
size of an element. The value of "offset" will be the offset of the
beginning for an array. By putting this together, we can determine the
offset of each element in an array. For example,

       struct bpf_cpumask __kptr * global_mask_array[2];

Looks like this patch set enables arrays only.
Meaning the following is supported already:

+private(C) struct bpf_spin_lock glock_c;
+private(C) struct bpf_list_head ghead_array1 __contains(foo, node2);
+private(C) struct bpf_list_head ghead_array2 __contains(foo, node2);

while this support is added:

+private(C) struct bpf_spin_lock glock_c;
+private(C) struct bpf_list_head ghead_array1[3] __contains(foo, node2);
+private(C) struct bpf_list_head ghead_array2[2] __contains(foo, node2);

Am I right?

What about the case when bpf_list_head is wrapped in a struct?
private(C) struct foo {
     struct bpf_list_head ghead;
} ghead;

that's not enabled in this patch. I think.

And the following:
private(C) struct foo {
     struct bpf_list_head ghead;
} ghead[2];


or

private(C) struct foo {
     struct bpf_list_head ghead[2];
} ghead;

Won't work either.

No, they don't work.
We had a discussion about this in the other day.
I proposed to have another patch set to work on struct types.
Do you prefer to handle it in this patch set?


I think eventually we want to support all such combinations and
the approach proposed in this patch with 'nelems'
won't work for wrapper structs.

I think it's better to unroll/flatten all structs and arrays
and represent them as individual elements in the flattened
structure. Then there will be no need to special case array with 'nelems'.
All special BTF types will be individual elements with unique offset.

Does this make sense?

That means it will creates 10 btf_field(s) for an array having 10
elements. The purpose of adding "nelems" is to avoid the repetition. Do
you prefer to expand them?

It's not just expansion, but a common way to handle nested structs too.

I suspect by delaying nested into another patchset this approach
will become useless.

So try adding nested structs in all combinations as a follow up and
I suspect you're realize that "nelems" approach doesn't really help.
You'd need to flatten them all.
And once you do there is no need for "nelems".

For me, "nelems" is more like a choice of avoiding repetition of
information, not a necessary. Before adding "nelems", I had considered
to expand them as well. But, eventually, I chose to add "nelems".

Since you think this repetition is not a problem, I will expand array as
individual elements.

You don't sound convinced :)
Please add support for nested structs on top of your "nelems" approach
and prototype the same without "nelems" and let's compare the two.


I have an implementation following with "nelems".
The basic idea is to introduce field type BPF_REPEAT_FIELDS to repeat
some btf_field immediately before if necessary.

For example,

  struct foo {
    struct bpf_cpumask __kptr *a;
    struct bpf_cpumask __kptr *b;
  };

  struct foo fooptrs[10];

it will create two btf_fields for a & b, like

  [kptr_a, kptr_b]

However, fooptrs is any array of size 10. It will create another field
of BPF_REPEAT_FIELDS to repeat two fields immediate before for 9 times.

  [kptr_a, kptr_b, repeat_fields(nelems=9, repeated_cnt=2)]

The size of the repeat_fields with be the size of an element times 9,
and offset of the  repeat_fields will be the offset of &fooptrs[1].

Even struct foo is in another struct nested, it would still create the
same/or similar result. For example,

  struct foo_deep {
    int dummy;
    struct foo inner;
  };

  struct foo_deep deep_ptrs[10];

it will create the similar fields with different offsets.

  [kptr_a, kptr_b, repeated_fields(nelems=9, repeated_cnt=2)]

What if nested with array?

  struct foo_deep_arr {
    int dummy;
    struct foo inner[4];
  }

it will create fields like,

  [kptr_a, kptr_b, repeated_fields(nelems=3, repeated_cnt=2),
   repeated_fields(nelems=9, repeated_cnt=3)]

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b25dd498b737..bfd31d2a9770 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -202,6 +202,7 @@ enum btf_field_type {
 	BPF_GRAPH_NODE = BPF_RB_NODE | BPF_LIST_NODE,
 	BPF_GRAPH_ROOT = BPF_RB_ROOT | BPF_LIST_HEAD,
 	BPF_REFCOUNT   = (1 << 9),
+	BPF_REPEAT_FIELDS = (1 << 10),
 };

 typedef void (*btf_dtor_kfunc_t)(void *);
@@ -231,6 +232,7 @@ struct btf_field {
 	union {
 		struct btf_field_kptr kptr;
 		struct btf_field_graph_root graph_root;
+		u32 repeated_cnt;
 	};
 };

@@ -489,6 +491,21 @@ static inline void bpf_obj_memcpy(struct btf_record 
*rec,
 		u32 next_off = rec->fields[i].offset;
 		u32 sz = next_off - curr_off;

+		if (rec->fields[i].type == BPF_REPEAT_FIELDS) {
+			int cnt = rec->fields[i].repeated_cnt;
+			int elem_size = rec->fields[i].size / rec->fields[i].nelems;
+			int j, k;
+			for (j = 0; j < rec->fields[i].nelems; j++) {
+				for (k = i - cnt; k < i; k++) {
+					/* Use repated fields to copy */
+					next_off = rec->fields[k].offset + elem_size + elem_size * j;
+					sz = next_off - curr_off;
+					memcpy(dst + curr_off, src + curr_off, sz);
+					curr_off += rec->fields[k].size + sz;
+				}
+			}
+			continue;
+		}
 		memcpy(dst + curr_off, src + curr_off, sz);
 		curr_off += rec->fields[i].size + sz;
 	}
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index ae17d3996843..b8acec702557 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3298,6 +3298,10 @@ struct btf_field_info {
 			const char *node_name;
 			u32 value_btf_id;
 		} graph_root;
+		struct {
+			u32 cnt;
+			u32 elem_size;
+		} repeat;
 	};
 };

@@ -3485,6 +3489,39 @@ static int btf_get_field_type(const char *name, 
u32 field_mask, u32 *seen_mask,

 #undef field_mask_test_name

+static int btf_find_struct_field(const struct btf *btf,
+				 const struct btf_type *t, u32 field_mask,
+				 struct btf_field_info *info, int info_cnt);
+
+static int btf_find_nested_struct(const struct btf *btf, const struct 
btf_type *t,
+				  u32 off, u32 nelems,
+				  u32 field_mask, struct btf_field_info *info,
+				  int info_cnt)
+{
+	int ret, i;
+
+	ret = btf_find_struct_field(btf, t, field_mask, info, info_cnt);
+
+	if (ret <= 0)
+		return ret;
+
+	for (i = 0; i < ret; i++)
+		info[i].off += off;
+
+	if (nelems > 1) {
+		if (ret >= info_cnt)
+			return -E2BIG;
+		info[ret].type = BPF_REPEAT_FIELDS;
+		info[ret].off = off + t->size;
+		info[ret].nelems = nelems - 1;
+		info[ret].repeat.cnt = ret;
+		info[ret].repeat.elem_size = t->size;
+		ret += 1;
+	}
+
+	return ret;
+}
+
 static int btf_find_struct_field(const struct btf *btf,
 				 const struct btf_type *t, u32 field_mask,
 				 struct btf_field_info *info, int info_cnt)
@@ -3497,9 +3534,26 @@ static int btf_find_struct_field(const struct btf 
*btf,
 	for_each_member(i, t, member) {
 		const struct btf_type *member_type = btf_type_by_id(btf,
 								    member->type);
+		const struct btf_array *array;
+		u32 j, nelems = 1;
+
+		/* Walk into array types to find the element type and the
+		 * number of elements in the (flattened) array.
+		 */
+		for (j = 0; j < MAX_RESOLVE_DEPTH && btf_type_is_array(member_type); 
j++) {
+			array = btf_array(member_type);
+			nelems *= array->nelems;
+			member_type = btf_type_by_id(btf, array->type);
+		}
+		if (nelems == 0)
+			continue;

-		field_type = btf_get_field_type(__btf_name_by_offset(btf, 
member_type->name_off),
-						field_mask, &seen_mask, &align, &sz);
+		if ((field_mask & BPF_REPEAT_FIELDS) &&
+		    __btf_type_is_struct(member_type))
+			field_type = BPF_REPEAT_FIELDS;
+		else
+			field_type = btf_get_field_type(__btf_name_by_offset(btf, 
member_type->name_off),
+							field_mask, &seen_mask, &align, &sz);
 		if (field_type == 0)
 			continue;
 		if (field_type < 0)
@@ -3541,6 +3595,15 @@ static int btf_find_struct_field(const struct btf 
*btf,
 			if (ret < 0)
 				return ret;
 			break;
+		case BPF_REPEAT_FIELDS:
+			ret = btf_find_nested_struct(btf, member_type, off, nelems, field_mask,
+						    &info[idx], info_cnt - idx);
+			if (ret < 0)
+				return ret;
+			idx += ret;
+			if (idx >= info_cnt)
+				return -E2BIG;
+			continue;
 		default:
 			return -EFAULT;
 		}
@@ -3549,7 +3612,7 @@ static int btf_find_struct_field(const struct btf 
*btf,
 			continue;
 		if (idx >= info_cnt)
 			return -E2BIG;
-		info[idx].nelems = 1;
+		info[idx].nelems = nelems;
 		++idx;
 	}
 	return idx;
@@ -3581,8 +3644,13 @@ static int btf_find_datasec_var(const struct btf 
*btf, const struct btf_type *t,
 		if (nelems == 0)
 			continue;

-		field_type = btf_get_field_type(__btf_name_by_offset(btf, 
var_type->name_off),
-						field_mask, &seen_mask, &align, &sz);
+		if ((field_mask & BPF_REPEAT_FIELDS) &&
+		    __btf_type_is_struct(var_type)) {
+			field_type = BPF_REPEAT_FIELDS;
+			sz = var_type->size;
+		} else
+			field_type = btf_get_field_type(__btf_name_by_offset(btf, 
var_type->name_off),
+							field_mask, &seen_mask, &align, &sz);
 		if (field_type == 0)
 			continue;
 		if (field_type < 0)
@@ -3624,6 +3692,15 @@ static int btf_find_datasec_var(const struct btf 
*btf, const struct btf_type *t,
 			if (ret < 0)
 				return ret;
 			break;
+		case BPF_REPEAT_FIELDS:
+			ret = btf_find_nested_struct(btf, var_type, off, nelems, field_mask,
+						     &info[idx], info_cnt - idx);
+			if (ret < 0)
+				return ret;
+			idx += ret;
+			if (idx >= info_cnt)
+				return -E2BIG;
+			continue;
 		default:
 			return -EFAULT;
 		}
@@ -3634,6 +3711,7 @@ static int btf_find_datasec_var(const struct btf 
*btf, const struct btf_type *t,
 			return -E2BIG;
 		info[idx++].nelems = nelems;
 	}
+
 	return idx;
 }

@@ -3835,19 +3913,24 @@ struct btf_record *btf_parse_fields(const struct 
btf *btf, const struct btf_type
 	rec->timer_off = -EINVAL;
 	rec->refcount_off = -EINVAL;
 	for (i = 0; i < cnt; i++) {
-		field_type_size = btf_field_type_size(info_arr[i].type) * 
info_arr[i].nelems;
+		if (info_arr[i].type == BPF_REPEAT_FIELDS)
+			field_type_size = info_arr[i].repeat.elem_size * info_arr[i].nelems;
+		else
+			field_type_size = btf_field_type_size(info_arr[i].type) * 
info_arr[i].nelems;
 		if (info_arr[i].off + field_type_size > value_size) {
 			WARN_ONCE(1, "verifier bug off %d size %d", info_arr[i].off, 
value_size);
 			ret = -EFAULT;
 			goto end;
 		}
-		if (info_arr[i].off < next_off) {
+		if (info_arr[i].off < next_off &&
+		    info_arr[i].type != BPF_REPEAT_FIELDS) {
 			ret = -EEXIST;
 			goto end;
 		}
 		next_off = info_arr[i].off + field_type_size;

-		rec->field_mask |= info_arr[i].type;
+		if (info_arr[i].type != BPF_REPEAT_FIELDS)
+			rec->field_mask |= info_arr[i].type;
 		rec->fields[i].offset = info_arr[i].off;
 		rec->fields[i].type = info_arr[i].type;
 		rec->fields[i].size = field_type_size;
@@ -3889,6 +3972,10 @@ struct btf_record *btf_parse_fields(const struct 
btf *btf, const struct btf_type
 		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
 			break;
+
+		case BPF_REPEAT_FIELDS:
+			rec->fields[i].repeated_cnt = info_arr[i].repeat.cnt;
+			break;
 		default:
 			ret = -EFAULT;
 			goto end;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1c8a9bc00d17..0effa1daf2ca 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -532,15 +532,27 @@ static int btf_field_cmp(const void *a, const void *b)
 struct btf_field *btf_record_find(const struct btf_record *rec, u32 
offset,
 				  u32 field_mask)
 {
+	const struct btf_field *fields;
 	struct btf_field *field;
 	struct btf_field key = {
 		.offset = offset,
 		.size = 0,	/* as a label for this key */
 	};
+	u32 cnt, elem_size;

 	if (IS_ERR_OR_NULL(rec) || !(rec->field_mask & field_mask))
 		return NULL;
-	field = bsearch(&key, rec->fields, rec->cnt, sizeof(rec->fields[0]), 
btf_field_cmp);
+	fields = rec->fields;
+	cnt = rec->cnt;
+	while ((field = bsearch(&key, fields, cnt, sizeof(rec->fields[0]), 
btf_field_cmp)) && field->type == BPF_REPEAT_FIELDS) {
+		cnt = field->repeated_cnt;
+		fields = field - cnt;
+		elem_size = field->size / field->nelems;
+		/* Redirect to the offset of repeated fields */
+		offset = offset - field->offset;
+		offset = field->offset + (offset % elem_size) - elem_size;
+		key.offset = offset;
+	}
 	if (!field || !(field->type & field_mask))
 		return NULL;
 	if ((offset - field->offset) % (field->size / field->nelems))
@@ -1106,7 +1118,7 @@ static int map_check_btf(struct bpf_map *map, 
struct bpf_token *token,

 	map->record = btf_parse_fields(btf, value_type,
 				       BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
-				       BPF_RB_ROOT | BPF_REFCOUNT,
+				       BPF_RB_ROOT | BPF_REFCOUNT | BPF_REPEAT_FIELDS,
 				       map->value_size);
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 67b89d4ea1ba..45b2da8a00d1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5416,6 +5416,7 @@ static int check_map_access(struct 
bpf_verifier_env *env, u32 regno,
 	struct bpf_reg_state *reg = &state->regs[regno];
 	struct bpf_map *map = reg->map_ptr;
 	struct btf_record *rec;
+	u32 cnt;
 	int err, i;

 	err = check_mem_region_access(env, regno, off, size, map->value_size,
@@ -5426,7 +5427,8 @@ static int check_map_access(struct 
bpf_verifier_env *env, u32 regno,
 	if (IS_ERR_OR_NULL(map->record))
 		return 0;
 	rec = map->record;
-	for (i = 0; i < rec->cnt; i++) {
+	cnt = rec->cnt;
+	for (i = 0; i < cnt; i++) {
 		struct btf_field *field = &rec->fields[i];
 		u32 p = field->offset, var_p, elem_size;

@@ -5461,6 +5463,18 @@ static int check_map_access(struct 
bpf_verifier_env *env, u32 regno,
 					return -EACCES;
 				}
 				break;
+			case BPF_REPEAT_FIELDS:
+				var_p = off + reg->var_off.value;
+				if (var_p < p || var_p >= p + field->size)
+					break;
+				elem_size = field->size / field->nelems;
+				/* Redirect to the offset of repeated
+				 * fields
+				 */
+				off = p + ((var_p - p) % elem_size) - reg->var_off.value - elem_size;
+				cnt = i;
+				i -= field->repeated_cnt + 1;
+				break;
 			default:
 				verbose(env, "%s cannot be accessed directly by load/store\n",
 					btf_field_type_name(field->type));
diff --git a/tools/testing/selftests/bpf/prog_tests/cpumask.c 
b/tools/testing/selftests/bpf/prog_tests/cpumask.c
index bba601e235f6..2570bd4b0cb2 100644
--- a/tools/testing/selftests/bpf/prog_tests/cpumask.c
+++ b/tools/testing/selftests/bpf/prog_tests/cpumask.c
@@ -21,6 +21,8 @@ static const char * const cpumask_success_testcases[] = {
 	"test_global_mask_array_one_rcu",
 	"test_global_mask_array_rcu",
 	"test_global_mask_array_l2_rcu",
+	"test_global_mask_nested_rcu",
+	"test_global_mask_nested_deep_rcu",
 	"test_cpumask_weight",
 };

diff --git a/tools/testing/selftests/bpf/progs/cpumask_success.c 
b/tools/testing/selftests/bpf/progs/cpumask_success.c
index 9d76d85680d7..0de6bc115f55 100644
--- a/tools/testing/selftests/bpf/progs/cpumask_success.c
+++ b/tools/testing/selftests/bpf/progs/cpumask_success.c
@@ -11,9 +11,21 @@
 char _license[] SEC("license") = "GPL";

 int pid, nr_cpus;
+struct kptr_nested {
+	struct bpf_cpumask __kptr * mask;
+};
+struct kptr_nested_mid {
+	int dummy;
+	struct kptr_nested m;
+};
+struct kptr_nested_deep {
+	struct kptr_nested_mid ptrs[2];
+};
 private(MASK) static struct bpf_cpumask __kptr * global_mask_array[2];
 private(MASK) static struct bpf_cpumask __kptr * 
global_mask_array_l2[2][1];
 private(MASK) static struct bpf_cpumask __kptr * global_mask_array_one[1];
+private(MASK) static struct kptr_nested global_mask_nested[2];
+private(MASK) static struct kptr_nested_deep global_mask_nested_deep;

 static bool is_test_task(void)
 {
@@ -553,6 +565,71 @@ int BPF_PROG(test_global_mask_array_rcu, struct 
task_struct *task, u64 clone_fla
 	return 0;
 }

+static int _global_mask_nested_rcu(struct bpf_cpumask **mask0,
+				       struct bpf_cpumask **mask1)
+{
+	struct bpf_cpumask *local;
+
+	if (!is_test_task())
+		return 0;
+
+	/* Check if two kptrs in the array work and independently */
+
+	local = create_cpumask();
+	if (!local)
+		return 0;
+
+	bpf_rcu_read_lock();
+
+	local = bpf_kptr_xchg(mask0, local);
+	if (local) {
+		err = 1;
+		goto err_exit;
+	}
+
+	/* [<mask 0>, NULL] */
+	if (!*mask0 || *mask1) {
+		err = 2;
+		goto err_exit;
+	}
+
+	local = create_cpumask();
+	if (!local) {
+		err = 9;
+		goto err_exit;
+	}
+
+	local = bpf_kptr_xchg(mask1, local);
+	if (local) {
+		err = 10;
+		goto err_exit;
+	}
+
+	/* [<mask 0>, <mask 1>] */
+	if (!*mask0 || !*mask1 || *mask0 == *mask1) {
+		err = 11;
+		goto err_exit;
+	}
+
+err_exit:
+	if (local)
+		bpf_cpumask_release(local);
+	bpf_rcu_read_unlock();
+	return 0;
+}
+
+SEC("tp_btf/task_newtask")
+int BPF_PROG(test_global_mask_nested_rcu, struct task_struct *task, u64 
clone_flags)
+{
+	return _global_mask_nested_rcu(&global_mask_nested[0].mask, 
&global_mask_nested[1].mask);
+}
+
+SEC("tp_btf/task_newtask")
+int BPF_PROG(test_global_mask_nested_deep_rcu, struct task_struct 
*task, u64 clone_flags)
+{
+	return 
_global_mask_nested_rcu(&global_mask_nested_deep.ptrs[0].m.mask, 
&global_mask_nested_deep.ptrs[1].m.mask);
+}
+
 SEC("tp_btf/task_newtask")
 int BPF_PROG(test_global_mask_array_l2_rcu, struct task_struct *task, 
u64 clone_flags)
 {