On Mon, Nov 7, 2022 at 3:10 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>
> Add the support on the map side to parse, recognize, verify, and build
> metadata table for a new special field of the type struct bpf_list_head.
> To parameterize the bpf_list_head for a certain value type and the
> list_node member it will accept in that value type, we use BTF
> declaration tags.
>
> The definition of bpf_list_head in a map value will be done as follows:
>
> struct foo {
>         struct bpf_list_node node;
>         int data;
> };
>
> struct map_value {
>         struct bpf_list_head head __contains(foo, node);
> };
>
> Then, the bpf_list_head only allows adding to the list 'head' using the
> bpf_list_node 'node' for the type struct foo.
>
> The 'contains' annotation is a BTF declaration tag composed of four
> parts, "contains:name:node" where the name is then used to look up the
> type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during
> the lookup. The node defines name of the member in this type that has
> the type struct bpf_list_node, which is actually used for linking into
> the linked list. For now, 'kind' part is hardcoded as struct.
>
> This allows building intrusive linked lists in BPF, using container_of
> to obtain pointer to entry, while being completely type safe from the
> perspective of the verifier. The verifier knows exactly the type of the
> nodes, and knows that list helpers return that type at some fixed offset
> where the bpf_list_node member used for this list exists. The verifier
> also uses this information to disallow adding types that are not
> accepted by a certain list.
>
> For now, no elements can be added to such lists. Support for that is
> coming in future patches, hence draining and freeing items is done with
> a TODO that will be resolved in a future patch.
>
> Note that the bpf_list_head_free function moves the list out to a local
> variable under the lock and releases it, doing the actual draining of
> the list items outside the lock. While this helps with not holding the
> lock for too long pessimizing other concurrent list operations, it is
> also necessary for deadlock prevention: unless every function called in
> the critical section would be notrace, a fentry/fexit program could
> attach and call bpf_map_update_elem again on the map, leading to the
> same lock being acquired if the key matches and lead to a deadlock.
> While this requires some special effort on part of the BPF programmer to
> trigger and is highly unlikely to occur in practice, it is always better
> if we can avoid such a condition.
>
> While notrace would prevent this, doing the draining outside the lock
> has advantages of its own, hence it is used to also fix the deadlock
> related problem.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> ---
>  include/linux/bpf.h            |  17 ++++
>  include/uapi/linux/bpf.h       |  10 +++
>  kernel/bpf/btf.c               | 143 ++++++++++++++++++++++++++++++++-
>  kernel/bpf/helpers.c           |  32 ++++++++
>  kernel/bpf/syscall.c           |  22 ++++-
>  kernel/bpf/verifier.c          |   7 ++
>  tools/include/uapi/linux/bpf.h |  10 +++
>  7 files changed, 237 insertions(+), 4 deletions(-)
>
[...]
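
Side note, mostly for context: I'm assuming the __contains() annotation
in the example above is just a thin macro around a BTF declaration tag,
roughly along these lines (a sketch; the exact macro spelling/location
is my guess, e.g. a selftests helper header, not something quoted from
this patch):

  /* expands to the "contains:<type>:<member>" decl tag described above */
  #define __contains(name, node) \
          __attribute__((btf_decl_tag("contains:" #name ":" #node)))

so 'struct bpf_list_head head __contains(foo, node);' carries the string
"contains:foo:node" in BTF, which the map-side code then resolves to
'struct foo' and its 'node' member.
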
>  struct bpf_offload_dev;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 94659f6b3395..dd381086bad9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6887,6 +6887,16 @@ struct bpf_dynptr {
>         __u64 :64;
>  } __attribute__((aligned(8)));
>
> +struct bpf_list_head {
> +       __u64 :64;
> +       __u64 :64;
> +} __attribute__((aligned(8)));
> +
> +struct bpf_list_node {
> +       __u64 :64;
> +       __u64 :64;
> +} __attribute__((aligned(8)));

Dave mentioned that this `__u64 :64` trick makes vmlinux.h lose the
alignment information, as the struct itself is empty, and so there is
nothing indicating that it has to be 8-byte aligned.

So what if we have

struct bpf_list_node {
        __u64 __opaque[2];
} __attribute__((aligned(8)));

?

> +
>  struct bpf_sysctl {
>         __u32 write;    /* Sysctl is being read (= 0) or written (= 1).
>                          * Allows 1,2,4-byte read, but no write.
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
[...]
> @@ -3284,6 +3347,12 @@ static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
>                         goto end;
>                 }
>         }
> +       if (field_mask & BPF_LIST_HEAD) {
> +               if (!strcmp(name, "bpf_list_head")) {
> +                       type = BPF_LIST_HEAD;
> +                       goto end;
> +               }
> +       }
>         /* Only return BPF_KPTR when all other types with matchable names fail */
>         if (field_mask & BPF_KPTR) {
>                 type = BPF_KPTR_REF;
> @@ -3317,6 +3386,8 @@ static int btf_find_struct_field(const struct btf *btf,
>                         return field_type;
>
>                 off = __btf_member_bit_offset(t, member);
> +               if (i && !off)
> +                       return -EFAULT;

Why? Why can't my struct have a zero-sized field at the beginning?
This seems like a very incomplete and unnecessary check to me.

>                 if (off % 8)
>                         /* valid C code cannot generate such BTF */
>                         return -EINVAL;
> @@ -3339,6 +3410,12 @@ static int btf_find_struct_field(const struct btf *btf,
>                         if (ret < 0)
>                                 return ret;
>                         break;
> +               case BPF_LIST_HEAD:
> +                       ret = btf_find_list_head(btf, t, member_type, i, off, sz,
> +                                                idx < info_cnt ? &info[idx] : &tmp);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
>                 default:
>                         return -EFAULT;
>                 }
> @@ -3373,6 +3450,8 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>                         return field_type;
>
>                 off = vsi->offset;
> +               if (i && !off)
> +                       return -EFAULT;

Similarly, I'd say that either we need to calculate the exact expected
offset, or just not do anything here?

>                 if (vsi->size != sz)
>                         continue;
>                 if (off % align)
> @@ -3393,6 +3472,12 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>                         if (ret < 0)
>                                 return ret;
>                         break;
> +               case BPF_LIST_HEAD:
> +                       ret = btf_find_list_head(btf, var, var_type, -1, off, sz,
> +                                                idx < info_cnt ? &info[idx] : &tmp);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
>                 default:
>                         return -EFAULT;
>                 }
[...]
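
To make the earlier 'if (i && !off)' concern concrete, a made-up
example (not taken from this patch or the selftests):

  struct map_value {
          int pad[0];     /* zero-length array, a gcc/clang extension */
          struct bpf_list_head head __contains(foo, node);
  };

Unless I'm misreading btf_find_struct_field(), 'head' here is member
index 1 with offset 0, so this admittedly unusual but valid layout
would now be rejected with -EFAULT.
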