On Mon, Nov 7, 2022 at 3:10 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>
> Add the support on the map side to parse, recognize, verify, and build
> metadata table for a new special field of the type struct bpf_list_head.
> To parameterize the bpf_list_head for a certain value type and the
> list_node member it will accept in that value type, we use BTF
> declaration tags.
>
> The definition of bpf_list_head in a map value will be done as follows:
>
> struct foo {
>         struct bpf_list_node node;
>         int data;
> };
>
> struct map_value {
>         struct bpf_list_head head __contains(foo, node);
> };
>
> Then, the bpf_list_head only allows adding to the list 'head' using the
> bpf_list_node 'node' for the type struct foo.
>
> The 'contains' annotation is a BTF declaration tag composed of four
> parts, "contains:name:node" where the name is then used to look up the
> type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during
> the lookup. The node defines name of the member in this type that has
> the type struct bpf_list_node, which is actually used for linking into
> the linked list. For now, 'kind' part is hardcoded as struct.
>
> This allows building intrusive linked lists in BPF, using container_of
> to obtain pointer to entry, while being completely type safe from the
> perspective of the verifier. The verifier knows exactly the type of the
> nodes, and knows that list helpers return that type at some fixed offset
> where the bpf_list_node member used for this list exists. The verifier
> also uses this information to disallow adding types that are not
> accepted by a certain list.
>
> For now, no elements can be added to such lists. Support for that is
> coming in future patches, hence draining and freeing items is done with
> a TODO that will be resolved in a future patch.
>
> Note that the bpf_list_head_free function moves the list out to a local
> variable under the lock and releases it, doing the actual draining of
> the list items outside the lock. While this helps with not holding the
> lock for too long pessimizing other concurrent list operations, it is
> also necessary for deadlock prevention: unless every function called in
> the critical section would be notrace, a fentry/fexit program could
> attach and call bpf_map_update_elem again on the map, leading to the
> same lock being acquired if the key matches and lead to a deadlock.
> While this requires some special effort on part of the BPF programmer to
> trigger and is highly unlikely to occur in practice, it is always better
> if we can avoid such a condition.
>
> While notrace would prevent this, doing the draining outside the lock
> has advantages of its own, hence it is used to also fix the deadlock
> related problem.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> ---
>  include/linux/bpf.h            |  17 ++++
>  include/uapi/linux/bpf.h       |  10 +++
>  kernel/bpf/btf.c               | 143 ++++++++++++++++++++++++++++++++-
>  kernel/bpf/helpers.c           |  32 ++++++++
>  kernel/bpf/syscall.c           |  22 ++++-
>  kernel/bpf/verifier.c          |   7 ++
>  tools/include/uapi/linux/bpf.h |  10 +++
>  7 files changed, 237 insertions(+), 4 deletions(-)
>
[...]
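
Side note, mostly for context: I'm assuming the __contains() annotation
in the example above is just a thin macro around a BTF declaration tag,
roughly along these lines (a sketch; the exact macro spelling/location
is my guess, e.g. a selftests helper header, not something quoted from
this patch):

  /* expands to the "contains:<type>:<member>" decl tag described above */
  #define __contains(name, node) \
          __attribute__((btf_decl_tag("contains:" #name ":" #node)))

so 'struct bpf_list_head head __contains(foo, node);' carries the string
"contains:foo:node" in BTF, which the map-side code then resolves to
'struct foo' and its 'node' member.
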
>  struct bpf_offload_dev;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 94659f6b3395..dd381086bad9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6887,6 +6887,16 @@ struct bpf_dynptr {
>         __u64 :64;
>  } __attribute__((aligned(8)));
>
> +struct bpf_list_head {
> +       __u64 :64;
> +       __u64 :64;
> +} __attribute__((aligned(8)));
> +
> +struct bpf_list_node {
> +       __u64 :64;
> +       __u64 :64;
> +} __attribute__((aligned(8)));

Dave mentioned that this `__u64 :64` trick makes vmlinux.h lose the
alignment information, as the struct itself is empty, and so there is
nothing indicating that it has to be 8-byte aligned.

So what if we have

struct bpf_list_node {
        __u64 __opaque[2];
} __attribute__((aligned(8)));

?

> +
>  struct bpf_sysctl {
>         __u32 write;    /* Sysctl is being read (= 0) or written (= 1).
>                          * Allows 1,2,4-byte read, but no write.
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
[...]
> @@ -3284,6 +3347,12 @@ static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
>                         goto end;
>                 }
>         }
> +       if (field_mask & BPF_LIST_HEAD) {
> +               if (!strcmp(name, "bpf_list_head")) {
> +                       type = BPF_LIST_HEAD;
> +                       goto end;
> +               }
> +       }
>         /* Only return BPF_KPTR when all other types with matchable names fail */
>         if (field_mask & BPF_KPTR) {
>                 type = BPF_KPTR_REF;
> @@ -3317,6 +3386,8 @@ static int btf_find_struct_field(const struct btf *btf,
>                         return field_type;
>
>                 off = __btf_member_bit_offset(t, member);
> +               if (i && !off)
> +                       return -EFAULT;

Why? Why can't my struct have a zero-sized field at the beginning?
This seems like a very incomplete and unnecessary check to me.

>                 if (off % 8)
>                         /* valid C code cannot generate such BTF */
>                         return -EINVAL;
> @@ -3339,6 +3410,12 @@ static int btf_find_struct_field(const struct btf *btf,
>                         if (ret < 0)
>                                 return ret;
>                         break;
> +               case BPF_LIST_HEAD:
> +                       ret = btf_find_list_head(btf, t, member_type, i, off, sz,
> +                                                idx < info_cnt ? &info[idx] : &tmp);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
>                 default:
>                         return -EFAULT;
>                 }
> @@ -3373,6 +3450,8 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>                         return field_type;
>
>                 off = vsi->offset;
> +               if (i && !off)
> +                       return -EFAULT;

Similarly, I'd say that either we need to calculate the exact expected
offset, or just not do anything here?

>                 if (vsi->size != sz)
>                         continue;
>                 if (off % align)
> @@ -3393,6 +3472,12 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
>                         if (ret < 0)
>                                 return ret;
>                         break;
> +               case BPF_LIST_HEAD:
> +                       ret = btf_find_list_head(btf, var, var_type, -1, off, sz,
> +                                                idx < info_cnt ? &info[idx] : &tmp);
> +                       if (ret < 0)
> +                               return ret;
> +                       break;
>                 default:
>                         return -EFAULT;
>                 }
[...]
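
To make the earlier 'if (i && !off)' concern concrete, a made-up
example (not taken from this patch or the selftests):

  struct map_value {
          int pad[0];     /* zero-length array, a gcc/clang extension */
          struct bpf_list_head head __contains(foo, node);
  };

Unless I'm misreading btf_find_struct_field(), 'head' here is member
index 1 with offset 0, so this admittedly unusual but valid layout
would now be rejected with -EFAULT.
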