Re: [PATCH bpf-next v3 03/13] bpf: Allow storing unreferenced kptr in map

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 22 Mar 2022 11:52:20 -0700

On Tue, Mar 22, 2022 at 12:16 AM Kumar Kartikeya Dwivedi
<memxor@xxxxxxxxx> wrote:
>
> On Tue, Mar 22, 2022 at 11:15:42AM IST, Andrii Nakryiko wrote:
> > On Sun, Mar 20, 2022 at 8:55 AM Kumar Kartikeya Dwivedi
> > <memxor@xxxxxxxxx> wrote:
> > >
> > > This commit introduces a new pointer type 'kptr' which can be embedded
> > > in a map value as holds a PTR_TO_BTF_ID stored by a BPF program during
> > > its invocation. Storing to such a kptr, BPF program's PTR_TO_BTF_ID
> > > register must have the same type as in the map value's BTF, and loading
> > > a kptr marks the destination register as PTR_TO_BTF_ID with the correct
> > > kernel BTF and BTF ID.
> > >
> > > Such kptr are unreferenced, i.e. by the time another invocation of the
> > > BPF program loads this pointer, the object which the pointer points to
> > > may not longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are
> > > patched to PROBE_MEM loads by the verifier, it would safe to allow user
> > > to still access such invalid pointer, but passing such pointers into
> > > BPF helpers and kfuncs should not be permitted. A future patch in this
> > > series will close this gap.
> > >
> > > The flexibility offered by allowing programs to dereference such invalid
> > > pointers while being safe at runtime frees the verifier from doing
> > > complex lifetime tracking. As long as the user may ensure that the
> > > object remains valid, it can ensure data read by it from the kernel
> > > object is valid.
> > >
> > > The user indicates that a certain pointer must be treated as kptr
> > > capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using
> > > a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this
> > > information is recorded in the object BTF which will be passed into the
> > > kernel by way of map's BTF information. The name and kind from the map
> > > value BTF is used to look up the in-kernel type, and the actual BTF and
> > > BTF ID is recorded in the map struct in a new kptr_off_tab member. For
> > > now, only storing pointers to structs is permitted.
> > >
> > > An example of this specification is shown below:
> > >
> > >         #define __kptr __attribute__((btf_type_tag("kptr")))
> > >
> > >         struct map_value {
> > >                 ...
> > >                 struct task_struct __kptr *task;
> > >                 ...
> > >         };
> > >
> > > Then, in a BPF program, user may store PTR_TO_BTF_ID with the type
> > > task_struct into the map, and then load it later.
> > >
> > > Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as
> > > the verifier cannot know whether the value is NULL or not statically, it
> > > must treat all potential loads at that map value offset as loading a
> > > possibly NULL pointer.
> > >
> > > Only BPF_LDX, BPF_STX, and BPF_ST with insn->imm = 0 (to denote NULL)
> > > are allowed instructions that can access such a pointer. On BPF_LDX, the
> > > destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX,
> > > it is checked whether the source register type is a PTR_TO_BTF_ID with
> > > same BTF type as specified in the map BTF. The access size must always
> > > be BPF_DW.
> > >
> > > For the map in map support, the kptr_off_tab for outer map is copied
> > > from the inner map's kptr_off_tab. It was chosen to do a deep copy
> > > instead of introducing a refcount to kptr_off_tab, because the copy only
> > > needs to be done when paramterizing using inner_map_fd in the map in map
> > > case, hence would be unnecessary for all other users.
> > >
> > > It is not permitted to use MAP_FREEZE command and mmap for BPF map
> > > having kptr, similar to the bpf_timer case.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> > > ---
> > >  include/linux/bpf.h     |  29 +++++++-
> > >  include/linux/btf.h     |   2 +
> > >  kernel/bpf/btf.c        | 161 ++++++++++++++++++++++++++++++++++------
> > >  kernel/bpf/map_in_map.c |   5 +-
> > >  kernel/bpf/syscall.c    | 112 +++++++++++++++++++++++++++-
> > >  kernel/bpf/verifier.c   | 120 ++++++++++++++++++++++++++++++
> > >  6 files changed, 401 insertions(+), 28 deletions(-)
> > >
> >
> > [...]
> >
> > > +static int btf_find_field_kptr(const struct btf *btf, const struct btf_type *t,
> > > +                              u32 off, int sz, struct btf_field_info *info)
> > > +{
> > > +       /* For PTR, sz is always == 8 */
> > > +       if (!btf_type_is_ptr(t))
> > > +               return BTF_FIELD_IGNORE;
> > > +       t = btf_type_by_id(btf, t->type);
> > > +
> > > +       if (!btf_type_is_type_tag(t))
> > > +               return BTF_FIELD_IGNORE;
> > > +       /* Reject extra tags */
> > > +       if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
> > > +               return -EINVAL;
> >
> > Can we have tag -> const -> tag -> volatile -> tag in BTF? Wouldn't
> > you assume there are no more tags with just this check?
> >
>
> All tags are supposed to be before other modifiers, so tags come first, in
> continuity. See [0].

Doesn't seem like kernel's BTF validator enforces this, we should
probably tighten that up a bit. Clang won't emit such BTF, but nothing
prevents user from generating non-conformant BTF on its own either.

>
> Alexei suggested to reject all other tags for now.
>
>  [0]: https://lore.kernel.org/bpf/20220127154627.665163-1-yhs@xxxxxx
>
> >
> > > +       if (strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
> > > +               return -EINVAL;
> > > +
> > > +       /* Get the base type */
> > > +       if (btf_type_is_modifier(t))
> > > +               t = btf_type_skip_modifiers(btf, t->type, NULL);
> > > +       /* Only pointer to struct is allowed */
> > > +       if (!__btf_type_is_struct(t))
> > > +               return -EINVAL;
> > > +
> > > +       info->type = t;
> > > +       info->off = off;
> > > +       return BTF_FIELD_FOUND;
> > >  }

[...]

> > > +       if (map_value_has_kptr(map)) {
> > > +               if (!bpf_capable())
> > > +                       return -EPERM;
> > > +               if (map->map_flags & BPF_F_RDONLY_PROG) {
> > > +                       ret = -EACCES;
> > > +                       goto free_map_tab;
> > > +               }
> > > +               if (map->map_type != BPF_MAP_TYPE_HASH &&
> > > +                   map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> > > +                   map->map_type != BPF_MAP_TYPE_ARRAY) {
> >
> > what about PERCPU_ARRAY, for instance? Is there something
> > fundamentally wrong to support it for local storage maps?
> >
>
> Plugging in support into maps that already take timers was easier to begin, I
> can do percpu support as a follow up.
>
> In case of local storage, I'm a little worried about how we prevent creating
> reference cycles. There was a thread where find_get_task_by_pid was proposed as
> unstable helper, once we e.g. support embedding task_struct in map, and allow
> storing such pointer in task local storage, it would be pretty easy to construct
> a circular reference cycle.
>
> Should we think about this now, or should we worry about this when task_struct
> is actually supported as kptr? It's not only task_struct, same applies to sock.
>
> There's a discussion to be had, hence I left it out for now.

PERCPU_ARRAY seemed (and still seems) like a safe map to support (same
as PERCPU_HASH), which is why I asked. I see concerns about local
storage, though, thanks.

>
> > > +                       ret = -EOPNOTSUPP;
> > > +                       goto free_map_tab;
> > > +               }
> > > +       }
> > > +
> > > +       if (map->ops->map_check_btf) {
> > >                 ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> > > +               if (ret < 0)
> > > +                       goto free_map_tab;
> > > +       }
> > >
> > > +       return ret;
> > > +free_map_tab:
> > > +       bpf_map_free_kptr_off_tab(map);
> > >         return ret;
> > >  }
> > >
> > > @@ -1639,7 +1745,7 @@ static int map_freeze(const union bpf_attr *attr)
> > >                 return PTR_ERR(map);
> > >
> > >         if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> > > -           map_value_has_timer(map)) {
> > > +           map_value_has_timer(map) || map_value_has_kptr(map)) {
> > >                 fdput(f);
> > >                 return -ENOTSUPP;
> > >         }
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 4ce9a528fb63..744b7362e52e 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -3507,6 +3507,94 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
> > >         return __check_ptr_off_reg(env, reg, regno, false);
> > >  }
> > >
> > > +static int map_kptr_match_type(struct bpf_verifier_env *env,
> > > +                              struct bpf_map_value_off_desc *off_desc,
> > > +                              struct bpf_reg_state *reg, u32 regno)
> > > +{
> > > +       const char *targ_name = kernel_type_name(off_desc->btf, off_desc->btf_id);
> > > +       const char *reg_name = "";
> > > +
> > > +       if (reg->type != PTR_TO_BTF_ID && reg->type != PTR_TO_BTF_ID_OR_NULL)
> >
> > base_type(reg->type) != PTR_TO_BTF_ID ?
> >
> > > +               goto bad_type;
> > > +
> > > +       if (!btf_is_kernel(reg->btf)) {
> > > +               verbose(env, "R%d must point to kernel BTF\n", regno);
> > > +               return -EINVAL;
> > > +       }
> > > +       /* We need to verify reg->type and reg->btf, before accessing reg->btf */
> > > +       reg_name = kernel_type_name(reg->btf, reg->btf_id);
> > > +
> > > +       if (__check_ptr_off_reg(env, reg, regno, true))
> > > +               return -EACCES;
> > > +
> > > +       if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
> > > +                                 off_desc->btf, off_desc->btf_id))
> > > +               goto bad_type;
> > > +       return 0;
> > > +bad_type:
> > > +       verbose(env, "invalid kptr access, R%d type=%s%s ", regno,
> > > +               reg_type_str(env, reg->type), reg_name);
> > > +       verbose(env, "expected=%s%s\n", reg_type_str(env, PTR_TO_BTF_ID), targ_name);
> >
> > why two separate verbose calls, you can easily combine them (and they
> > should be output on a single line given it's a single error)
> >
>
> reg_type_str cannot be called more than once in the same statement, since it
> reuses the same buffer.
>

ah, subtle, ok, never mind then, no big deal to have two verbose()
calls if there is a reason for it

> > > +       return -EINVAL;
> > > +}
> > > +
> >
> > [...]
>
> --
> Kartikeya