Re: [PATCH bpf] bpf: Fix crash due to OOB access when reg->type > __BPF_REG_TYPE_MAX

Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> · Thu, 17 Feb 2022 03:53:41 +0530

On Thu, Feb 17, 2022 at 03:14:05AM IST, Alexei Starovoitov wrote:
> On Thu, Feb 17, 2022 at 02:28:42AM +0530, Kumar Kartikeya Dwivedi wrote:
> > On Thu, Feb 17, 2022 at 01:19:56AM IST, Alexei Starovoitov wrote:
> > > On Wed, Feb 16, 2022 at 11:59:54PM +0530, Kumar Kartikeya Dwivedi wrote:
> > > > > > > Just use reg2btf_ids[base_type(reg->type)] instead?
> > > > > >
> > > > > > That would be incorrect I think, then we'd allow e.g. PTR_TO_TCP_SOCK_OR_NULL
> > > > > > and treat it as PTR_TO_TCP_SOCK, while current code only wants to permit
> > > > > > non-NULL variants for these three.
> > > > >
> > > > > add && !type_flag(reg->type) ?
> > > > >
> > > >
> > > > Ok, we can do that. WRT Hao's suggestion, we do allow NULL for ptr_to_mem_ok
> > > > case, so doing it for all of check_func_arg_match won't work.
> > >
> > > Right.
> > >
> > > > > But, first, please describe how you found it.
> > > > > Tried to pass PTR_TO_BTF_ID_OR_NULL into kfunc?
> > > > > Do you have some other changes in the verifier?
> > > >
> > > > Yes, was working on [0], tried to come up with a test case where verifier
> > > > printed bad register type being passed (one marked with a new flag), but noticed
> > > > that it would fall out of __BPF_REG_TYPE_MAX limit during kfunc check. Also, it
> > > > seems on non-KASAN build it actually doesn't crash sometimes, depends on what
> > > > the value at that offset is.
> > > >
> > > >   [0]: https://github.com/kkdwivedi/linux/commits/btf-ptr-in-map
> > >
> > > Nice. Thanks a ton. I did a quick look. Seems to be ready to post on the list?
> > > Can you add .c tests and post?
> > >
> >
> > Thanks for the quick review!
> >
> > Yes, I should be able to post by this weekend if I don't run into any other
> > problems/bugs in my code.
>
> The sooner the better, so we can merge our efforts sooner.
>
> > > Only one suggestion so far:
> > > Could you get rid of callback in btf_find_struct_field ?
> > > The callbacks obscure the control flow and make the code harder to follow.
> > > Do refactor btf_find_struct_field, but call btf_find_ptr_to_btf_id directly from it.
> > > Then you wouldn't need this ugly part:
> > > /* While callback cleans up after itself, later iterations can still
> > >  * return error without calling cb, so call free function again.
> > >  */
> > > if (ret < 0)
> > >   bpf_map_free_ptr_off_tab(map);
> > >
> >
> > Ok, makes sense, will change.
> >
> > > I've been thinking about the api for refcnted PTR_TO_BTF_ID in the map too.
> > > Regarding your issue with cmpxchg. I've come up with two helpers:
> > >
> > > BPF_CALL_3(bpf_kptr_try_set, void **, kptr, void *, ptr, int, refcnt_off)
> > > {
> > >         /* ptr is ptr_to_btf_id returned from bpf_*_lookup() with ptr->refcnt >= 1
> > >          * refcount_inc() has to be done before cmpxchg() because
> > >          * another cpu might do bpf_kptr_xchg+release.
> > >          */
> > >         refcount_inc((refcount_t *)(ptr + refcnt_off));
> > >         if (cmpxchg(kptr, NULL, ptr)) {
> > >                 /* refcnt >= 2 here */
> > >                 refcount_dec((refcount_t *)(ptr + refcnt_off));
> > >                 return -EBUSY;
> > >         }
> > >         return 0;
> > > }
> > >
> > > and
> > >
> > > BPF_CALL_2(bpf_kptr_get, void **, kptr, int, refcnt_off)
> > > {
> > >         void *ptr = READ_ONCE(kptr);
> > >
> > >         if (!ptr)
> > >                 return 0;
> > >         /* ptr->refcnt could be == 0 if another cpu did
> > >          * ptr2 = bpf_kptr_xchg();
> > >          * bpf_*_release(ptr2);
> > >          */
> > >         if (!refcount_inc_not_zero((refcount_t *)(ptr + refcnt_off)))
> > >                 return 0;
> > >         return (long) ptr;
> > > }
> > >
> >
> > Good point. I did think about this problem. Right now when you load a referenced
> > pointer from the BPF map, we mark it as PTR_UNTRUSTED, so you can dereference
>
> you mean UNreferenced pointer, right? Since that's what your patch is doing.
>

You caught a bug :). Instead of...

	else if (!ref_ptr)
		reg_flags |= PTR_UNTRUSTED;

it should simply be 'else', then for non-percpu BTF ID pointer and non-user BTF
ID pointer, only XCHG (from referenced pointer) unsets PTR_UNTRUSTED, both
CMPXCHG and LDX have untrusted bit set during mark_btf_ld_reg.

> > but not pass into a helper like this, so my thinking was that BPF user will
> > already have an additional check (e.g. pointer being reachable from hash table
> > key) before accessing it, and that the xchg can happen after the other program
> > has removed the value's visibility globally (by deleting it from table). And for
> > RCU objects dereference of PTR_UNTRUSTED still works and reads valid data, they
> > can also skip reading by observing ptr->refcnt == 0 for dying case.
>
> I was thinking whether we can allow LDX to return PTR_TO_BTF_ID | PTR_UNTRUSTED
> and do explicit refcount_inc_not_zero done in bpf program by exposing
> functions like maybe_get_net() as kfunc and mark them as 'acquire' functions.

Ack, code does not agree (see above) but I will fix it to be like this.

> After acquire the pointer would lose PTR_UNTRUSTED flag.

Right, we can allow passing PTR_TO_BTF_ID | PTR_UNTRUSTED only for this specific
inc_not_zero helpers.

> That's another option instead of doing bpf_kptr_get().
> The advantage of bpf_kptr_get() that it's one way of operating with kernel pointers.
> No need to remember which kfunc to call.
> But on the other side explicit acquire call is more flexible.
>

Right, I'm leaning towards an explicit call as well, atleast for now, using
unstable kfunc.

> > So in case of fast path as long as you don't need to pass BTF ID into helpers,
> > you don't have to increment refcount.
>
> pass btf_id into helpers? What do you mean? Pass ptr_to_btf_id into helpers?

Sorry, yes.

> Right. If it's PTR_UNTRUSTED then having such struct as read-only would be ok,
> but that is a narrow use case.
> All kernel structs are beefy. They have a ton of stuff and would typically be
> passed into something else. (struct nf_conn *)->status si the only example
> I can think of where refcnt-ing is not needed.
>
> The refcnt inc/dec is very fast when it's not contended.
>
> > As you note, this is not enough, e.g. we might want to pass a referenced pointer
> > into helpers without removing it from the map. In that case such bpf_kptr_get
> > helper makes sense, but I think 1) It needs to be inlined, because refcount_off
> > will only be available easily at verification time (from BTF),
>
> Sorry that bit wasn't obvious. The refcnt_off argument to kptr_try_set and kptr_get
> will be provided by the verifier. The prog cannot pass it on its own.
> The verifier will do btf_id of arg2 -> offsetof(struct nf_conn, ct_general.use)
> and patch R3 with that constant before the call insn.
>
> So inlining of kptr_try_set and kptr_get is not needed.
>

Ah, right, we can pass a hidden parameter. I was under the impression you'd want
to hardcode the offset and call during verification/JIT time itself, though we
can always change how its implemented later. Makes no difference in terms of
performance I think.

> > 2) We cannot
> > guarantee that it keeps working for same object across kernel versions, that
> > ties kernel into a stability guarantee reg. internal changes,
>
> See explanation above. It will work across kernels. bpf progs don't need to change.
>
> > 3) We can only
> > support it for subset of objects (that have RCU protection), which we can
> > ideally detect from BTF itself, by marking them as such, or using a flag during
> > destructor registration (which is done for ref'd pointer case).
>
> The kptr_try_set, kptr_get, kptr_xchg will work for sleepable bpf progs too.
> rcu protection is not a requirement.
> Since btf_id -> offsetof(refcnt variable) mapping is done on the kernel side
> it will work as allowlist for kernel structs too.
> If btf_id_of(some kernel struct) is not there, then such btf_id cannot be
> persisted into the map value.
>
> >
> > >
> > > and instead of recognizing xchg insn directly in the verifier
> > > I went with the helper:
> > > BPF_CALL_2(bpf_kptr_xchg, void **, kptr, void *, ptr)
> > > {
> > >         /* ptr is ptr_to_btf_id returned from bpf_*_lookup() with ptr->refcnt >= 1
> > >          * or ptr == NULL.
> > >          * returns ptr_to_btf_id with refcnt >= 1 or NULL
> > >          */
> > >         return (long) xchg(kptr, ptr);
> > > }
> > > It was easier to deal with.
> > > I don't have a strong preference for helper vs insn though.
> > > Let's brainstorm after you post patches.
> > >
> > > xchg or bpf_kptr_xchg only operation seems to be too limiting.
> >
> > Right, one other usecase me and Toke were exploring is using this support to
> > enable XDP queueing. We can have a normal PIFO map and allow embedding xdp_frame
> > in it. Then xchg is used during enqueue and dequeue from the map, and user can
> > have additional stuff in map value (like embedding both bpf_timer and xdp_frame,
> > or extra scalars as data), which should allow more composition/flexibility.
> >
> > To also reduce the overhead of this xchg on fast path, we can consider relaxing
> > atomic restriction for per-CPU maps instead, then add a bpf_move_ptr helper that
> > is non-atomic, but ensures invariant that value in per-CPU map is either
> > refcounted or NULL, and only in program or map at any point of time. Then the
> > overhead of moving object ownership is also eliminated, and helper itself can be
> > inlined to a load from the map and a store to the map (of the new pointer or
> > NULL).
> >
> > From my notes:
> >
> >   6 * Introduce per-CPU map support, allow non-atomic update for ref'd pointer, and
> >   7   enforce consistent value using a ptr = bpf_move_ptr(&map_val_ptr->ptr, ptr);
> >   8
> >   9   To move value out of map: ptr = bpf_move_ptr(&val->ptr, NULL)
> >  10   To move value into the map: ptr = bpf_move_ptr(&val->ptr, ptr)
> >
> > In both cases, reference state for R0 (with PTR_MAYBE_NULL set) is created, and
> > since per-CPU map is only accessible to that CPU, we don't need additional
> > protections, as long as we also block userspace bpf_map_update_elem in this
> > case, or only permit storing such pointers with per-CPU map setting
> > BPF_MAP_F_RDONLY_USER flag.
> >
> > Will revisit this in detail when I post it, code for this is missing from
> > GitHub.
>
> Please hold on to that.
> It looks like premature optimization to me.
> xchg is fast when it's not contended.
> Adding all that per-cpu map logic won't make it faster. percpu array lookup
> will cost more than xchg.
> You might end up doing all that just to realize that it's slower.
> Let's land the main infra first.
>

Ok, will wait.

> > > For example we might want to remember struct task_struct or struct net
> > > in a map for faster access.
> > > The __bpf_nf_ct_lookup is doing get_net_ns_by_id().
> > > idr_find isn't fast enough for XDP.
> > > We can store refcnted 'struct net *' for faster lookup.
> > > Then multiple cpus will be reading that pointer the map, but if only 'xchg' is
> > > available that use case won't work. One cpu will win the race.
> > > So we need bpf_kptr_get() will be an acquire helper.
> > > The bpf prog will do an explicit put_net(ptr); which will be a release kfunc.
> > >
> > > Simply reading refcnted ptr_to_btf_id from a map is PTR_UNTRUSTED too for sleepable bpf progs.
> >
> > It is already marked PTR_UNTRUSTED for sleepable too, IIUC, but based on above I
> > guess we can remove the flag for objects which are RCU protected, in case of
> > non-sleepable progs.
>
> Probably not. Even non-sleepable prog under rcu_read_lock shouldn't be able
> to pass 'struct tcp_sock *' from the map into helpers.
> That sock can have refcnt==0. Even under rcu it's dangerous.
>

Good point, so instead of helpers doing the inc_not_zero each time, prog can do
it once and pass into other helpers.

> > We also need to prevent the kptr_get helper from being
> > called from sleepable progs.
>
> why is that?

My confusion comes because I don't think that normal call_rcu/synchronize_rcu
grace period for the object waits for RCU trace readers. In sleepable prog, that
same xchg+release can happen while we hold a untrusted pointer from LDX in the
RCU trace section. Then I think that kptr_get may be wrong, because object may
already have been freed without waiting for sleepable prog to call
rcu_read_unlock_trace.

Please correct me if I'm wrong, or misunderstood this case.

Or did you mean we will fix this for objects if exposing to sleepable progs, so
that we can make kptr_get work?

--
Kartikeya