Re: [PATCH v3 bpf-next 13/15] bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.

Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> · Wed, 24 Aug 2022 21:49:30 +0200

On Sat, 20 Aug 2022 at 01:01, Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Aug 19, 2022 at 3:56 PM Kumar Kartikeya Dwivedi
> <memxor@xxxxxxxxx> wrote:
> >
> > On Sat, 20 Aug 2022 at 00:43, Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Sat, Aug 20, 2022 at 12:21:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > On Fri, 19 Aug 2022 at 23:43, Alexei Starovoitov
> > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > >
> > > > > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > > >
> > > > > Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > > > Then use call_rcu() to wait for normal progs to finish
> > > > > and finally do free_one() on each element when freeing objects
> > > > > into global memory pool.
> > > > >
> > > > > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > > > ---
> > > >
> > > > I fear this can make OOM issues very easy to run into, because one
> > > > sleepable prog that sleeps for a long period of time can hold the
> > > > freeing of elements from another sleepable prog which either does not
> > > > sleep often or sleeps for a very short period of time, and has a high
> > > > update frequency. I'm mostly worried that unrelated sleepable programs
> > > > not even using the same map will begin to affect each other.
> > >
> > > 'sleep for long time'? sleepable bpf prog doesn't mean that they can sleep.
> > > sleepable progs can copy_from_user, but they're not allowed to waste time.
> >
> > It is certainly possible to waste time, but indirectly, not through
> > the BPF program itself.
> >
> > If you have userfaultfd enabled (for unpriv users), an unprivileged
> > user can trap a sleepable BPF prog (say LSM) using bpf_copy_from_user
> > for as long as it wants. A similar case can be done using FUSE, IIRC.
> >
> > You can then say it's a problem about unprivileged users being able to
> > use userfaultfd or FUSE, or we could think about fixing
> > bpf_copy_from_user to return -EFAULT for this case, but it is totally
> > possible right now for malicious userspace to extend the tasks trace
> > gp like this for minutes (or even longer) on a system where sleepable
> > BPF programs are using e.g. bpf_copy_from_user.
>
> Well in that sense userfaultfd can keep all sorts of things
> in the kernel from making progress.
> But nothing to do with OOM.
> There is still the max_entries limit.
> The amount of objects in waiting_for_gp is guaranteed to be less
> than full prealloc.

My thinking was that once you hold the tasks trace GP open using uffd,
every such map on the system will eventually exhaust its max_entries,
since deleted elements cannot be reused until the GP ends. So yes, it
probably won't OOM, but it would be bad regardless.
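
To make that concrete, here is a toy userspace model (not kernel code;
every name in it is made up for illustration). It assumes, as the
discussion above suggests, that elements parked in waiting_for_gp still
count against max_entries until the grace period ends:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only. All names are hypothetical, not the kernel's. */
struct toy_map {
	size_t max_entries;    /* hard cap on elements, as for a BPF map */
	size_t in_use;         /* elements currently visible in the map */
	size_t waiting_for_gp; /* deleted elements parked until the GP ends */
};

/* Insert one element; fails once live plus parked elements hit the cap. */
static bool toy_update(struct toy_map *m)
{
	if (m->in_use + m->waiting_for_gp >= m->max_entries)
		return false;
	m->in_use++;
	return true;
}

/* Delete one element; it is parked instead of becoming reusable. */
static bool toy_delete(struct toy_map *m)
{
	if (m->in_use == 0)
		return false;
	m->in_use--;
	m->waiting_for_gp++;
	/* Assumed invariant: parked elements never exceed the cap. */
	assert(m->in_use + m->waiting_for_gp <= m->max_entries);
	return true;
}

/* Grace period ends: every parked element becomes reusable again. */
static void toy_gp_end(struct toy_map *m)
{
	m->waiting_for_gp = 0;
}

/* Count update+delete cycles that succeed while the GP is held open. */
static size_t toy_churn_until_stuck(struct toy_map *m)
{
	size_t cycles = 0;

	while (toy_update(m) && toy_delete(m))
		cycles++;
	return cycles;
}
```

In this model a map with max_entries = 4 gets through four
update/delete cycles and then rejects every update until toy_gp_end()
runs. Memory stays bounded, but the map is unusable for as long as the
GP is held.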

I think this instead argues that uffd (and even FUSE) should not be
available to untrusted processes on the system by default. Both are
used regularly to widen hard-to-hit race conditions in the kernel.

But anyway, there is currently no easy way to guarantee the lifetime
of elements for the sleepable case with overhead as low as trace RCU,
so it makes sense to go ahead with this.