Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Mon, 8 Apr 2024 11:24:19 -0700

On Mon, Apr 8, 2024 at 2:30 AM Marco Elver <elver@xxxxxxxxxx> wrote:
>
> On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver <elver@xxxxxxxxxx> wrote:
> > >
> > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> [...]
> > > > and the tasks can use mmaped array shared across all or unique to each
> > > > process.
> > > > And both bpf and user space can read/write them with a single instruction.
> > >
> > > That's BPF_F_MMAPABLE, right?
> > >
> > > That does not work because the mmapped region is global. Our requirements are:

These don't sound like "requirements", but like a description of the
proposed solution.
Please share the actual use case.
This "tracing prog" sounds more like a ghost scheduler that
wants to interact with known user processes.

> > >
> > > 1. Single tracing BPF program.
> > >
> > > 2. Per-process (per VM) memory region (here it's per-thread, but each
> > > thread just registers the same process-wide region).  No sharing
> > > between processes.
> > >
> > > 3. From #2 it follows: exec unregisters the registered memory region;
> > > fork gets a cloned region.
> > >
> > > 4. Unprivileged processes can do prctl(REGISTER). Some of them might
> > > not be able to use the bpf syscall.
> > >
> > > The reason for #2 is that each user space process also writes to the
> > > memory region (read by the BPF program to make updates depending on
> > > what state it finds), and having shared state between processes
> > > doesn't work here.
> > >
> > > Is there any reasonable BPF facility that can do this today? (If
> > > BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a
> > > happy camper.)
> >
> > You could simulate something like this with multi-element ARRAY +
> > BPF_F_MMAPABLE, though you'd need to pre-allocate up to max number of
> > processes, so it's not an exact fit.
>
> Right, for production use this is infeasible.

Last I heard, ghost agent and a few important tasks can mmap bpf array
and share it with bpf prog.
So quite feasible.

>
> > But what seems to be much closer is using BPF task-local storage, if
> > we support mmap()'ing its memory into user-space. We've had previous
> > discussions on how to achieve this (the simplest being that
> > mmap(task_local_map_fd, ...) maps current thread's part of BPF task
> > local storage). You won't get automatic cloning (you'd have to do that
> > from the BPF program on fork/exec tracepoint, for example), and within
> > the process you'd probably want to have just one thread (main?) to
> > mmap() initially and just share the pointer across all relevant
> > threads.
>
> In the way you imagine it, would that allow all threads sharing the
> same memory, despite it being task-local? Presumably each task's local
> storage would be mapped to just point to the same memory?
>
> > But this is a more generic building block, IMO. This relying
> > on BPF map also means pinning is possible and all the other BPF map
> > abstraction benefits.
>
> Deployment-wise it will make things harder because unprivileged
> processes still have to somehow get the map's shared fd to
> mmap() it. Not unsolvable, and in general what you describe looks
> interesting, but I currently can't see how it will be simpler.

bpf map can be pinned into bpffs for any unpriv process to access.
Then any task can bpf_obj_get it and mmap it.
If you have only a few such tasks then a bpf array will do.
If you have millions of tasks then use a bpf arena, which is a sparse array.
Use pid as an index or some other per-task id.
Both bpf prog and all tasks can read/write such shared memory
with normal load/store instructions.
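
A rough sketch of the user space side, assuming the map was created
with BPF_F_MMAPABLE and pinned into bpffs by whoever loaded the prog.
struct task_slot, the helper names and the slot layout are made up for
illustration; the raw bpf(BPF_OBJ_GET) call is what libbpf's
bpf_obj_get() does under the hood:

```c
#include <linux/bpf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical per-task slot; the bpf prog must use the same layout. */
struct task_slot {
	uint64_t from_user;	/* written by the task, read by the bpf prog */
	uint64_t from_bpf;	/* written by the bpf prog, read by the task */
};

/* Open a map pinned in bpffs. Returns an fd, or a negative value on error. */
int pinned_map_fd(const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(unsigned long)path;
	return (int)syscall(SYS_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}

/* mmap the array's value area. Returns NULL on any failure. */
struct task_slot *map_slots(const char *path, size_t nr_slots)
{
	size_t len = nr_slots * sizeof(struct task_slot);
	void *base;
	int fd = pinned_map_fd(path);

	if (fd < 0)
		return NULL;
	base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping keeps its own reference to the map */
	return base == MAP_FAILED ? NULL : base;
}
```

After that a task just does slots[idx].from_user = ...; and reads
slots[idx].from_bpf, where idx is whatever per-task id both sides
agree on.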

> In absence of all that, is a safer "bpf_probe_write_user()" like I
> proposed in this patch ("bpf_probe_write_user_registered()") at all
> appealing?

To be honest, another "probe" variant is not appealing.
It's pretty much bpf_probe_write_user without pr_warn_ratelimited.
The main issue with bpf_probe_read/write_user() is their non-determinism.
They run with page faults disabled, so they will error whenever the
target page is not resident, e.g. when memory is swapped out.
These helpers are ok-ish for observability when consumers understand
that some events might be lost, but for 24/7 production use
losing reads becomes a problem that bpf prog cannot mitigate.
What is the bpf prog supposed to do when this safer bpf_probe_write_user errors?
Use some other mechanism to communicate with user space?
A mechanism with such builtin randomness in behavior is a footgun for
bpf users.
We have bpf_copy_from_user*() that don't have this non-determinism.
We can introduce bpf_copy_to_user(), but it will be usable
only from sleepable bpf progs.
But it sounds like you need it somewhere where the scheduler makes decisions,
so I suspect bpf array or arena is a better fit.

Or something that extends bpf local storage map.
See long discussion:
https://lore.kernel.org/bpf/45878586-cc5f-435f-83fb-9a3c39824550@xxxxxxxxx/

I still like the idea of letting user tasks register memory in a
bpf local storage map: the kernel will pin such pages,
and then the bpf prog can read/write these regions directly.
In bpf prog it will be:
ptr = bpf_task_storage_get(&map, task, ...);
if (ptr) { *ptr = ... }
and direct read/write into the same memory from user space.