On Mon, Apr 8, 2024 at 2:30 AM Marco Elver <elver@xxxxxxxxxx> wrote:
>
> On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver <elver@xxxxxxxxxx> wrote:
> > >
> > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> [...]
> > > > and the tasks can use mmaped array shared across all or unique to each
> > > > process.
> > > > And both bpf and user space can read/write them with a single instruction.
> > >
> > > That's BPF_F_MMAPABLE, right?
> > >
> > > That does not work because the mmapped region is global. Our requirements are:

It sounds not like "requirements", but a description of the proposed
solution.
Pls share the actual use case.
This "tracing prog" sounds more like a ghost scheduler that
wants to interact with known user processes.

> > >
> > > 1. Single tracing BPF program.
> > >
> > > 2. Per-process (per VM) memory region (here it's per-thread, but each
> > > thread just registers the same process-wide region). No sharing
> > > between processes.
> > >
> > > 3. From #2 it follows: exec unregisters the registered memory region;
> > > fork gets a cloned region.
> > >
> > > 4. Unprivileged processes can do prctl(REGISTER). Some of them might
> > > not be able to use the bpf syscall.
> > >
> > > The reason for #2 is that each user space process also writes to the
> > > memory region (read by the BPF program to make updates depending on
> > > what state it finds), and having shared state between processes
> > > doesn't work here.
> > >
> > > Is there any reasonable BPF facility that can do this today? (If
> > > BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a
> > > happy camper.)
> >
> > You could simulate something like this with multi-element ARRAY +
> > BPF_F_MMAPABLE, though you'd need to pre-allocate up to max number of
> > processes, so it's not an exact fit.
>
> Right, for production use this is infeasible.

Last I heard, ghost agent and a few important tasks can mmap bpf array
and share it with bpf prog. So quite feasible.

>
> > But what seems to be much closer is using BPF task-local storage, if
> > we support mmap()'ing its memory into user-space. We've had previous
> > discussions on how to achieve this (the simplest being that
> > mmap(task_local_map_fd, ...) maps current thread's part of BPF task
> > local storage). You won't get automatic cloning (you'd have to do that
> > from the BPF program on fork/exec tracepoint, for example), and within
> > the process you'd probably want to have just one thread (main?) to
> > mmap() initially and just share the pointer across all relevant
> > threads.
>
> In the way you imagine it, would that allow all threads sharing the
> same memory, despite it being task-local? Presumably each task's local
> storage would be mapped to just point to the same memory?
>
> > But this is a more generic building block, IMO. This relying
> > on BPF map also means pinning is possible and all the other BPF map
> > abstraction benefits.
>
> Deployment-wise it will make things harder because unprivileged
> processes still have to somehow get the map's shared fd somehow to
> mmap() it. Not unsolvable, and in general what you describe looks
> interesting, but I currently can't see how it will be simpler.

bpf map can be pinned into bpffs for any unpriv process to access.
Then any task can bpf_obj_get it and mmap it.
If you have a few such tasks then bpf array will do.
If you have millions of tasks then use bpf arena which is a sparse array.
Use pid as an index or some other per-task id.
Both bpf prog and all tasks can read/write such shared memory
with normal load/store instructions.

> In absence of all that, is a safer "bpf_probe_write_user()" like I
> proposed in this patch ("bpf_probe_write_user_registered()") at all
> appealing?

To be honest, another "probe" variant is not appealing.
It's pretty much bpf_probe_write_user without pr_warn_ratelimited.

The main issue with bpf_probe_read/write_user() is their non-determinism.
They will error when memory is swapped out.
These helpers are ok-ish for observability when consumers understand
that some events might be lost, but for 24/7 production use
losing reads becomes a problem that bpf prog cannot mitigate.
What is a bpf prog supposed to do when this safer bpf_probe_write_user
errors?
Use some other mechanism to communicate with user space?
A mechanism with such builtin randomness in behavior is a footgun for
bpf users.
We have bpf_copy_from_user*() that don't have this non-determinism.
We can introduce bpf_copy_to_user(), but it will be usable only
from sleepable bpf progs.
But it sounds like you need it somewhere where the scheduler makes
decisions, so I suspect bpf array or arena is a better fit.

Or something that extends bpf local storage map.
See long discussion:
https://lore.kernel.org/bpf/45878586-cc5f-435f-83fb-9a3c39824550@xxxxxxxxx/

I still like the idea to let user tasks register memory in
bpf local storage map, the kernel will pin such pages,
and then bpf prog can read/write these regions directly.
In bpf prog it will be:
ptr = bpf_task_storage_get(&map, task, ...);
if (ptr) { *ptr = ... }
and direct read/write into the same memory from user space.
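
For illustration only, and not taken from the thread: a minimal user-space sketch of the "pin the map in bpffs, bpf_obj_get it, mmap it" flow described above for a BPF_F_MMAPABLE array. The record layout (struct task_state), the helper names, and the way a task learns its slot index are all assumptions; the raw bpf(BPF_OBJ_GET) syscall is used here in place of libbpf's bpf_obj_get() only to keep the block free of library dependencies.

```c
#include <linux/bpf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical per-task record; the layout must match the BPF side. */
struct task_state {
	uint32_t flags;
	uint32_t counter;
};

/* BPF array maps place elements at value_size rounded up to 8 bytes. */
static size_t slot_offset(size_t idx, size_t value_size)
{
	return idx * ((value_size + 7) & ~(size_t)7);
}

/* Length to mmap(): the whole array, rounded up to full pages. */
static size_t map_len(size_t max_entries, size_t value_size, size_t page)
{
	size_t len = slot_offset(max_entries, value_size);

	return (len + page - 1) / page * page;
}

/* Open a map pinned in bpffs via the raw bpf(BPF_OBJ_GET) syscall. */
static int pinned_map_fd(const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	return (int)syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}

/*
 * Map the pinned array and return a pointer to slot idx, or NULL on any
 * failure. idx is assumed to be a per-task id handed out by some
 * out-of-band mechanism; max_entries must match the map definition.
 */
static struct task_state *task_slot(const char *path, size_t idx,
				    size_t max_entries)
{
	size_t len = map_len(max_entries, sizeof(struct task_state),
			     (size_t)sysconf(_SC_PAGESIZE));
	void *base;
	int fd;

	if (idx >= max_entries)
		return NULL;
	fd = pinned_map_fd(path);
	if (fd < 0)
		return NULL;
	base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd); /* the mapping holds its own reference to the map */
	if (base == MAP_FAILED)
		return NULL;
	return (struct task_state *)((char *)base +
				     slot_offset(idx, sizeof(struct task_state)));
}
```

A task would call something like task_slot("/sys/fs/bpf/task_state", my_id, 1024) once (the pin path and the 1024-entry size are made up here) and then read or write the returned struct with plain loads and stores. Note that this maps the whole array into every task, so it does not by itself provide the per-process isolation asked for earlier in the thread.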