Re: [PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 1 Aug 2023 13:41:27 -0700

On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
>
> At Meta we have a profiling daemon which periodically collects
> information on many hosts. This collection usually involves grabbing
> stacks (user and kernel) using perf_event BPF progs and later symbolicating
> them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
> remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
> those cases we must fall back to digging around in /proc/PID/maps to map
> virtual address to (binary, offset). The /proc/PID/maps digging does not
> occur synchronously with stack collection, so the process might already
> be gone, in which case it won't have /proc/PID/maps and we will fail to
> symbolicate.
>
> This 'exited process problem' doesn't occur very often as
> most of the prod services we care to profile are long-lived daemons,
> there are enough usecases to warrant a workaround: a BPF program which
> can be optionally loaded at data collection time and essentially walks
> /proc/PID/maps. Currently this is done by walking the vma list:
>
>   struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
>   mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */
>
> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
> longer a vma linked list to walk. Walking the vma maple tree is not as
> simple as hopping struct vm_area_struct->vm_next. That commit replaces
> vm_next hopping with calls to find_vma(mm, addr) helper function, which
> returns the vma containing addr, or if no vma contains addr,
> the closest vma with higher start addr.
>
> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
> find_vma, with the major difference that no 'closest vma' is returned if
> there is no VMA containing a particular address. This prevents BPF
> programs from being able to use bpf_find_vma to iterate all vmas in a
> task in a reasonable way.
>
> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
> 'closest vma' behavior when used. Because this is find_vma's default
> behavior it's as straightforward as nerfing a 'vma contains addr' check
> on find_vma retval.
>
> Also, change bpf_find_vma's address parameter to 'addr' instead of
> 'start'. The former is used in documentation and more accurately
> describes the param.
>
> [
>   RFC: This isn't an ideal solution for iteration of all vmas in a task
>        in the long term for a few reasons:
>
>      * In nmi context, second call to bpf_find_vma will fail because
>        irq_work is busy, so can't iterate all vmas
>      * Repeatedly taking and releasing mmap_read lock when a dedicated
>        iterate_all_vmas(task) kfunc could just take it once and hold for
>        all vmas
>
>     My specific usecase doesn't do vma iteration in nmi context and I
>     think the 'closest vma' behavior can be useful here despite locking
>     inefficiencies.
>
>     When Alexei and I discussed this offline, two alternatives to
>     provide similar functionality while addressing above issues seemed
>     reasonable:
>
>       * open-coded iterator for task vma. Similar to existing
>         task_vma bpf_iter, but no need to create a bpf_link and read
>         bpf_iter fd from userspace.
>       * New kfunc taking callback similar bpf_find_vma, but iterating
>         over all vmas in one go
>
>      I think this patch is useful on its own since it's a fairly minimal
>      change and fixes my usecase. Sending for early feedback and to
>      solicit further thought about whether this should be dropped in
>      favor of one of the above options.

- In theory this patch can work, but patch 2 didn't attempt to actually
use it in a loop to iterate all vma-s.
Which is a bit of red flag whether such iteration is practical
(either via bpf_loop or bpf_for).

- This behavior of bpf_find_vma() feels too much implementation detail.
find_vma will probably stay this way, since different parts of the kernel
rely on it, but exposing it like BPF_F_VMA_NEXT leaks implementation too much.

- Looking at task_vma_seq_get_next().. that's how vma iter should be done and
I don't think bpf prog can do it on its own.
Because with bpf_find_vma() the lock will drop at every step the problems
described at that large comment will be hit sooner or later.

All concerns combined I feel we better provide a new kfunc that iterates vma
and drops the lock before invoking callback.
It can be much simpler than task_vma_seq_get_next() if we don't drop the lock.
Maybe it's ok.
Doing it open coded iterators style is likely better.
bpf_iter_vma_new() kfunc will do
bpf_mmap_unlock_get_irq_work+mmap_read_trylock
while bpf_iter_vma_destroy() will bpf_mmap_unlock_mm.

I'd try to do open-code-iter first. It's a good test for the iter infra.
bpf_iter_testmod_seq_new is an example of how to add a new iter.

Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID.
It's not a trusted arg. We better move away from this legacy pointer.
bpf_iter_vma_new() should accept only trusted ptr to task_struct.
fwiw bpf_get_current_task_btf_proto has
.ret_type = RET_PTR_TO_BTF_ID_TRUSTED and it matters here.
The bpf prog might look like:
task = bpf_get_current_task_btf();
err = bpf_iter_vma_new(&it, task);
while ((vma = bpf_iter_vma_next(&it))) ...;
assuming lock is not dropped by _next.