On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@xxxxxx> wrote: > > At Meta we have a profiling daemon which periodically collects > information on many hosts. This collection usually involves grabbing > stacks (user and kernel) using perf_event BPF progs and later symbolicating > them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on > remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In > those cases we must fall back to digging around in /proc/PID/maps to map > virtual address to (binary, offset). The /proc/PID/maps digging does not > occur synchronously with stack collection, so the process might already > be gone, in which case it won't have /proc/PID/maps and we will fail to > symbolicate. > > This 'exited process problem' doesn't occur very often as > most of the prod services we care to profile are long-lived daemons, > there are enough usecases to warrant a workaround: a BPF program which > can be optionally loaded at data collection time and essentially walks > /proc/PID/maps. Currently this is done by walking the vma list: > > struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap); > mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */ > > Since commit 763ecb035029 ("mm: remove the vma linked list") there's no > longer a vma linked list to walk. Walking the vma maple tree is not as > simple as hopping struct vm_area_struct->vm_next. That commit replaces > vm_next hopping with calls to find_vma(mm, addr) helper function, which > returns the vma containing addr, or if no vma contains addr, > the closest vma with higher start addr. > > The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around > find_vma, with the major difference that no 'closest vma' is returned if > there is no VMA containing a particular address. This prevents BPF > programs from being able to use bpf_find_vma to iterate all vmas in a > task in a reasonable way. > > This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores > 'closest vma' behavior when used. Because this is find_vma's default > behavior it's as straightforward as nerfing a 'vma contains addr' check > on find_vma retval. > > Also, change bpf_find_vma's address parameter to 'addr' instead of > 'start'. The former is used in documentation and more accurately > describes the param. > > [ > RFC: This isn't an ideal solution for iteration of all vmas in a task > in the long term for a few reasons: > > * In nmi context, second call to bpf_find_vma will fail because > irq_work is busy, so can't iterate all vmas > * Repeatedly taking and releasing mmap_read lock when a dedicated > iterate_all_vmas(task) kfunc could just take it once and hold for > all vmas > > My specific usecase doesn't do vma iteration in nmi context and I > think the 'closest vma' behavior can be useful here despite locking > inefficiencies. > > When Alexei and I discussed this offline, two alternatives to > provide similar functionality while addressing above issues seemed > reasonable: > > * open-coded iterator for task vma. Similar to existing > task_vma bpf_iter, but no need to create a bpf_link and read > bpf_iter fd from userspace. > * New kfunc taking callback similar bpf_find_vma, but iterating > over all vmas in one go > > I think this patch is useful on its own since it's a fairly minimal > change and fixes my usecase. Sending for early feedback and to > solicit further thought about whether this should be dropped in > favor of one of the above options. - In theory this patch can work, but patch 2 didn't attempt to actually use it in a loop to iterate all vma-s. Which is a bit of red flag whether such iteration is practical (either via bpf_loop or bpf_for). - This behavior of bpf_find_vma() feels too much implementation detail. find_vma will probably stay this way, since different parts of the kernel rely on it, but exposing it like BPF_F_VMA_NEXT leaks implementation too much. - Looking at task_vma_seq_get_next().. that's how vma iter should be done and I don't think bpf prog can do it on its own. Because with bpf_find_vma() the lock will drop at every step the problems described at that large comment will be hit sooner or later. All concerns combined I feel we better provide a new kfunc that iterates vma and drops the lock before invoking callback. It can be much simpler than task_vma_seq_get_next() if we don't drop the lock. Maybe it's ok. Doing it open coded iterators style is likely better. bpf_iter_vma_new() kfunc will do bpf_mmap_unlock_get_irq_work+mmap_read_trylock while bpf_iter_vma_destroy() will bpf_mmap_unlock_mm. I'd try to do open-code-iter first. It's a good test for the iter infra. bpf_iter_testmod_seq_new is an example of how to add a new iter. Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID. It's not a trusted arg. We better move away from this legacy pointer. bpf_iter_vma_new() should accept only trusted ptr to task_struct. fwiw bpf_get_current_task_btf_proto has .ret_type = RET_PTR_TO_BTF_ID_TRUSTED and it matters here. The bpf prog might look like: task = bpf_get_current_task_btf(); err = bpf_iter_vma_new(&it, task); while ((vma = bpf_iter_vma_next(&it))) ...; assuming lock is not dropped by _next.