On 8/1/23 4:41 PM, Alexei Starovoitov wrote: > On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@xxxxxx> wrote: >> >> At Meta we have a profiling daemon which periodically collects >> information on many hosts. This collection usually involves grabbing >> stacks (user and kernel) using perf_event BPF progs and later symbolicating >> them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on >> remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In >> those cases we must fall back to digging around in /proc/PID/maps to map >> virtual address to (binary, offset). The /proc/PID/maps digging does not >> occur synchronously with stack collection, so the process might already >> be gone, in which case it won't have /proc/PID/maps and we will fail to >> symbolicate. >> >> This 'exited process problem' doesn't occur very often as >> most of the prod services we care to profile are long-lived daemons, >> there are enough usecases to warrant a workaround: a BPF program which >> can be optionally loaded at data collection time and essentially walks >> /proc/PID/maps. Currently this is done by walking the vma list: >> >> struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap); >> mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */ >> >> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no >> longer a vma linked list to walk. Walking the vma maple tree is not as >> simple as hopping struct vm_area_struct->vm_next. That commit replaces >> vm_next hopping with calls to find_vma(mm, addr) helper function, which >> returns the vma containing addr, or if no vma contains addr, >> the closest vma with higher start addr. >> >> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around >> find_vma, with the major difference that no 'closest vma' is returned if >> there is no VMA containing a particular address. This prevents BPF >> programs from being able to use bpf_find_vma to iterate all vmas in a >> task in a reasonable way. >> >> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores >> 'closest vma' behavior when used. Because this is find_vma's default >> behavior it's as straightforward as nerfing a 'vma contains addr' check >> on find_vma retval. >> >> Also, change bpf_find_vma's address parameter to 'addr' instead of >> 'start'. The former is used in documentation and more accurately >> describes the param. >> >> [ >> RFC: This isn't an ideal solution for iteration of all vmas in a task >> in the long term for a few reasons: >> >> * In nmi context, second call to bpf_find_vma will fail because >> irq_work is busy, so can't iterate all vmas >> * Repeatedly taking and releasing mmap_read lock when a dedicated >> iterate_all_vmas(task) kfunc could just take it once and hold for >> all vmas >> >> My specific usecase doesn't do vma iteration in nmi context and I >> think the 'closest vma' behavior can be useful here despite locking >> inefficiencies. >> >> When Alexei and I discussed this offline, two alternatives to >> provide similar functionality while addressing above issues seemed >> reasonable: >> >> * open-coded iterator for task vma. Similar to existing >> task_vma bpf_iter, but no need to create a bpf_link and read >> bpf_iter fd from userspace. >> * New kfunc taking callback similar bpf_find_vma, but iterating >> over all vmas in one go >> >> I think this patch is useful on its own since it's a fairly minimal >> change and fixes my usecase. Sending for early feedback and to >> solicit further thought about whether this should be dropped in >> favor of one of the above options. > > - In theory this patch can work, but patch 2 didn't attempt to actually > use it in a loop to iterate all vma-s. > Which is a bit of red flag whether such iteration is practical > (either via bpf_loop or bpf_for). > > - This behavior of bpf_find_vma() feels too much implementation detail. > find_vma will probably stay this way, since different parts of the kernel > rely on it, but exposing it like BPF_F_VMA_NEXT leaks implementation too much. > > - Looking at task_vma_seq_get_next().. that's how vma iter should be done and > I don't think bpf prog can do it on its own. > Because with bpf_find_vma() the lock will drop at every step the problems > described at that large comment will be hit sooner or later. > > All concerns combined I feel we better provide a new kfunc that iterates vma > and drops the lock before invoking callback. > It can be much simpler than task_vma_seq_get_next() if we don't drop the lock. > Maybe it's ok. > Doing it open coded iterators style is likely better. > bpf_iter_vma_new() kfunc will do > bpf_mmap_unlock_get_irq_work+mmap_read_trylock > while bpf_iter_vma_destroy() will bpf_mmap_unlock_mm. > > I'd try to do open-code-iter first. It's a good test for the iter infra. > bpf_iter_testmod_seq_new is an example of how to add a new iter. > > Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID. > It's not a trusted arg. We better move away from this legacy pointer. > bpf_iter_vma_new() should accept only trusted ptr to task_struct. > fwiw bpf_get_current_task_btf_proto has > .ret_type = RET_PTR_TO_BTF_ID_TRUSTED and it matters here. > The bpf prog might look like: > task = bpf_get_current_task_btf(); > err = bpf_iter_vma_new(&it, task); > while ((vma = bpf_iter_vma_next(&it))) ...; > assuming lock is not dropped by _next. The only concern here that doesn't seem reasonable to me is the "too much implementation detail". I agree with the rest, though, so will send a different series with new implementation and point to this discussion.