At Meta we have a profiling daemon which periodically collects information on many hosts. This collection usually involves grabbing stacks (user and kernel) using perf_event BPF progs and later symbolicating them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In those cases we must fall back to digging around in /proc/PID/maps to map virtual address to (binary, offset). The /proc/PID/maps digging does not occur synchronously with stack collection, so the process might already be gone, in which case it won't have /proc/PID/maps and we will fail to symbolicate. This 'exited process problem' doesn't occur very often as most of the prod services we care to profile are long-lived daemons, there are enough usecases to warrant a workaround: a BPF program which can be optionally loaded at data collection time and essentially walks /proc/PID/maps. Currently this is done by walking the vma list: struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap); mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */ Since commit 763ecb035029 ("mm: remove the vma linked list") there's no longer a vma linked list to walk. Walking the vma maple tree is not as simple as hopping struct vm_area_struct->vm_next. That commit replaces vm_next hopping with calls to find_vma(mm, addr) helper function, which returns the vma containing addr, or if no vma contains addr, the closest vma with higher start addr. The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around find_vma, with the major difference that no 'closest vma' is returned if there is no VMA containing a particular address. This prevents BPF programs from being able to use bpf_find_vma to iterate all vmas in a task in a reasonable way. This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores 'closest vma' behavior when used. Because this is find_vma's default behavior it's as straightforward as nerfing a 'vma contains addr' check on find_vma retval. Also, change bpf_find_vma's address parameter to 'addr' instead of 'start'. The former is used in documentation and more accurately describes the param. [ RFC: This isn't an ideal solution for iteration of all vmas in a task in the long term for a few reasons: * In nmi context, second call to bpf_find_vma will fail because irq_work is busy, so can't iterate all vmas * Repeatedly taking and releasing mmap_read lock when a dedicated iterate_all_vmas(task) kfunc could just take it once and hold for all vmas My specific usecase doesn't do vma iteration in nmi context and I think the 'closest vma' behavior can be useful here despite locking inefficiencies. When Alexei and I discussed this offline, two alternatives to provide similar functionality while addressing above issues seemed reasonable: * open-coded iterator for task vma. Similar to existing task_vma bpf_iter, but no need to create a bpf_link and read bpf_iter fd from userspace. * New kfunc taking callback similar bpf_find_vma, but iterating over all vmas in one go I think this patch is useful on its own since it's a fairly minimal change and fixes my usecase. Sending for early feedback and to solicit further thought about whether this should be dropped in favor of one of the above options. ] Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx> Cc: Nathan Slingerland <slinger@xxxxxxxx> --- include/uapi/linux/bpf.h | 14 ++++++++++++-- kernel/bpf/task_iter.c | 12 ++++++++---- tools/include/uapi/linux/bpf.h | 14 ++++++++++++-- 3 files changed, 32 insertions(+), 8 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 70da85200695..947187d76ebc 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5169,8 +5169,13 @@ union bpf_attr { * function with *task*, *vma*, and *callback_ctx*. * The *callback_fn* should be a static function and * the *callback_ctx* should be a pointer to the stack. - * The *flags* is used to control certain aspects of the helper. - * Currently, the *flags* must be 0. + * The *flags* is used to control certain aspects of the helper and + * may be one of the following: + * + * **BPF_F_VMA_NEXT** + * If no vma contains *addr*, call *callback_fn* with the next vma, + * i.e. the vma with lowest vm_start that is higher than *addr*. + * This replicates behavior of kernel's find_vma helper. * * The expected callback signature is * @@ -6026,6 +6031,11 @@ enum { BPF_F_EXCLUDE_INGRESS = (1ULL << 4), }; +/* Flags for bpf_find_vma helper */ +enum { + BPF_F_VMA_NEXT = (1ULL << 0), +}; + #define __bpf_md_ptr(type, name) \ union { \ type name; \ diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index c4ab9d6cdbe9..a8c87dcf36ad 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -777,7 +777,7 @@ static struct bpf_iter_reg task_vma_reg_info = { .show_fdinfo = bpf_iter_task_show_fdinfo, }; -BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start, +BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, addr, bpf_callback_t, callback_fn, void *, callback_ctx, u64, flags) { struct mmap_unlock_irq_work *work = NULL; @@ -785,10 +785,13 @@ BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start, bool irq_work_busy = false; struct mm_struct *mm; int ret = -ENOENT; + bool vma_next; - if (flags) + if (flags & ~BPF_F_VMA_NEXT) return -EINVAL; + vma_next = flags & BPF_F_VMA_NEXT; + if (!task) return -ENOENT; @@ -801,9 +804,10 @@ BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start, if (irq_work_busy || !mmap_read_trylock(mm)) return -EBUSY; - vma = find_vma(mm, start); + vma = find_vma(mm, addr); - if (vma && vma->vm_start <= start && vma->vm_end > start) { + if (vma && + ((vma->vm_start <= addr && vma->vm_end > addr) || vma_next)) { callback_fn((u64)(long)task, (u64)(long)vma, (u64)(long)callback_ctx, 0, 0); ret = 0; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 70da85200695..947187d76ebc 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5169,8 +5169,13 @@ union bpf_attr { * function with *task*, *vma*, and *callback_ctx*. * The *callback_fn* should be a static function and * the *callback_ctx* should be a pointer to the stack. - * The *flags* is used to control certain aspects of the helper. - * Currently, the *flags* must be 0. + * The *flags* is used to control certain aspects of the helper and + * may be one of the following: + * + * **BPF_F_VMA_NEXT** + * If no vma contains *addr*, call *callback_fn* with the next vma, + * i.e. the vma with lowest vm_start that is higher than *addr*. + * This replicates behavior of kernel's find_vma helper. * * The expected callback signature is * @@ -6026,6 +6031,11 @@ enum { BPF_F_EXCLUDE_INGRESS = (1ULL << 4), }; +/* Flags for bpf_find_vma helper */ +enum { + BPF_F_VMA_NEXT = (1ULL << 0), +}; + #define __bpf_md_ptr(type, name) \ union { \ type name; \ -- 2.34.1