[PATCH v1 bpf-next 1/2] [RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



At Meta we have a profiling daemon which periodically collects
information on many hosts. This collection usually involves grabbing
stacks (user and kernel) using perf_event BPF progs and later symbolicating
them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
those cases we must fall back to digging around in /proc/PID/maps to map
virtual address to (binary, offset). The /proc/PID/maps digging does not
occur synchronously with stack collection, so the process might already
be gone, in which case it won't have /proc/PID/maps and we will fail to
symbolicate.

This 'exited process problem' doesn't occur very often as
most of the prod services we care to profile are long-lived daemons,
there are enough usecases to warrant a workaround: a BPF program which
can be optionally loaded at data collection time and essentially walks
/proc/PID/maps. Currently this is done by walking the vma list:

  struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
  mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */

Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
longer a vma linked list to walk. Walking the vma maple tree is not as
simple as hopping struct vm_area_struct->vm_next. That commit replaces
vm_next hopping with calls to find_vma(mm, addr) helper function, which
returns the vma containing addr, or if no vma contains addr,
the closest vma with higher start addr.

The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
find_vma, with the major difference that no 'closest vma' is returned if
there is no VMA containing a particular address. This prevents BPF
programs from being able to use bpf_find_vma to iterate all vmas in a
task in a reasonable way.

This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
'closest vma' behavior when used. Because this is find_vma's default
behavior it's as straightforward as nerfing a 'vma contains addr' check
on find_vma retval.

Also, change bpf_find_vma's address parameter to 'addr' instead of
'start'. The former is used in documentation and more accurately
describes the param.

[
  RFC: This isn't an ideal solution for iteration of all vmas in a task
       in the long term for a few reasons:

     * In nmi context, second call to bpf_find_vma will fail because
       irq_work is busy, so can't iterate all vmas
     * Repeatedly taking and releasing mmap_read lock when a dedicated
       iterate_all_vmas(task) kfunc could just take it once and hold for
       all vmas

    My specific usecase doesn't do vma iteration in nmi context and I
    think the 'closest vma' behavior can be useful here despite locking
    inefficiencies.

    When Alexei and I discussed this offline, two alternatives to
    provide similar functionality while addressing above issues seemed
    reasonable:

      * open-coded iterator for task vma. Similar to existing
        task_vma bpf_iter, but no need to create a bpf_link and read
	bpf_iter fd from userspace.
      * New kfunc taking callback similar bpf_find_vma, but iterating
        over all vmas in one go

     I think this patch is useful on its own since it's a fairly minimal
     change and fixes my usecase. Sending for early feedback and to
     solicit further thought about whether this should be dropped in
     favor of one of the above options.
]

Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
Cc: Nathan Slingerland <slinger@xxxxxxxx>
---
 include/uapi/linux/bpf.h       | 14 ++++++++++++--
 kernel/bpf/task_iter.c         | 12 ++++++++----
 tools/include/uapi/linux/bpf.h | 14 ++++++++++++--
 3 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 70da85200695..947187d76ebc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5169,8 +5169,13 @@ union bpf_attr {
  *		function with *task*, *vma*, and *callback_ctx*.
  *		The *callback_fn* should be a static function and
  *		the *callback_ctx* should be a pointer to the stack.
- *		The *flags* is used to control certain aspects of the helper.
- *		Currently, the *flags* must be 0.
+ *		The *flags* is used to control certain aspects of the helper and
+ *		may be one of the following:
+ *
+ *		**BPF_F_VMA_NEXT**
+ *			If no vma contains *addr*, call *callback_fn* with the next vma,
+ *			i.e. the vma with lowest vm_start that is higher than *addr*.
+ *			This replicates behavior of kernel's find_vma helper.
  *
  *		The expected callback signature is
  *
@@ -6026,6 +6031,11 @@ enum {
 	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
 };
 
+/* Flags for bpf_find_vma helper */
+enum {
+	BPF_F_VMA_NEXT		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index c4ab9d6cdbe9..a8c87dcf36ad 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -777,7 +777,7 @@ static struct bpf_iter_reg task_vma_reg_info = {
 	.show_fdinfo		= bpf_iter_task_show_fdinfo,
 };
 
-BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
+BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, addr,
 	   bpf_callback_t, callback_fn, void *, callback_ctx, u64, flags)
 {
 	struct mmap_unlock_irq_work *work = NULL;
@@ -785,10 +785,13 @@ BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
 	bool irq_work_busy = false;
 	struct mm_struct *mm;
 	int ret = -ENOENT;
+	bool vma_next;
 
-	if (flags)
+	if (flags & ~BPF_F_VMA_NEXT)
 		return -EINVAL;
 
+	vma_next = flags & BPF_F_VMA_NEXT;
+
 	if (!task)
 		return -ENOENT;
 
@@ -801,9 +804,10 @@ BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
 	if (irq_work_busy || !mmap_read_trylock(mm))
 		return -EBUSY;
 
-	vma = find_vma(mm, start);
+	vma = find_vma(mm, addr);
 
-	if (vma && vma->vm_start <= start && vma->vm_end > start) {
+	if (vma &&
+	    ((vma->vm_start <= addr && vma->vm_end > addr) || vma_next)) {
 		callback_fn((u64)(long)task, (u64)(long)vma,
 			    (u64)(long)callback_ctx, 0, 0);
 		ret = 0;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 70da85200695..947187d76ebc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5169,8 +5169,13 @@ union bpf_attr {
  *		function with *task*, *vma*, and *callback_ctx*.
  *		The *callback_fn* should be a static function and
  *		the *callback_ctx* should be a pointer to the stack.
- *		The *flags* is used to control certain aspects of the helper.
- *		Currently, the *flags* must be 0.
+ *		The *flags* is used to control certain aspects of the helper and
+ *		may be one of the following:
+ *
+ *		**BPF_F_VMA_NEXT**
+ *			If no vma contains *addr*, call *callback_fn* with the next vma,
+ *			i.e. the vma with lowest vm_start that is higher than *addr*.
+ *			This replicates behavior of kernel's find_vma helper.
  *
  *		The expected callback signature is
  *
@@ -6026,6 +6031,11 @@ enum {
 	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
 };
 
+/* Flags for bpf_find_vma helper */
+enum {
+	BPF_F_VMA_NEXT		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
-- 
2.34.1






[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux