Re: [PATCH v2 bpf-next 1/4] bpf: introduce task_vma bpf_iter

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 12/17/20 6:34 PM, Alexei Starovoitov wrote:
On Thu, Dec 17, 2020 at 10:08:31PM +0000, Song Liu wrote:


On Dec 17, 2020, at 11:03 AM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:

On Tue, Dec 15, 2020 at 03:36:59PM -0800, Song Liu wrote:
+/*
+ * Key information from vm_area_struct. We need this because we cannot
+ * assume the vm_area_struct is still valid after each iteration.
+ */
+struct __vm_area_struct {
+	__u64 start;
+	__u64 end;
+	__u64 flags;
+	__u64 pgoff;
+};

Where it's inside .c or exposed in uapi/bpf.h it will become uapi
if it's used this way. Let's switch to arbitrary BTF-based access instead.

+static struct __vm_area_struct *
+task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)
+{
+	struct pid_namespace *ns = info->common.ns;
+	struct task_struct *curr_task;
+	struct vm_area_struct *vma;
+	u32 curr_tid = info->tid;
+	bool new_task = false;
+
+	/* If this function returns a non-NULL __vm_area_struct, it held
+	 * a reference to the task_struct. If info->file is non-NULL, it
+	 * also holds a reference to the file. If this function returns
+	 * NULL, it does not hold any reference.
+	 */
+again:
+	if (info->task) {
+		curr_task = info->task;
+	} else {
+		curr_task = task_seq_get_next(ns, &curr_tid, true);
+		if (!curr_task) {
+			info->task = NULL;
+			info->tid++;
+			return NULL;
+		}
+
+		if (curr_tid != info->tid) {
+			info->tid = curr_tid;
+			new_task = true;
+		}
+
+		if (!curr_task->mm)
+			goto next_task;
+		info->task = curr_task;
+	}
+
+	mmap_read_lock(curr_task->mm);

That will hurt. /proc readers do that and it causes all sorts
of production issues. We cannot take this lock.
There is no need to take it.
Switch the whole thing to probe_read style walking.
And reimplement find_vma with probe_read while omitting vmacache.
It will be short rbtree walk.
bpf prog doesn't need to see a stable struct. It will read it through ptr_to_btf_id
which will use probe_read equivalent underneath.

rw_semaphore is designed to avoid write starvation, so read_lock should not cause
problem unless the lock was taken for extended period. [1] was a recent fix that
avoids /proc issue by releasing mmap_lock between iterations. We are using similar
mechanism here. BTW: I changed this to mmap_read_lock_killable() in the next version.

On the other hand, we need a valid vm_file pointer for bpf_d_path. So walking the

ahh. I missed that. Makes sense.
vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

rbtree without taking any lock would not work. We can avoid taking the lock when
some SPF like mechanism merged (hopefully soonish).

Did I miss anything?

We can improve bpf_iter with some mechanism to specify which task to iterate, so
that we don't have to iterate through all tasks when the user only want to inspect
vmas in one task.

yes. let's figure out how to make it parametrizable.
map_iter runs only for given map_fd.
Maybe vma_iter should run only for given pidfd?
I think all_task_all_vmas iter is nice to have, but we don't really need it?

We could make pidfd optional? If pidfd == 0, all vmas of all tasks will be traversed. Otherwise, pidfd > 0, only vmas of that pidfd will be traversed.


Thanks,
Song

[1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")

Thanks for this link. With "if (mmap_lock_is_contended())" check it should work indeed.




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux