On Thu, Dec 17, 2020 at 10:08:31PM +0000, Song Liu wrote:
>
>
> > On Dec 17, 2020, at 11:03 AM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Tue, Dec 15, 2020 at 03:36:59PM -0800, Song Liu wrote:
> >> +/*
> >> + * Key information from vm_area_struct. We need this because we cannot
> >> + * assume the vm_area_struct is still valid after each iteration.
> >> + */
> >> +struct __vm_area_struct {
> >> +	__u64 start;
> >> +	__u64 end;
> >> +	__u64 flags;
> >> +	__u64 pgoff;
> >> +};
> >
> > Whether it's inside .c or exposed in uapi/bpf.h, it will become uapi
> > if it's used this way. Let's switch to arbitrary BTF-based access instead.
> >
> >> +static struct __vm_area_struct *
> >> +task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)
> >> +{
> >> +	struct pid_namespace *ns = info->common.ns;
> >> +	struct task_struct *curr_task;
> >> +	struct vm_area_struct *vma;
> >> +	u32 curr_tid = info->tid;
> >> +	bool new_task = false;
> >> +
> >> +	/* If this function returns a non-NULL __vm_area_struct, it held
> >> +	 * a reference to the task_struct. If info->file is non-NULL, it
> >> +	 * also holds a reference to the file. If this function returns
> >> +	 * NULL, it does not hold any reference.
> >> +	 */
> >> +again:
> >> +	if (info->task) {
> >> +		curr_task = info->task;
> >> +	} else {
> >> +		curr_task = task_seq_get_next(ns, &curr_tid, true);
> >> +		if (!curr_task) {
> >> +			info->task = NULL;
> >> +			info->tid++;
> >> +			return NULL;
> >> +		}
> >> +
> >> +		if (curr_tid != info->tid) {
> >> +			info->tid = curr_tid;
> >> +			new_task = true;
> >> +		}
> >> +
> >> +		if (!curr_task->mm)
> >> +			goto next_task;
> >> +		info->task = curr_task;
> >> +	}
> >> +
> >> +	mmap_read_lock(curr_task->mm);
> >
> > That will hurt. /proc readers do that and it causes all sorts
> > of production issues. We cannot take this lock.
> > There is no need to take it.
> > Switch the whole thing to probe_read style walking.
> > And reimplement find_vma with probe_read while omitting vmacache.
> > It will be a short rbtree walk.
> > bpf prog doesn't need to see a stable struct. It will read it through ptr_to_btf_id
> > which will use probe_read equivalent underneath.
>
> rw_semaphore is designed to avoid write starvation, so read_lock should not cause
> problems unless the lock is held for an extended period. [1] was a recent fix that
> avoids the /proc issue by releasing mmap_lock between iterations. We are using a
> similar mechanism here. BTW: I changed this to mmap_read_lock_killable() in the
> next version.
>
> On the other hand, we need a valid vm_file pointer for bpf_d_path. So walking the

ahh. I missed that. Makes sense.
vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

> rbtree without taking any lock would not work. We can avoid taking the lock once
> some SPF-like mechanism is merged (hopefully soonish).
>
> Did I miss anything?
>
> We can improve bpf_iter with some mechanism to specify which task to iterate, so
> that we don't have to iterate through all tasks when the user only wants to inspect
> vmas in one task.

yes. let's figure out how to make it parametrizable.
map_iter runs only for a given map_fd. Maybe vma_iter should run only for a given pidfd?
I think all_task_all_vmas iter is nice to have, but we don't really need it?

> Thanks,
> Song
>
> [1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")

Thanks for this link. With the "if (mmap_lock_is_contended())" check it should work indeed.
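
For reference, a rough sketch of that pattern (hypothetical helper names, not the
code from the patch under review): take mmap_lock around a single vma lookup, copy
out the fields the iterator needs, and drop the lock before the next iteration or
before the BPF program runs, so a queued writer never waits for a whole
address-space walk.

/*
 * Sketch only. vma_snapshot and snapshot_next_vma are made-up names used
 * for illustration; the real iterator in the patch does more bookkeeping.
 */
#include <linux/mm.h>
#include <linux/mmap_lock.h>

struct vma_snapshot {
	unsigned long start;
	unsigned long end;
	unsigned long flags;
	unsigned long pgoff;
};

/* Copy the vma at or after *resume_addr; hold mmap_lock only briefly. */
static bool snapshot_next_vma(struct mm_struct *mm, unsigned long *resume_addr,
			      struct vma_snapshot *snap)
{
	struct vm_area_struct *vma;

	if (mmap_read_lock_killable(mm))
		return false;

	vma = find_vma(mm, *resume_addr);
	if (!vma) {
		mmap_read_unlock(mm);
		return false;
	}

	/* The vma may disappear once we unlock, so copy what we need now. */
	snap->start = vma->vm_start;
	snap->end = vma->vm_end;
	snap->flags = vma->vm_flags;
	snap->pgoff = vma->vm_pgoff;
	*resume_addr = vma->vm_end;

	/*
	 * Do not hold mmap_lock across iterations; releasing it here is what
	 * keeps queued writers from stalling, the same idea as the
	 * mmap_lock_is_contended() check in ff9f47f6f00c.
	 */
	mmap_read_unlock(mm);
	return true;
}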