> On Dec 17, 2020, at 6:34 PM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> 
> On Thu, Dec 17, 2020 at 10:08:31PM +0000, Song Liu wrote:
>> 
>> 
>>> On Dec 17, 2020, at 11:03 AM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>>> 
>>> On Tue, Dec 15, 2020 at 03:36:59PM -0800, Song Liu wrote:
>>>> +/*
>>>> + * Key information from vm_area_struct. We need this because we cannot
>>>> + * assume the vm_area_struct is still valid after each iteration.
>>>> + */
>>>> +struct __vm_area_struct {
>>>> +	__u64 start;
>>>> +	__u64 end;
>>>> +	__u64 flags;
>>>> +	__u64 pgoff;
>>>> +};
>>> 
>>> Whether it's inside .c or exposed in uapi/bpf.h, it will become uapi
>>> if it's used this way. Let's switch to arbitrary BTF-based access instead.
>>> 
>>>> +static struct __vm_area_struct *
>>>> +task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)
>>>> +{
>>>> +	struct pid_namespace *ns = info->common.ns;
>>>> +	struct task_struct *curr_task;
>>>> +	struct vm_area_struct *vma;
>>>> +	u32 curr_tid = info->tid;
>>>> +	bool new_task = false;
>>>> +
>>>> +	/* If this function returns a non-NULL __vm_area_struct, it holds
>>>> +	 * a reference to the task_struct. If info->file is non-NULL, it
>>>> +	 * also holds a reference to the file. If this function returns
>>>> +	 * NULL, it does not hold any reference.
>>>> +	 */
>>>> +again:
>>>> +	if (info->task) {
>>>> +		curr_task = info->task;
>>>> +	} else {
>>>> +		curr_task = task_seq_get_next(ns, &curr_tid, true);
>>>> +		if (!curr_task) {
>>>> +			info->task = NULL;
>>>> +			info->tid++;
>>>> +			return NULL;
>>>> +		}
>>>> +
>>>> +		if (curr_tid != info->tid) {
>>>> +			info->tid = curr_tid;
>>>> +			new_task = true;
>>>> +		}
>>>> +
>>>> +		if (!curr_task->mm)
>>>> +			goto next_task;
>>>> +		info->task = curr_task;
>>>> +	}
>>>> +
>>>> +	mmap_read_lock(curr_task->mm);
>>> 
>>> That will hurt. /proc readers do that and it causes all sorts
>>> of production issues. We cannot take this lock.
>>> There is no need to take it.
>>> Switch the whole thing to probe_read style walking.
>>> And reimplement find_vma with probe_read while omitting vmacache.
>>> It will be a short rbtree walk.
>>> bpf prog doesn't need to see a stable struct. It will read it through ptr_to_btf_id
>>> which will use probe_read equivalent underneath.
>> 
>> rw_semaphore is designed to avoid write starvation, so read_lock should not cause
>> problems unless the lock is held for an extended period. [1] is a recent fix that
>> avoids the /proc issue by releasing mmap_lock between iterations. We are using a
>> similar mechanism here. BTW: I changed this to mmap_read_lock_killable() in the
>> next version.
>> 
>> On the other hand, we need a valid vm_file pointer for bpf_d_path. So walking the
> 
> ahh. I missed that. Makes sense.
> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id.

Passing a pointer to vm_area_struct into BPF will be tricky. For example, shall we
allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will
allow access to vma->vm_file as a valid pointer to struct file. However, since the
vma might be freed by then, vma->vm_file could point to random data (see the
BPF-side sketch at the end of this mail).

> 
>> rbtree without taking any lock would not work. We can avoid taking the lock once
>> some SPF-like mechanism is merged (hopefully soonish).
>> 
>> Did I miss anything?
>> 
>> We can improve bpf_iter with some mechanism to specify which task to iterate, so
>> that we don't have to iterate through all tasks when the user only wants to
>> inspect vmas in one task.
> 
> yes.
> let's figure out how to make it parametrizable.
> map_iter runs only for given map_fd.
> Maybe vma_iter should run only for given pidfd?
> I think all_task_all_vmas iter is nice to have, but we don't really need it?
> 
>> Thanks,
>> Song
>> 
>> [1] ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock")
> 
> Thanks for this link. With the "if (mmap_lock_is_contended())" check it should work indeed.

To make sure we are on the same page: I am using a slightly different mechanism in
task_vma_iter, one that doesn't require checking mmap_lock_is_contended(). In the
smaps_rollup case, the code only unlocks mmap_sem when the lock is contended. In
task_vma_iter, we always unlock mmap_sem between two iterations. This is because we
don't want to hold mmap_sem while calling the BPF program, which may sleep (e.g., in
bpf_d_path). A sketch of this follows below.

Thanks,
Song
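
P.S. To make the "always unlock between iterations" mechanism concrete, here is a
simplified sketch of one iteration step. Helper and field names (prev_vm_end,
vma_copy, etc.) are illustrative, not the actual patch; error handling, refcounting
on the task, and switching between tasks are omitted:

static struct __vm_area_struct *
get_next_vma_sketch(struct bpf_iter_seq_task_vma_info *info)
{
	struct vm_area_struct *vma;

	if (mmap_read_lock_killable(info->task->mm))
		return NULL;

	/* Re-find our position: the vma from the previous iteration
	 * may have been freed while the lock was dropped, so look up
	 * the next vma by the saved end address.
	 */
	vma = find_vma(info->task->mm, info->prev_vm_end);
	if (!vma) {
		mmap_read_unlock(info->task->mm);
		return NULL;
	}

	/* Copy out the fields the iterator exposes ... */
	info->vma_copy.start = vma->vm_start;
	info->vma_copy.end   = vma->vm_end;
	info->vma_copy.flags = vma->vm_flags;
	info->vma_copy.pgoff = vma->vm_pgoff;
	info->prev_vm_end    = vma->vm_end;

	/* ... pin vm_file so bpf_d_path() stays safe after unlock ... */
	info->file = vma->vm_file ? get_file(vma->vm_file) : NULL;

	/* ... and drop mmap_lock _before_ the BPF program runs. */
	mmap_read_unlock(info->task->mm);
	return &info->vma_copy;
}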
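
On the BPF side, the vm_file distinction would look roughly like this. This is a
hypothetical program: the "iter/task_vma" section name and the ctx layout are
assumptions about a possible future interface, not a current API:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

SEC("iter/task_vma")
int dump_vma(struct bpf_iter__task_vma *ctx)
{
	struct vm_area_struct *vma = ctx->vma;
	char path[256];

	if (!vma)
		return 0;

	/* Scalar fields (vm_start, vm_flags, ...) are read through
	 * ptr_to_btf_id, i.e. a probe_read equivalent underneath, so
	 * a stale vma could at worst yield garbage values.
	 * bpf_d_path() is different: it really dereferences f_path,
	 * which is why the iterator must pin vm_file across the
	 * mmap_lock drop.
	 */
	if (vma->vm_file)
		bpf_d_path(&vma->vm_file->f_path, path, sizeof(path));

	return 0;
}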