On Tue, Jan 05, 2021 at 07:38:08PM +0000, Song Liu wrote: > > > > On Jan 5, 2021, at 9:27 AM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > > > On Tue, Jan 5, 2021 at 9:11 AM Song Liu <songliubraving@xxxxxx> wrote: > >> > >> > >> > >>> On Jan 5, 2021, at 8:27 AM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > >>> > >>> On Mon, Jan 4, 2021 at 9:47 PM Song Liu <songliubraving@xxxxxx> wrote: > >>>> > >>>> > >>>> > >>>>> On Jan 4, 2021, at 5:46 PM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > >>>>> > >>>>> On Fri, Dec 18, 2020 at 05:23:25PM +0000, Song Liu wrote: > >>>>>> > >>>>>> > >>>>>>> On Dec 18, 2020, at 8:38 AM, Yonghong Song <yhs@xxxxxx> wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On 12/17/20 9:23 PM, Alexei Starovoitov wrote: > >>>>>>>> On Thu, Dec 17, 2020 at 8:33 PM Song Liu <songliubraving@xxxxxx> wrote: > >>>>>>>>>> > >>>>>>>>>> ahh. I missed that. Makes sense. > >>>>>>>>>> vm_file needs to be accurate, but vm_area_struct should be accessed as ptr_to_btf_id. > >>>>>>>>> > >>>>>>>>> Passing pointer of vm_area_struct into BPF will be tricky. For example, shall we > >>>>>>>>> allow the user to access vma->vm_file? IIUC, with ptr_to_btf_id the verifier will > >>>>>>>>> allow access of vma->vm_file as a valid pointer to struct file. However, since the > >>>>>>>>> vma might be freed, vma->vm_file could point to random data. > >>>>>>>> I don't think so. The proposed patch will do get_file() on it. > >>>>>>>> There is actually no need to assign it into a different variable. > >>>>>>>> Accessing it via vma->vm_file is safe and cleaner. > >>>>>>> > >>>>>>> I did not check the code but do you have scenarios where vma is freed but old vma->vm_file is not freed due to reference counting, but > >>>>>>> freed vma area is reused so vma->vm_file could be garbage? > >>>>>> > >>>>>> AFAIK, once we unlock mmap_sem, the vma could be freed and reused. I guess ptr_to_btf_id > >>>>>> or probe_read would not help with this? > >>>>> > >>>>> Theoretically we can hack the verifier to treat some ptr_to_btf_id as "less > >>>>> valid" than the other ptr_to_btf_id, but the user experience will not be great. > >>>>> Reading such bpf prog will not be obvious. I think it's better to run bpf prog > >>>>> in mmap_lock then and let it access vma->vm_file. After prog finishes the iter > >>>>> bit can do if (mmap_lock_is_contended()) before iterating. That will deliver > >>>>> better performance too. Instead of task_vma_seq_get_next() doing > >>>>> mmap_lock/unlock at every vma. No need for get_file() either. And no > >>>>> __vm_area_struct exposure. > >>>> > >>>> I think there might be concern calling BPF program with mmap_lock, especially that > >>>> the program is sleepable (for bpf_d_path). It shouldn't be a problem for common > >>>> cases, but I am not 100% sure for corner cases (many instructions in BPF + sleep). > >>>> Current version is designed to be very safe for the workload, which might be too > >>>> conservative. > >>> > >>> I know and I agree with all that, but how do you propose to fix the > >>> vm_file concern > >>> without holding the lock while prog is running? I couldn't come up with a way. > >> > >> I guess the gap here is that I don't see why __vm_area_struct exposure is an issue. > >> It is similar to __sk_buff, and simpler (though we had more reasons to introduce > >> __sk_buff back when there wasn't BTF). > >> > >> If we can accept __vm_area_struct, current version should work, as it doesn't have > >> problem with vm_file > > > > True. The problem with __vm_area_struct is that it is a hard coded > > uapi with little to none > > extensibility. In this form vma iterator is not really useful beyond > > the example in selftest. > > With __vm_area_struct, we can still probe_read the page table, so we can > cover more use cases than the selftest. But I agree that it is not as > extensible as feeding real vm_area_struct with BTF to the BPF program. Another confusing thing with __vm_area_struct is vm_flags field. It's copied directly. As __vm_area_struct->flags this field is uapi field, but its contents are kernel internal. We avoided such corner cases in the past. When flags field is copied into uapi the kernel internal flags are encoded and exposed as separate uapi flags. That was the case with BPF_TCP_* flags. If we have to do this with vm_flags (VM_EXEC, etc) that might kill the patchset due to abi concerns.