On Mon, Jan 27, 2025 at 3:18 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > On Mon, Jan 27, 2025 at 2:27 PM Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > On Fri, Jan 24, 2025 at 11:56 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote: > > > > > > We use map->freeze_mutex to prevent races between map_freeze() and > > > memory mapping BPF map contents with writable permissions. The way we > > > naively do this means we'll hold freeze_mutex for entire duration of all > > > the mm and VMA manipulations, which is completely unnecessary. This can > > > potentially also lead to deadlocks, as reported by syzbot in [0]. > > > > > > So, instead, hold freeze_mutex only during writeability checks, bump > > > (proactively) "write active" count for the map, unlock the mutex and > > > proceed with mmap logic. And only if something went wrong during mmap > > > logic, then undo that "write active" counter increment. > > > > > > Note, instead of checking VM_MAYWRITE we check VM_WRITE before and after > > > mmaping, because we also have a logic that unsets VM_MAYWRITE > > > forcefully, if VM_WRITE is not set. So VM_MAYWRITE could be set early on > > > for read-only mmaping, but it won't be afterwards. VM_WRITE is > > > a consistent way to detect writable mmaping in our implementation. > > > > bpf_map_mmap_open/bpf_map_mmap_open use VM_MAYWRITE, > > > > Do they need to change as well? > > So I didn't want to elaborate too much on this (because of > verboseness), but it is indeed non-obvious (I was confused by this for > a bit while working on the patch). > > We have this piece of logic in the middle of bpf_map_mmap(): > > if (!(vma->vm_flags & VM_WRITE)) > vm_flags_clear(vma, VM_MAYWRITE); > > After this point, VM_WRITE and VM_MAYWRITE are equivalent (when it > comes to BPF maps mmap-ing). I.e., if we have writable mapping, we'll > have both VM_WRITE and VM_MAYWRITE; if we have read-only mapping, we > won't have either. We can't have any other mix of those two. > > bpf_map_write_active_inc() used to happen after this point, and so we > were checking VM_MAYWRITE, but I had to move > bpf_map_write_active_inc() before that point, so I switched to > VM_WRITE check. > > bpf_map_mmap_open/bpf_map_mmap_close happen after this > vm_flags_clear(vma, VM_MAYWRITE), so whether they use VM_MAYWRITE or > VM_WRITE doesn't matter. So they should be fine as is. I see. Yeah. I think this analysis is correct. It seems to be fine as-is. > It is confusing, though, I agree. So maybe we should just normalize > all the checks to VM_WRITE and leave a comment that MAYWRITE and WRITE > are coupled with our custom mmaping logic? Yeah. That would be nice. > > > > > [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@xxxxxxxxxx/ > > > > > > Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") > > > Reported-by: syzbot+4dc041c686b7c816a71e@xxxxxxxxxxxxxxxxxxxxxxxxx > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > > --- > > > kernel/bpf/syscall.c | 20 +++++++++++++------- > > > 1 file changed, 13 insertions(+), 7 deletions(-) > > > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > > > index 0daf098e3207..0d5b39e99770 100644 > > > --- a/kernel/bpf/syscall.c > > > +++ b/kernel/bpf/syscall.c > > > @@ -1035,7 +1035,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = { > > > static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) > > > { > > > struct bpf_map *map = filp->private_data; > > > - int err; > > > + int err = 0; > > > > > > if (!map->ops->map_mmap || !IS_ERR_OR_NULL(map->record)) > > > return -ENOTSUPP; > > > @@ -1059,7 +1059,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) > > > err = -EACCES; > > > goto out; > > > } > > > + bpf_map_write_active_inc(map); > > > } > > > +out: > > > + mutex_unlock(&map->freeze_mutex); > > > + if (err) > > > + return err; > > > > > > /* set default open/close callbacks */ > > > vma->vm_ops = &bpf_map_default_vmops; > > > @@ -1070,13 +1075,14 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) > > > vm_flags_clear(vma, VM_MAYWRITE); > > > > > > err = map->ops->map_mmap(map, vma); > > > - if (err) > > > - goto out; > > > + if (err) { > > > + if (vma->vm_flags & VM_WRITE) { > > > + mutex_lock(&map->freeze_mutex); > > > + bpf_map_write_active_dec(map); > > > + mutex_unlock(&map->freeze_mutex); > > > > Extra lock/unlock looks unnecessary. > > > > This functiona and map_freeze() need to see frozen and write_active coherent, > > but write_active_dec looks like without mutex. > > It's atomic64_dec. > > Yep, I think you are right. I wanted a no-brainer change and not > having to think about any memory ordering effects or anything like > that. But seeing bpf_map_is_rdonly() checks this without any lock > anyways, I think we should be fine. I can drop this lock/unlock for > v2. Not only that. map_update/delete do it as well. So extra mutex_lock provokes questions like mine :) So pls remove.