On Mon, Jan 27, 2025 at 2:27 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Fri, Jan 24, 2025 at 11:56 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote: > > > > We use map->freeze_mutex to prevent races between map_freeze() and > > memory mapping BPF map contents with writable permissions. The way we > > naively do this means we'll hold freeze_mutex for entire duration of all > > the mm and VMA manipulations, which is completely unnecessary. This can > > potentially also lead to deadlocks, as reported by syzbot in [0]. > > > > So, instead, hold freeze_mutex only during writeability checks, bump > > (proactively) "write active" count for the map, unlock the mutex and > > proceed with mmap logic. And only if something went wrong during mmap > > logic, then undo that "write active" counter increment. > > > > Note, instead of checking VM_MAYWRITE we check VM_WRITE before and after > > mmaping, because we also have a logic that unsets VM_MAYWRITE > > forcefully, if VM_WRITE is not set. So VM_MAYWRITE could be set early on > > for read-only mmaping, but it won't be afterwards. VM_WRITE is > > a consistent way to detect writable mmaping in our implementation. > > bpf_map_mmap_open/bpf_map_mmap_open use VM_MAYWRITE, > > Do they need to change as well? So I didn't want to elaborate too much on this (because of verboseness), but it is indeed non-obvious (I was confused by this for a bit while working on the patch). We have this piece of logic in the middle of bpf_map_mmap(): if (!(vma->vm_flags & VM_WRITE)) vm_flags_clear(vma, VM_MAYWRITE); After this point, VM_WRITE and VM_MAYWRITE are equivalent (when it comes to BPF maps mmap-ing). I.e., if we have writable mapping, we'll have both VM_WRITE and VM_MAYWRITE; if we have read-only mapping, we won't have either. We can't have any other mix of those two. bpf_map_write_active_inc() used to happen after this point, and so we were checking VM_MAYWRITE, but I had to move bpf_map_write_active_inc() before that point, so I switched to VM_WRITE check. bpf_map_mmap_open/bpf_map_mmap_close happen after this vm_flags_clear(vma, VM_MAYWRITE), so whether they use VM_MAYWRITE or VM_WRITE doesn't matter. So they should be fine as is. It is confusing, though, I agree. So maybe we should just normalize all the checks to VM_WRITE and leave a comment that MAYWRITE and WRITE are coupled with our custom mmaping logic? > > > [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@xxxxxxxxxx/ > > > > Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") > > Reported-by: syzbot+4dc041c686b7c816a71e@xxxxxxxxxxxxxxxxxxxxxxxxx > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > --- > > kernel/bpf/syscall.c | 20 +++++++++++++------- > > 1 file changed, 13 insertions(+), 7 deletions(-) > > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > > index 0daf098e3207..0d5b39e99770 100644 > > --- a/kernel/bpf/syscall.c > > +++ b/kernel/bpf/syscall.c > > @@ -1035,7 +1035,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = { > > static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) > > { > > struct bpf_map *map = filp->private_data; > > - int err; > > + int err = 0; > > > > if (!map->ops->map_mmap || !IS_ERR_OR_NULL(map->record)) > > return -ENOTSUPP; > > @@ -1059,7 +1059,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) > > err = -EACCES; > > goto out; > > } > > + bpf_map_write_active_inc(map); > > } > > +out: > > + mutex_unlock(&map->freeze_mutex); > > + if (err) > > + return err; > > > > /* set default open/close callbacks */ > > vma->vm_ops = &bpf_map_default_vmops; > > @@ -1070,13 +1075,14 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) > > vm_flags_clear(vma, VM_MAYWRITE); > > > > err = map->ops->map_mmap(map, vma); > > - if (err) > > - goto out; > > + if (err) { > > + if (vma->vm_flags & VM_WRITE) { > > + mutex_lock(&map->freeze_mutex); > > + bpf_map_write_active_dec(map); > > + mutex_unlock(&map->freeze_mutex); > > Extra lock/unlock looks unnecessary. > > This functiona and map_freeze() need to see frozen and write_active coherent, > but write_active_dec looks like without mutex. > It's atomic64_dec. Yep, I think you are right. I wanted a no-brainer change and not having to think about any memory ordering effects or anything like that. But seeing bpf_map_is_rdonly() checks this without any lock anyways, I think we should be fine. I can drop this lock/unlock for v2.