On Fri, Jan 24, 2025 at 11:56 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote: > > We use map->freeze_mutex to prevent races between map_freeze() and > memory mapping BPF map contents with writable permissions. The way we > naively do this means we'll hold freeze_mutex for entire duration of all > the mm and VMA manipulations, which is completely unnecessary. This can > potentially also lead to deadlocks, as reported by syzbot in [0]. > > So, instead, hold freeze_mutex only during writeability checks, bump > (proactively) "write active" count for the map, unlock the mutex and > proceed with mmap logic. And only if something went wrong during mmap > logic, then undo that "write active" counter increment. > > Note, instead of checking VM_MAYWRITE we check VM_WRITE before and after > mmaping, because we also have a logic that unsets VM_MAYWRITE > forcefully, if VM_WRITE is not set. So VM_MAYWRITE could be set early on > for read-only mmaping, but it won't be afterwards. VM_WRITE is > a consistent way to detect writable mmaping in our implementation. bpf_map_mmap_open/bpf_map_mmap_close use VM_MAYWRITE, Do they need to change as well? > [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@xxxxxxxxxx/ > > Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") > Reported-by: syzbot+4dc041c686b7c816a71e@xxxxxxxxxxxxxxxxxxxxxxxxx > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > --- > kernel/bpf/syscall.c | 20 +++++++++++++------- > 1 file changed, 13 insertions(+), 7 deletions(-) > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 0daf098e3207..0d5b39e99770 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -1035,7 +1035,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = { > static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) > { > struct bpf_map *map = filp->private_data; > - int err; > + int err = 0; > > if (!map->ops->map_mmap || !IS_ERR_OR_NULL(map->record)) > return -ENOTSUPP; > @@ -1059,7 +1059,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) > err = -EACCES; > goto out; > } > + bpf_map_write_active_inc(map); > } > +out: > + mutex_unlock(&map->freeze_mutex); > + if (err) > + return err; > > /* set default open/close callbacks */ > vma->vm_ops = &bpf_map_default_vmops; > @@ -1070,13 +1075,14 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) > vm_flags_clear(vma, VM_MAYWRITE); > > err = map->ops->map_mmap(map, vma); > - if (err) > - goto out; > + if (err) { > + if (vma->vm_flags & VM_WRITE) { > + mutex_lock(&map->freeze_mutex); > + bpf_map_write_active_dec(map); > + mutex_unlock(&map->freeze_mutex); Extra lock/unlock looks unnecessary. This functiona and map_freeze() need to see frozen and write_active coherent, but write_active_dec looks like without mutex. It's atomic64_dec.