Re: [PATCH bpf-next] bpf: avoid holding freeze_mutex during mmap operation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jan 27, 2025 at 2:27 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Jan 24, 2025 at 11:56 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > We use map->freeze_mutex to prevent races between map_freeze() and
> > memory mapping BPF map contents with writable permissions. The way we
> > naively do this means we'll hold freeze_mutex for entire duration of all
> > the mm and VMA manipulations, which is completely unnecessary. This can
> > potentially also lead to deadlocks, as reported by syzbot in [0].
> >
> > So, instead, hold freeze_mutex only during writeability checks, bump
> > (proactively) "write active" count for the map, unlock the mutex and
> > proceed with mmap logic. And only if something went wrong during mmap
> > logic, then undo that "write active" counter increment.
> >
> > Note, instead of checking VM_MAYWRITE we check VM_WRITE before and after
> > mmaping, because we also have a logic that unsets VM_MAYWRITE
> > forcefully, if VM_WRITE is not set. So VM_MAYWRITE could be set early on
> > for read-only mmaping, but it won't be afterwards. VM_WRITE is
> > a consistent way to detect writable mmaping in our implementation.
>
> bpf_map_mmap_open/bpf_map_mmap_open use VM_MAYWRITE,
>
> Do they need to change as well?

So I didn't want to elaborate too much on this (because of
verboseness), but it is indeed non-obvious (I was confused by this for
a bit while working on the patch).

We have this piece of logic in the middle of bpf_map_mmap():

if (!(vma->vm_flags & VM_WRITE))
    vm_flags_clear(vma, VM_MAYWRITE);

After this point, VM_WRITE and VM_MAYWRITE are equivalent (when it
comes to BPF maps mmap-ing). I.e., if we have writable mapping, we'll
have both VM_WRITE and VM_MAYWRITE; if we have read-only mapping, we
won't have either. We can't have any other mix of those two.

bpf_map_write_active_inc() used to happen after this point, and so we
were checking VM_MAYWRITE, but I had to move
bpf_map_write_active_inc() before that point, so I switched to
VM_WRITE check.

bpf_map_mmap_open/bpf_map_mmap_close happen after this
vm_flags_clear(vma, VM_MAYWRITE), so whether they use VM_MAYWRITE or
VM_WRITE doesn't matter. So they should be fine as is.

It is confusing, though, I agree. So maybe we should just normalize
all the checks to VM_WRITE and leave a comment that MAYWRITE and WRITE
are coupled with our custom mmaping logic?

>
> >   [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@xxxxxxxxxx/
> >
> > Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
> > Reported-by: syzbot+4dc041c686b7c816a71e@xxxxxxxxxxxxxxxxxxxxxxxxx
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > ---
> >  kernel/bpf/syscall.c | 20 +++++++++++++-------
> >  1 file changed, 13 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 0daf098e3207..0d5b39e99770 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1035,7 +1035,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = {
> >  static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> >  {
> >         struct bpf_map *map = filp->private_data;
> > -       int err;
> > +       int err = 0;
> >
> >         if (!map->ops->map_mmap || !IS_ERR_OR_NULL(map->record))
> >                 return -ENOTSUPP;
> > @@ -1059,7 +1059,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> >                         err = -EACCES;
> >                         goto out;
> >                 }
> > +               bpf_map_write_active_inc(map);
> >         }
> > +out:
> > +       mutex_unlock(&map->freeze_mutex);
> > +       if (err)
> > +               return err;
> >
> >         /* set default open/close callbacks */
> >         vma->vm_ops = &bpf_map_default_vmops;
> > @@ -1070,13 +1075,14 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> >                 vm_flags_clear(vma, VM_MAYWRITE);
> >
> >         err = map->ops->map_mmap(map, vma);
> > -       if (err)
> > -               goto out;
> > +       if (err) {
> > +               if (vma->vm_flags & VM_WRITE) {
> > +                       mutex_lock(&map->freeze_mutex);
> > +                       bpf_map_write_active_dec(map);
> > +                       mutex_unlock(&map->freeze_mutex);
>
> Extra lock/unlock looks unnecessary.
>
> This functiona and map_freeze() need to see frozen and write_active coherent,
> but write_active_dec looks like without mutex.
> It's atomic64_dec.

Yep, I think you are right. I wanted a no-brainer change and not
having to think about any memory ordering effects or anything like
that. But seeing bpf_map_is_rdonly() checks this without any lock
anyways, I think we should be fine. I can drop this lock/unlock for
v2.





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux