Re: [PATCH bpf-next] bpf: avoid holding freeze_mutex during mmap operation

On Mon, Jan 27, 2025 at 3:18 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Mon, Jan 27, 2025 at 2:27 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Fri, Jan 24, 2025 at 11:56 AM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > >
> > > We use map->freeze_mutex to prevent races between map_freeze() and
> > > memory mapping BPF map contents with writable permissions. The way we
> > > naively do this means we'll hold freeze_mutex for the entire duration of all
> > > the mm and VMA manipulations, which is completely unnecessary. This can
> > > potentially also lead to deadlocks, as reported by syzbot in [0].
> > >
> > > So, instead, hold freeze_mutex only during writeability checks, bump
> > > (proactively) "write active" count for the map, unlock the mutex and
> > > proceed with mmap logic. And only if something went wrong during mmap
> > > logic, then undo that "write active" counter increment.
> > >
> > > Note, instead of checking VM_MAYWRITE we check VM_WRITE before and after
> > > mmapping, because we also have logic that forcefully unsets VM_MAYWRITE
> > > if VM_WRITE is not set. So VM_MAYWRITE could be set early on
> > > for a read-only mmapping, but it won't be afterwards. VM_WRITE is
> > > a consistent way to detect a writable mmapping in our implementation.
> >
> > bpf_map_mmap_open/bpf_map_mmap_close use VM_MAYWRITE.
> >
> > Do they need to change as well?
>
> So I didn't want to elaborate too much on this (because of
> verboseness), but it is indeed non-obvious (I was confused by this for
> a bit while working on the patch).
>
> We have this piece of logic in the middle of bpf_map_mmap():
>
> if (!(vma->vm_flags & VM_WRITE))
>     vm_flags_clear(vma, VM_MAYWRITE);
>
> After this point, VM_WRITE and VM_MAYWRITE are equivalent (when it
> comes to BPF maps mmap-ing). I.e., if we have writable mapping, we'll
> have both VM_WRITE and VM_MAYWRITE; if we have read-only mapping, we
> won't have either. We can't have any other mix of those two.
>
> bpf_map_write_active_inc() used to happen after this point, and so we
> were checking VM_MAYWRITE, but I had to move
> bpf_map_write_active_inc() before that point, so I switched to
> VM_WRITE check.
>
> bpf_map_mmap_open/bpf_map_mmap_close happen after this
> vm_flags_clear(vma, VM_MAYWRITE), so whether they use VM_MAYWRITE or
> VM_WRITE doesn't matter. So they should be fine as is.

I see. Yeah. I think this analysis is correct.
It seems to be fine as-is.
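
For readers following along, the coupling described above can be checked
with a small userspace sketch. This is not kernel code: the SK_* flag
values and the sk_* helpers are made up for illustration, and the sketch
assumes (as the mm core is understood to enforce for shared mappings)
that VM_WRITE is never set without VM_MAYWRITE on entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for VM_WRITE / VM_MAYWRITE; illustration only. */
#define SK_VM_WRITE    0x2u
#define SK_VM_MAYWRITE 0x20u

/*
 * Models the step in the middle of bpf_map_mmap(): a mapping that is not
 * writable now also loses the ability to become writable later.
 */
static unsigned int sk_normalize(unsigned int flags)
{
	/* Assumed precondition: WRITE implies MAYWRITE on entry. */
	assert(!(flags & SK_VM_WRITE) || (flags & SK_VM_MAYWRITE));

	if (!(flags & SK_VM_WRITE))
		flags &= ~SK_VM_MAYWRITE;
	return flags;
}

/* True when both bits are set or both are clear. */
static bool sk_flags_coupled(unsigned int flags)
{
	return !!(flags & SK_VM_WRITE) == !!(flags & SK_VM_MAYWRITE);
}
```

After sk_normalize(), testing either bit gives the same answer, which is
why the open/close callbacks can keep using VM_MAYWRITE. Before it, a
read-only mapping may still carry the MAYWRITE bit, so only the WRITE bit
is a reliable test at that point.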

> It is confusing, though, I agree. So maybe we should just normalize
> all the checks to VM_WRITE and leave a comment that MAYWRITE and WRITE
> are coupled with our custom mmapping logic?

Yeah. That would be nice.

> >
> > >   [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@xxxxxxxxxx/
> > >
> > > Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
> > > Reported-by: syzbot+4dc041c686b7c816a71e@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > ---
> > >  kernel/bpf/syscall.c | 20 +++++++++++++-------
> > >  1 file changed, 13 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 0daf098e3207..0d5b39e99770 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -1035,7 +1035,7 @@ static const struct vm_operations_struct bpf_map_default_vmops = {
> > >  static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> > >  {
> > >         struct bpf_map *map = filp->private_data;
> > > -       int err;
> > > +       int err = 0;
> > >
> > >         if (!map->ops->map_mmap || !IS_ERR_OR_NULL(map->record))
> > >                 return -ENOTSUPP;
> > > @@ -1059,7 +1059,12 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> > >                         err = -EACCES;
> > >                         goto out;
> > >                 }
> > > +               bpf_map_write_active_inc(map);
> > >         }
> > > +out:
> > > +       mutex_unlock(&map->freeze_mutex);
> > > +       if (err)
> > > +               return err;
> > >
> > >         /* set default open/close callbacks */
> > >         vma->vm_ops = &bpf_map_default_vmops;
> > > @@ -1070,13 +1075,14 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> > >                 vm_flags_clear(vma, VM_MAYWRITE);
> > >
> > >         err = map->ops->map_mmap(map, vma);
> > > -       if (err)
> > > -               goto out;
> > > +       if (err) {
> > > +               if (vma->vm_flags & VM_WRITE) {
> > > +                       mutex_lock(&map->freeze_mutex);
> > > +                       bpf_map_write_active_dec(map);
> > > +                       mutex_unlock(&map->freeze_mutex);
> >
> > Extra lock/unlock looks unnecessary.
> >
> > This function and map_freeze() need to see frozen and write_active coherently,
> > but write_active_dec doesn't look like it needs the mutex.
> > It's just an atomic64_dec.
>
> Yep, I think you are right. I wanted a no-brainer change and not
> having to think about any memory ordering effects or anything like
> that. But seeing bpf_map_is_rdonly() checks this without any lock
> anyways, I think we should be fine. I can drop this lock/unlock for
> v2.

Not only that. map_update/delete do it as well.
So extra mutex_lock provokes questions like mine :)
So pls remove.
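
The locking shape being discussed can be sketched in userspace roughly as
follows. This is only a model under stated assumptions, not the kernel
implementation: struct sk_map and the sk_* functions are hypothetical, a
pthread mutex stands in for freeze_mutex, a C11 atomic stands in for the
atomic64 write-active counter, and backend_fails simulates an error from
map->ops->map_mmap().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the relevant bpf_map fields. */
struct sk_map {
	pthread_mutex_t freeze_mutex;
	bool frozen;
	atomic_llong writecnt;
};

static void sk_map_init(struct sk_map *m)
{
	pthread_mutex_init(&m->freeze_mutex, NULL);
	m->frozen = false;
	atomic_init(&m->writecnt, 0);
}

/* Writable mmap: the mutex covers only the check plus the increment. */
static int sk_mmap_writable(struct sk_map *m, bool backend_fails)
{
	int err = 0;

	pthread_mutex_lock(&m->freeze_mutex);
	if (m->frozen)
		err = -EPERM;
	else
		atomic_fetch_add(&m->writecnt, 1);
	pthread_mutex_unlock(&m->freeze_mutex);
	if (err)
		return err;

	/* The mm/VMA work would run here, outside the mutex. */
	if (backend_fails) {
		/* Undo is a plain atomic decrement, no mutex needed. */
		atomic_fetch_sub(&m->writecnt, 1);
		return -EINVAL;
	}
	return 0;
}

/* Models the close callback of a writable mapping. */
static void sk_munmap_writable(struct sk_map *m)
{
	long long prev = atomic_fetch_sub(&m->writecnt, 1);

	assert(prev > 0);
}

/* Freeze: refuse while any writer is active or if already frozen. */
static int sk_freeze(struct sk_map *m)
{
	int err = 0;

	pthread_mutex_lock(&m->freeze_mutex);
	if (atomic_load(&m->writecnt) > 0 || m->frozen)
		err = -EBUSY;
	else
		m->frozen = true;
	pthread_mutex_unlock(&m->freeze_mutex);
	return err;
}
```

The point of the sketch: the mutex is needed where "frozen" and the
counter must be observed together (check-then-increment, check-then-freeze).
A late decrement can at worst make a concurrent freeze attempt fail with
-EBUSY, so it does not need the lock.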




