On Thu, Sep 29, 2022 at 5:51 PM Carlos Llamas <cmllamas@xxxxxxxxxx> wrote: > > Commit c462ac288f2c ("mm: Introduce arch_validate_flags()") added a late > check in mmap_region() to let architectures validate vm_flags. The check > needs to happen after calling ->mmap() as the flags can potentially be > modified during this callback. > > If arch_validate_flags() check fails we unmap and free the vma. However, > the error path fails to undo the ->mmap() call that previously succeeded > and depending on the specific ->mmap() implementation this translates to > reference increments, memory allocations and other operations what will > not be cleaned up. > > There are several places (mainly device drivers) where this is an issue. > However, one specific example is bpf_map_mmap() which keeps count of the > mappings in map->writecnt. The count is incremented on ->mmap() and then > decremented on vm_ops->close(). When arch_validate_flags() fails this > count is off since bpf_map_mmap_close() is never called. > > One can reproduce this issue in arm64 devices with MTE support. Here the > vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been set > previously. From userspace then is enough to pass the PROT_MTE flag to > mmap() syscall to trigger the arch_validate_flags() failure. > > The following program reproduces this issue: > --- > #include <stdio.h> > #include <unistd.h> > #include <linux/unistd.h> > #include <linux/bpf.h> > #include <sys/mman.h> > > int main(void) > { > union bpf_attr attr = { > .map_type = BPF_MAP_TYPE_ARRAY, > .key_size = sizeof(int), > .value_size = sizeof(long long), > .max_entries = 256, > .map_flags = BPF_F_MMAPABLE, > }; > int fd; > > fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)); > mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0); > > return 0; > } > --- > > By manually adding some log statements to the vm_ops callbacks we can > confirm that when passing PROT_MTE to mmap() the map->writecnt is off > upon ->release(): > > With PROT_MTE flag: > root@debian:~# ./bpf-test > [ 111.263874] bpf_map_write_active_inc: map=9 writecnt=1 > [ 111.288763] bpf_map_release: map=9 writecnt=1 > > Without PROT_MTE flag: > root@debian:~# ./bpf-test > [ 157.816912] bpf_map_write_active_inc: map=10 writecnt=1 > [ 157.830442] bpf_map_write_active_dec: map=10 writecnt=0 > [ 157.832396] bpf_map_release: map=10 writecnt=0 > > This patch fixes the above issue by calling vm_ops->close() when the > arch_validate_flags() check fails, after this we can proceed to unmap > and free the vma on the error path. > > Fixes: c462ac288f2c ("mm: Introduce arch_validate_flags()") > Cc: Catalin Marinas <catalin.marinas@xxxxxxx> > Cc: Liam Howlett <liam.howlett@xxxxxxxxxx> > Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx> > Cc: <stable@xxxxxxxxxxxxxxx> # v5.10+ > Signed-off-by: Carlos Llamas <cmllamas@xxxxxxxxxx> > --- Makes sense to me, open/close callbacks should be symmetrical. From BPF-side of things: Acked-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > mm/mmap.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 9d780f415be3..36c08e2c78da 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1797,7 +1797,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, > if (!arch_validate_flags(vma->vm_flags)) { > error = -EINVAL; > if (file) > - goto unmap_and_free_vma; > + goto close_and_free_vma; > else > goto free_vma; > } > @@ -1844,6 +1844,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr, > > return addr; > > +close_and_free_vma: > + if (vma->vm_ops && vma->vm_ops->close) > + vma->vm_ops->close(vma); > unmap_and_free_vma: > fput(vma->vm_file); > vma->vm_file = NULL; > -- > 2.38.0.rc1.362.ged0d419d3c-goog >