On Wed, Oct 16, 2024 at 5:40 PM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote: > * Jann Horn <jannh@xxxxxxxxxx> [241016 11:08]: > > vms_abort_munmap_vmas() is a recovery path where, on entry, some VMAs > > have already been torn down halfway (in a way we can't undo) but are > > still present in the maple tree. > > > > At this point, we *must* remove the VMAs from the VMA tree, otherwise > > we get UAF. > > Wait, the vma should be re-attached without PTEs and files closed in > this case. I don't see how we are hitting the UAF in your example - we > shouldn't unless there is something with the driver itself and the file > close? > > My concern is that I am missing an error scenario. Where are the files supposed to be closed? vms_clean_up_area() unlinks the VMA from the file and calls ->close() but doesn't unlink the file, right? FWIW, I tested on commit eca631b8fe80 (current mainline head), I didn't check whether anything in the MM tree already addresses this. (I probably should have...) > But since this is under a lock that allows sleeping, shouldn't the > shrinker kick in? Yeah, I think without error injection, you'd basically only fail this allocation if the OOM killer has already decided to kill your task and the system is entirely out of memory. OOM kills IIRC have two effects on the page allocator: 1. they allow allocating with no watermarks (ALLOC_OOM) (based on the theory that you might need to allocate some memory in order to be able to exit and free more memory) 2. they allow giving up on GFP_KERNEL allocations (see the "/* Avoid allocations with no watermarks from looping endlessly */" part of __alloc_pages_slowpath()) > Should we just use __GFP_NOFAIL for the first store > instead of the error path? Which first store? Do you mean get rid of vms_abort_munmap_vmas() entirely somehow? > > Fixes: 4f87153e82c4 ("mm: change failure of MAP_FIXED to restoring the gap on failure") > > Signed-off-by: Jann Horn <jannh@xxxxxxxxxx> > > --- > > This can be tested with the following reproducer (on a kernel built with > > CONFIG_KASAN=y, CONFIG_FAILSLAB=y, CONFIG_FAULT_INJECTION_DEBUG_FS=y, > > with the reproducer running as root): > > > > ``` > > > > typeof(x) __res = (x); \ > > if (__res == (typeof(x))-1) \ > > err(1, "SYSCHK(" #x ")"); \ > > __res; \ > > }) > > > > static void write_file(char *name, char *buf) { > > int fd = open(name, O_WRONLY); > > if (fd == -1) > > err(1, "unable to open for writing: %s", name); > > if (SYSCHK(write(fd, buf, strlen(buf))) != strlen(buf)) > > errx(1, "write %s", name); > > SYSCHK(close(fd)); > > } > > > > int main(void) { > > // make a large area with a bunch of VMAs > > char *area = SYSCHK(mmap(NULL, AREA_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)); > > for (int off=0; off<AREA_SIZE; off += 0x2000) > > SYSCHK(mmap(area+off, 0x1000, PROT_READ, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)); > > > > // open a file whose mappings use usbdev_vm_ops, and map it in part of this area > > int map_fd = SYSCHK(open("/dev/bus/usb/001/001", O_RDONLY)); > > void *map = SYSCHK(mmap(area, 0x1000, PROT_READ, MAP_SHARED|MAP_FIXED, map_fd, 0)); > > close(map_fd); > > > > // make RWX mapping request fail late > > SYSCHK(prctl(65/*PR_SET_MDWE*/, 1/*PR_MDWE_REFUSE_EXEC_GAIN*/, 0, 0, 0)); > > > > // some random other file > > int fd = SYSCHK(open("/etc/passwd", O_RDONLY)); > > > > /* disarm for now */ > > write_file("/sys/kernel/debug/failslab/probability", "0"); > > > > /* fail allocations of maple_node... */ > > write_file("/sys/kernel/debug/failslab/cache-filter", "Y"); > > write_file("/sys/kernel/slab/maple_node/failslab", "1"); > > /* ... even though it's a sleepable allocation... */ > > write_file("/sys/kernel/debug/failslab/ignore-gfp-wait", "N"); > > /* ... in this task... */ > > write_file("/sys/kernel/debug/failslab/task-filter", "Y"); > > write_file("/proc/self/make-it-fail", "1"); > > /* ... every time... */ > > write_file("/sys/kernel/debug/failslab/times", "-1"); > > /* ... after 8 successful allocations (value chosen experimentally)... */ > > write_file("/sys/kernel/debug/failslab/space", "2048"); // one object is 256 > > /* ... and print where the allocations actually failed... */ > > write_file("/sys/kernel/debug/failslab/verbose", "2"); > > /* ... and that's it, arm it */ > > write_file("/sys/kernel/debug/failslab/probability", "100"); > > > > // This mmap request will fail late due to MDWE. > > // The error recovery path will attempt to clear out the VMA pointers with a > > // spanning_store (which will be blocked by error injection). > > mmap(area, AREA_SIZE, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, fd, 0); > > > > /* disarm */ > > write_file("/sys/kernel/debug/failslab/probability", "0"); > > > > SYSCHK(munmap(map, 0x1000)); // UAF expected here > > system("cat /proc/$PPID/maps"); > > } > > ``` > > > > Result: > > ``` > > FAULT_INJECTION: forcing a failure. > > name failslab, interval 1, probability 100, space 256, times -1 > > CPU: 3 UID: 0 PID: 607 Comm: unmap-fail Not tainted 6.12.0-rc3-00013-geca631b8fe80 #518 > > [...] > > Call Trace: > > <TASK> > > dump_stack_lvl+0x80/0xa0 > > should_fail_ex+0x4d3/0x5c0 > > [...] > > should_failslab+0xc7/0x130 > > kmem_cache_alloc_noprof+0x73/0x3a0 > > [...] > > mas_alloc_nodes+0x3a3/0x690 > > mas_nomem+0xaa/0x1d0 > > mas_store_gfp+0x515/0xa80 > > [...] > > mmap_region+0xa96/0x2590 > > [...] > > do_mmap+0x71e/0xfe0 > > [...] > > vm_mmap_pgoff+0x17a/0x2f0 > > [...] > > ksys_mmap_pgoff+0x2ee/0x460 > > do_syscall_64+0x68/0x140 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > [...] > > </TASK> > > mmap: unmap-fail: (607) Unable to abort munmap() operation > > ================================================================== > > BUG: KASAN: slab-use-after-free in dec_usb_memory_use_count+0x365/0x430 > > Read of size 8 at addr ffff88810e9ba8b8 by task unmap-fail/607 > > What was this pointer? Should be the "struct usb_memory *usbm". > > > > CPU: 3 UID: 0 PID: 607 Comm: unmap-fail Not tainted 6.12.0-rc3-00013-geca631b8fe80 #518 > > [...] > > Call Trace: > > <TASK> > > dump_stack_lvl+0x66/0xa0 > > print_report+0xce/0x670 > > [...] > > kasan_report+0xf7/0x130 > > [...] > > dec_usb_memory_use_count+0x365/0x430 > > remove_vma+0x76/0x120 > > vms_complete_munmap_vmas+0x447/0x750 > > do_vmi_align_munmap+0x4b9/0x700 > > [...] > > do_vmi_munmap+0x164/0x2e0 > > __vm_munmap+0x128/0x2a0 > > [...] > > __x64_sys_munmap+0x59/0x80 > > do_syscall_64+0x68/0x140 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > [...] > > </TASK> > > > > Allocated by task 607: > > kasan_save_stack+0x33/0x60 > > kasan_save_track+0x14/0x30 > > __kasan_kmalloc+0xaa/0xb0 > > usbdev_mmap+0x1a0/0xaf0 > > mmap_region+0xf6e/0x2590 > > do_mmap+0x71e/0xfe0 > > vm_mmap_pgoff+0x17a/0x2f0 > > ksys_mmap_pgoff+0x2ee/0x460 > > do_syscall_64+0x68/0x140 > > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > > > Freed by task 607: > > kasan_save_stack+0x33/0x60 > > kasan_save_track+0x14/0x30 > > kasan_save_free_info+0x3b/0x60 > > __kasan_slab_free+0x4f/0x70 > > kfree+0x148/0x450 > > vms_clean_up_area+0x188/0x220 > > What line is this? Should be the vma->vm_ops->close(vma) call. (That would call dec_usb_memory_use_count(), which tail-calls kfree(), so the dec_usb_memory_use_count() wouldn't show up in a stack trace.) I don't have this kernel build anymore though, sorry. If you want I'll rebuild and get a properly symbolized trace.