On Wed, Feb 16, 2022 at 04:31:36PM +0000, Lee Jones wrote:
> 
> After recently receiving a bug report from Syzbot [0] which was raised
> specifically against the Android v5.10 kernel, I spent some time
> trying to get to the crux.  Firstly I reproduced the issue on the
> reported kernel, then did the same using the latest release kernel
> v5.16.
> 
> The full kernel trace can be seen below at [1].

Lee, thanks for your work in trimming down the syzkaller reproducer.
The moral equivalent of what it is doing (except each system call is
done in a separate thread, with synchronization so each gets executed
in order, but perhaps on a different CPU) is:

	int dest_fd, src_fd, truncate_fd, mmap_fd;
	pid_t tid;
	struct iovec local_iov, remote_iov;

	dest_fd = open("./bus", O_RDWR|O_CREAT|O_NONBLOCK|O_SYNC|
		       O_DIRECT|O_LARGEFILE|O_NOATIME, 0);
	src_fd = openat(AT_FDCWD, "/bin/bash", O_RDONLY);
	truncate_fd = syscall(__NR_open, "./bus",
			      O_RDWR|O_CREAT|O_SYNC|O_NOATIME|O_ASYNC);
	ftruncate(truncate_fd, 0x2008002ul);
	mmap((void *) 0x20000000ul /* addr */, 0x600000ul /* length */,
	     PROT_WRITE|PROT_EXEC|PROT_SEM|0x7ffff0,
	     MAP_FIXED|MAP_SHARED, mmap_fd, 0);
	sendfile(dest_fd, src_fd, 0 /* offset */, 0x80000005ul /* count */);
	local_iov.iov_base = (void *) 0x2034afa4;
	local_iov.iov_len  = 0x1f80;
	remote_iov.iov_base = (void *) 0x20000080;
	remote_iov.iov_len  = 0x2034afa5;
	process_vm_writev(gettid(), &local_iov, 1, &remote_iov, 1, 0);
	sendfile(dest_fd, src_fd, 0 /* offset */, 0x1f000005ul /* count */);

This is then executed over and over again, with the file descriptors
closed at the end of each loop, so the file is reopened each time.

So basically, we have a scratch file which is initialized using a
sendfile with O_DIRECT.  The scratch file is also mmap'ed into the
process's address space, and we then *modify* that mmap'ed region
using process_vm_writev() --- and this is where the problem starts.

process_vm_writev() uses [un]pin_user_pages_remote(), which is the
same interface used for RDMA.  But it's not clear this is ever
supposed to work for memory in an mmap'ed region backed by a file.
pin_user_pages_remote() appears to assume that it is an anonymous
region, since the get_user_pages functions in mm/gup.c don't call
readpage() to read data into any pages that might not be mapped in.
They also don't follow the mm / file system protocol of calling the
file system's write_begin() or page_mkwrite() before modifying a
page.  So when process_vm_writev() calls unpin_user_pages_remote(),
this tries to dirty the page; but since page_mkwrite() or
write_begin() hasn't been called, buffers haven't been attached to
the page, and that triggers the following ext4 WARN_ON:

[ 1451.095939] WARNING: CPU: 1 PID: 449 at fs/ext4/inode.c:3565 ext4_set_page_dirty+0x48/0x50
	...
[ 1451.139877] Call Trace:
[ 1451.140833]  <TASK>
[ 1451.141889]  folio_mark_dirty+0x2f/0x60
[ 1451.143408]  set_page_dirty_lock+0x3e/0x60
[ 1451.145130]  unpin_user_pages_dirty_lock+0x108/0x130
[ 1451.147405]  process_vm_rw_single_vec.constprop.0+0x1b9/0x260
[ 1451.150135]  process_vm_rw_core.constprop.0+0x156/0x280
[ 1451.159576]  process_vm_rw+0xc4/0x110

Then when ext4_writepages() gets called, we trigger the BUG because
buffers haven't been attached to the page.

We can certainly avoid the BUG by checking to see if buffers are
attached first, and if not, skip writing the page, trigger a WARN_ON,
and then forcibly clear the page's dirty flag so we don't permanently
leak memory and so the file system can still be unmounted.
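Roughly what I have in mind is something like this in ext4's
writepage path (an untested sketch, just to illustrate the idea; the
exact placement would need more thought):

	if (!page_has_buffers(page)) {
		/*
		 * The page was dirtied without going through
		 * write_begin() or page_mkwrite(), so no buffers were
		 * ever attached.  Warn, clear the dirty bit so the
		 * page isn't leaked, and skip writing it back.
		 */
		WARN_ON_ONCE(1);
		ClearPageDirty(page);
		unlock_page(page);
		return 0;
	}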
(Note: we can't fix the problem by attaching the buffers in
set_page_dirty(), since it is occasionally called under spinlocks and
without the page being locked; that means we can't do any kind of
allocation, so ix-nay on calling create_empty_buffers(), which calls
alloc_buffer_head().)

BUT, that is really papering over the problem, since it's not clear
it's valid to try to use get_user_pages() and friends (including
[un]pin_user_pages_remote()) on file-backed memory.  And if it is
supposed to be valid, then mm/gup.c needs to first call readpage() if
the page is not present, so that if process_vm_writev() is only
modifying a few bytes in the mmap'ed region, the page gets faulted in
first; and then mm/gup.c needs to inform the file system, via
page_mkwrite(), so that if the pinned region of the file is not yet
allocated, whatever needs to happen to allocate space has taken
place.  (And by the way, that means that pin_user_pages_remote() may
need to return ENOSPC if there is no free space in the file system,
and hence ENOSPC may need to be reflected all the way back to
process_vm_writev().)

Alternatively, if we don't expect process_vm_writev() to work on
file-backed memory, perhaps it and pin_user_pages_remote() should
return some kind of error?

- Ted