On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote:
> 
> Just a small thing I'll say once, to get it out of my system. No action
> required here, I just want it understood:
> 
> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
> pin_user_pages_remote()"), you would have written that like this:
> 
> "process_vm_writev() is dirtying pages without properly warning the file
> system in advance..."
> 
> Because, there were many callers that were doing this:
> 
>     get_user_pages*()
>     ...use the pages...
> 
>     for_each_page() {
>         set_page_dirty*()
>         put_page()
>     }

Sure, but that's not sufficient when modifying file-backed pages.
Previously, there were only two ways of modifying file-backed pages in
the page cache --- either using the write(2) system call, or when an
mmap'ed page is modified by userspace.  (Rough sketches of both paths
are below.)

In the case of write(2), the file system gets notified before the page
cache is modified by a call to the address space operation's
write_begin() call, and after the page cache is modified, the address
space operation's write_end() call lets the file system know that the
modification is done.  After the write is done, the 30-second writeback
timer is triggered, and in the case of ext4's data=journalling mode,
write_end() is also where we close the ext4 micro-transaction (and
therefore the time between the write_begin() and write_end() calls
needs to be minimal; otherwise this can block other ext4 transactions).

In the case of a user page fault, the file system's page_mkwrite() vm
operation is called, and at that point we will allocate any blocks
needed to back the memory, if necessary (in the case of delayed
allocation, file system space has to be reserved).  The problem here
for remote access is that page_mkwrite() can block, and it can
potentially fail (e.g., with ENOSPC or ENOMEM).  This is also why we
can't just add the page buffers and do the file system allocation in
set_page_dirty(): set_page_dirty() isn't allowed to block.

One approach might be to make all of the pages writeable when
pin_user_pages_remote() is called (also sketched below).  That would
mean that in the case of remote access via process_vm_writev() or RDMA,
all of the blocks would be allocated early.  But that's probably
better, since at that point the userspace code is in a position to
receive the error when setting up the RDMA memory, and we don't want to
be asking the file system to do block allocation when incoming data is
arriving from Infiniband or iWARP.
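To make the write(2) ordering concrete, here's a rough sketch of one
iteration of the buffered write path, loosely modeled on
generic_perform_write().  The signatures are simplified from the real
ones (write_begin() also returns the page locked, etc.), so treat this
as an illustration rather than the in-tree code:

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /* Sketch: one page's worth of a buffered write(2). */
    static ssize_t buffered_write_sketch(struct file *file, loff_t pos,
                                         unsigned len)
    {
            const struct address_space_operations *a_ops =
                    file->f_mapping->a_ops;
            struct page *page;
            void *fsdata;
            int err;

            /*
             * The file system reserves/allocates blocks and (for ext4)
             * starts a journal handle -- the micro-transaction in
             * data=journal mode.  It can fail with -ENOSPC or -ENOMEM
             * here, *before* the page cache is modified.
             */
            err = a_ops->write_begin(file, file->f_mapping, pos, len,
                                     0, &page, &fsdata);
            if (err)
                    return err;

            /* ... copy the user's data into the page cache page ... */

            /*
             * The file system marks the page/buffers dirty and closes
             * the journal handle; this is why the window between
             * write_begin() and write_end() has to be short.
             */
            return a_ops->write_end(file, file->f_mapping, pos, len,
                                    len, page, fsdata);
    }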
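The mmap path, from the file system's point of view, is a
->page_mkwrite() handler along these lines.  This is a minimal sketch
in the style of block_page_mkwrite(); fs_allocate_blocks() is a
hypothetical stand-in for the real work that ext4_page_mkwrite() does:

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* Sketch of a ->page_mkwrite() vm operation. */
    static vm_fault_t page_mkwrite_sketch(struct vm_fault *vmf)
    {
            struct page *page = vmf->page;
            int err;

            lock_page(page);

            /*
             * Allocate (or, with delayed allocation, reserve) the
             * blocks backing this page.  This can block, and it can
             * fail with -ENOSPC or -ENOMEM -- exactly what
             * set_page_dirty() is never allowed to do.
             */
            err = fs_allocate_blocks(page);         /* hypothetical */
            if (err) {
                    unlock_page(page);
                    return vmf_error(err);  /* e.g. VM_FAULT_SIGBUS */
            }

            set_page_dirty(page);
            /* Return with the page locked; the fault path expects it. */
            return VM_FAULT_LOCKED;
    }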
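And the "make the pages writeable at pin time" idea would look roughly
like this at the call site (again just a sketch; FOLL_LONGTERM is what
an RDMA-style user would pass, and the trailing NULLs are the
vmas/locked arguments in the current signature):

    #include <linux/mm.h>

    /*
     * Sketch: force a write fault up front.  If a PTE isn't already
     * writable, FOLL_WRITE makes GUP take the fault path, which calls
     * ->page_mkwrite() and does the block allocation now -- so any
     * -ENOSPC comes back at setup time, not when remote data arrives.
     */
    static long pin_for_remote_write(struct mm_struct *mm,
                                     unsigned long start,
                                     unsigned long nr_pages,
                                     struct page **pages)
    {
            return pin_user_pages_remote(mm, start, nr_pages,
                                         FOLL_WRITE | FOLL_LONGTERM,
                                         pages, NULL, NULL);
    }

Teardown would then pair this with unpin_user_pages_dirty_lock(), which
marks the pages dirty under the page lock before dropping the pin.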
> I see that ext4_warning_inode() has rate limiting, but it doesn't look
> like it's really intended for a per-page rate. It looks like it is
> per-superblock (yes?), so I suspect one instance of this problem, with
> many pages involved, could hit the limit.
> 
> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
> either, but it might be better here. OTOH, I have minimal experience
> with ext4_warning_inode() and maybe it really is just fine with per-page
> failure rates.

With the syzbot reproducer, we're not actually triggering the rate
limiter, since the ext4 warning is only getting hit a bit more than
once every 4 seconds.  And I'm guessing that in the real world, people
aren't actually trying to do remote direct access to file-backed
memory, at least not using ext4, since that's an invitation to a kernel
crash, and we would have gotten user complaints.

If some user actually tries to use process_vm_writev() for realsies, as
opposed to a random fuzzer or a malicious program, we do want to warn
them about the potential data loss, so I'd prefer to warn once for each
inode --- but I'm not convinced that it's worth the effort.

Cheers,

						- Ted