On 2/23/22 3:31 PM, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote:
>>
>> Just a small thing I'll say once, to get it out of my system. No action
>> required here, I just want it understood:
>>
>> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
>> pin_user_pages_remote()"), you would have written that like this:
>>
>> "process_vm_writev() is dirtying pages without properly warning the file
>> system in advance..."
>>
>> Because there were many callers that were doing this:
>>
>>     get_user_pages*()
>>     ...use the pages...
>>
>>     for_each_page() {
>>         set_page_dirty*()
>>         put_page()
>>     }
>
> Sure, but that's not sufficient when modifying file-backed pages.

We are in complete agreement. I was just trying to hint (too subtly)
that the problem is ages old, and after doing all this work on the
prerequisites to solving it (pin_user_pages() is a prerequisite), it
kind of bothers me to see a commit with "work around bugs in mm/gup.c"
in the title. That's all. :)

> Previously, there were only two ways of modifying file-backed pages in
> the page cache: either using the write(2) system call, or when an
> mmap'ed page is modified by userspace.
>
> In the case of write(2), the file system gets notified before the page
> cache is modified by a call to the address operation's write_begin()
> call, and after the page cache is modified, the address operation's
> write_end() call lets the file system know that the modification is
> done. After the write is done, the 30-second writeback timer is
> triggered, and in the case of ext4's data=journalling mode, we close
> the ext4 micro-transaction (and therefore the time between write_begin
> and write_end calls needs to be minimal); otherwise this can block
> ext4 transactions.
> In the case of a user page fault, the address operation's
> page_mkwrite() is called, and at that point we will allocate any
> blocks needed to back the memory, if necessary (in the case of delayed
> allocation, file system space has to get reserved). The problem here
> for remote access is that page_mkwrite() can block, and it can
> potentially fail (e.g., with ENOSPC or ENOMEM). This is also why we
> can't just add the page buffers and do the file system allocation in
> set_page_dirty(), since set_page_dirty() isn't allowed to block.

Oh yes. Believe me, I am well-versed in that story! But it's always
nice to hear it again, especially from a file system maintainer. Each
time there is something new, such as the micro-transaction detail.

> One approach might be to make all of the pages writeable when
> pin_user_pages_remote() is called. That would mean that in the case
> of remote access via process_vm_writev or RDMA, all of the blocks
> would be allocated early. But that's probably better, since at that
> point the userspace code is in a position to receive the error when
> setting up the RDMA memory, and we don't want to be asking the file
> system to do block allocation when incoming data is arriving from
> Infiniband or iWARP.

So it sounds like the file lease idea, yes? I'm hoping that that is
still viable. I still think it's a good LSF/MM topic to work through.

>> I see that ext4_warning_inode() has rate limiting, but it doesn't look
>> like it's really intended for a per-page rate. It looks like it is
>> per-superblock (yes?), so I suspect one instance of this problem, with
>> many pages involved, could hit the limit.
>>
>> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
>> either, but it might be better here. OTOH, I have minimal experience
>> with ext4_warning_inode() and maybe it really is just fine with per-page
>> failure rates.
> With the syzbot reproducer, we're not actually triggering the rate
> limiter, since the ext4 warning is only getting hit a bit more than
> once every 4 seconds. And I'm guessing that in the real world, people
> aren't actually trying to do remote direct access to file-backed
> memory, at least not using ext4, since that's an invitation to a
> kernel crash, and we would have gotten user complaints.

Actually... I can confirm that real customers really are doing *exactly*
that! Despite the kernel crashes -- because the crashes don't always
happen unless you have a large (supercomputer-sized) installation. And
even then it is not always root-caused properly. I guess that goes in
the "weird but true" category. :)

> If some user actually tries to use process_vm_writev for realsies, as
> opposed to a random fuzzer or a malicious program, we do want to warn
> them about the potential data loss, so I'd prefer to warn once for
> each inode --- but I'm not convinced that it's worth the effort.
>
> Cheers,
>
>    - Ted

thanks,
--
John Hubbard
NVIDIA