Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

"Theodore Ts'o" <tytso@xxxxxxx> · Wed, 23 Feb 2022 18:31:33 -0500

On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote:
> 
> Just a small thing I'll say once, to get it out of my system. No action
> required here, I just want it understood:
> 
> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
> pin_user_pages_remote()"), you would have written that like this:
> 
> "process_vm_writev() is dirtying pages without properly warning the file
> system in advance..."
> 
> Because, there were many callers that were doing this:
> 
>     get_user_pages*()
>     ...use the pages...
> 
>     for_each_page() {
>             set_page_dirty*()
>             put_page()
>     }

Sure, but that's not sufficient when modifying file-backed pages.
Previously, there was only two ways of modifying file-backed pages in
the page cache --- either using the write(2) system call, or when a
mmaped page is modified by the userspace.

In the case of write(2), the file system gets notified before the page
cache is modified by a call to the address operation's write_begin()
call, and after the page cache is modified, the address operation's
write_end() call lets the file system know that the modification is
done.  After the write is done, the 30 second writeback timer is
triggered, and in the case of ext4's data=journalling mode, we close
the ext4 micro-transation (and therefore the time between write_begin
and write_end calls needs to be minimal); otherwise this can block
ext4 transactions.

In the case of a user page fault, the address operation's
page_mkwrite() is called, and at that point we will allocate any
blocks needed to back memory if necessary (in the case of delayed
allocation, file system space has to get reserved).  The problem here
for remote access is that page_mkwrite() can block, and it can
potentially fail (e.g., with ENOSPC or ENOMEM).  This is also why we
can't just add the page buffers and do the file system allocation in
set_page_dirty(), since set_page_dirty() isn't allowed to block.

One approach might be to make all of the pages writeable when
pin_user_pages_remote() is called.  That that would mean that in the
case of remote access via process_vm_writev or RDMA, all of the blocks
will be allocated early.  But that's probably better since at that
point the userspace code is in a position to receive the error when
setting up the RDMA memory, and we don't want to be asking the file
system to do block allocation when incoming data is coming in from
Infiniband or iWARP.

> I see that ext4_warning_inode() has rate limiting, but it doesn't look
> like it's really intended for a per-page rate. It looks like it is
> per-superblock (yes?), so I suspect one instance of this problem, with
> many pages involved, could hit the limit.
> 
> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
> either, but it might be better here. OTOH, I have minimal experience
> with ext4_warning_inode() and maybe it really is just fine with per-page
> failure rates.

With the syzbot reproducer, we're not actually triggering the rate
limiter, since the ext4 warning is only getting hit a bit more than
once every 4 seconds.  And I'm guessing that in the real world, people
aren't actually trying to do remote direct access to file-backed
memory, at least not using ext4, since that's an invitation to a
kernel crash, and we would have gotten user complaints.  If some user
actually tries to use process_vm_writev for realsies, as opposed to a
random fuzzer or from a malicious program , we do want to warn them
about the potential data loss, so I'd prefer to warn once for each
inode --- but I'm not convinced that it's worth the effort.

Cheers,

						- Ted