On 2/17/22 20:08, Theodore Ts'o wrote:
On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote:
Yes. And looking at the pair of backtraces below, this looks very much
like another aspect of the "get_user_pages problem" [1], originally
described in Jan Kara's 2018 email [2].
Hmm... I just posted my analysis, which tracks with yours; but I had
forgotten about Jan's 2018 e-mail on the matter.
I'm getting close to posting an RFC for the direct IO conversion to
FOLL_PIN, but even after that, various parts of the kernel (reclaim,
filesystems/block layer) still need to be changed so as to use
page_maybe_dma_pinned() to help avoid this problem. There's a bit
more than that, actually.
The challenge is that fixing this "the right away" is probably not
something we can backport into an LTS kernel, whether it's 5.15 or
5.10... or 4.19.
The only thing which can probably survive getting backported is
something like this. It won't make the right thing happen if someone
is trying to RDMA or call process_vm_writev() into a file-backed
memory region --- but I'm not sure I care. Certainly if the goal is
to make Android kernels, I'm pretty sure they are't either using RDMA,
and I suspect they are probably masking out the process_vm_writev(2)
system call (at least, for Android O and newer). So if the goal is to
just to close some Syzbot bugs, what do folks think about this?
- Ted
Hi Ted!
This seems reasonable...-ish. Although this could turn into a pretty
grand game of whack-a-mole, one filesystem at a time. :)
commit 7711b1fda6f7f04274fa1cba6f092410262b0296
Author: Theodore Ts'o <tytso@xxxxxxx>
Date: Thu Feb 17 22:54:03 2022 -0500
ext4: work around bugs in mm/gup.c that can cause ext4 to BUG()
[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance (or faulting in the file data if the page
Just a small thing I'll say once, to get it out of my system. No action
required here, I just want it understood:
Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
pin_user_pages_remote()"), you would have written that like this:
"process_vm_writev() is dirtying pages without properly warning the file
system in advance..."
Because, there were many callers that were doing this:
get_user_pages*()
...use the pages...
for_each_page() {
set_page_dirty*()
put_page()
}
anyway, moving on smartly...
is not yet in the page cache). This was noted by Jan Kara in 2018[1]
and more recently has resulted in bug reports by Syzbot in various
Android kernels[2].
Fixing this for real is non-trivial, and will never be backportable
into stable kernels. So this is a simple workaround that stops the
kernel from BUG()'ing. The changed pages will not be properly written
back, but given that the current gup code is missing the "read" in
"read-modify-write", the dirty page in the page cache might contain
corrupted data anyway.
[1] https://www.spinics.net/lists/linux-mm/msg142700.html
(Sorry my earlier response mangled this link. I've manually fixed it
here, and am working with our IT to get the root cause addressed.)
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@xxxxxxxxxx
Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@xxxxxxxxxxxxxxxxxxxxxxxxx
Reported-by: Lee Jones <lee.jones@xxxxxxxxxx>
Signed-off-by: Theodore Ts'o <tytso@xxxxxxx>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..3b2f336a90d1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
else
len = PAGE_SIZE;
+ /* Should never happen but for buggy gup code */
+ if (!page_has_buffers(page)) {
+ ext4_warning_inode(inode,
+ "page %lu does not have buffers attached", page->index);
I see that ext4_warning_inode() has rate limiting, but it doesn't look
like it's really intended for a per-page rate. It looks like it is
per-superblock (yes?), so I suspect one instance of this problem, with
many pages involved, could hit the limit.
Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
either, but it might be better here. OTOH, I have minimal experience
with ext4_warning_inode() and maybe it really is just fine with per-page
failure rates.
thanks,
--
John Hubbard
NVIDIA
+ ClearPageDirty(page);
+ unlock_page(page);
+ return 0;
+ }
+
page_bufs = page_buffers(page);
/*
* We cannot do block allocation or other extent handling in this
@@ -2594,6 +2603,14 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));
+ /* Should never happen but for buggy gup code */
+ if (!page_has_buffers(page)) {
+ ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index);
+ ClearPageDirty(page);
+ unlock_page(page);
+ continue;
+ }
+
if (mpd->map.m_len == 0)
mpd->first_page = page->index;
mpd->next_page = page->index + 1;