Hello,
Since moving to kernel 5.15, over the past eight months we have seen a
total of 12 kernel panics across a fleet of 1000 KVM/QEMU machines. They
all look like this:
kernel BUG at fs/ext4/inode.c:1914
The problem appeared almost immediately after we switched from kernel
4.14 to 5.15. Userland activity stayed more or less the same across the
switch, so we are very confident that it is not the root cause. The
kernel function which triggers the BUG is __ext4_journalled_writepage().
In 5.15 the code for __ext4_journalled_writepage() in "fs/ext4/inode.c"
is the same as in the current kernel "master". The line where the BUG is
triggered is:
struct buffer_head *page_bufs = page_buffers(page)
The definition of "page_buffers(page)" in "include/linux/buffer_head.h"
hasn't changed since 4.14, so no difference here. This is where the
actual "kernel BUG" event is triggered:
/* If we *know* page->private refers to buffer_heads */
#define page_buffers(page)					\
	({							\
		BUG_ON(!PagePrivate(page));			\
		((struct buffer_head *)page_private(page));	\
	})
#define page_has_buffers(page)	PagePrivate(page)
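To make the failure mode concrete, here is a minimal user-space sketch
of what the macro checks. It is not kernel code: struct mock_page and
all mock_* functions are made-up stand-ins, assert() plays the role of
BUG_ON(), and the "checked" variant is only a hypothetical illustration
of a non-fatal lookup, not something that exists in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct buffer_head. */
struct buffer_head {
	int placeholder;
};

/* Hypothetical stand-in for struct page, reduced to what the macro uses. */
struct mock_page {
	bool private_flag;       /* models the PagePrivate() flag */
	uintptr_t private_data;  /* models page_private(page) */
};

/* Models page_has_buffers(): true only when the private flag is set. */
static bool mock_page_has_buffers(const struct mock_page *page)
{
	return page->private_flag;
}

/* Models page_buffers(): assert() stands in for BUG_ON(). */
static struct buffer_head *mock_page_buffers(const struct mock_page *page)
{
	assert(mock_page_has_buffers(page));
	return (struct buffer_head *)page->private_data;
}

/* Hypothetical non-fatal variant: returns NULL instead of aborting. */
static struct buffer_head *mock_page_buffers_checked(const struct mock_page *page)
{
	if (!mock_page_has_buffers(page))
		return NULL;
	return (struct buffer_head *)page->private_data;
}
```

Calling mock_page_buffers() on a page whose private_flag is false aborts
the program, which is the user-space analogue of the panic we see. The
checked variant returns NULL in that case, so a caller could bail out.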
Initially I thought that the issue had already been discussed here:
https://lore.kernel.org/all/Yg0m6IjcNmfaSokM@xxxxxxxxxx/
But that seems to be a different (already solved) problem: Theodore Ts'o
made a quick fix which simply reports the rare occurrence and continues.
That commit is already in 5.15 (and in the latest kernel), so it does
not help in our case:
https://github.com/torvalds/linux/commit/cc5095747edfb054ca2068d01af20be3fcc3634f
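As I understand it, the "report and continue" approach amounts to
something like the user-space sketch below. This is based only on the
description above and does not reproduce the actual patch: struct
demo_page and demo_writepage() are hypothetical names, and the return
values are my own convention for this illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space model of a page; not a kernel structure. */
struct demo_page {
	bool has_buffers;  /* models PagePrivate(page) */
	bool dirty;        /* models the page's dirty state */
};

/*
 * Models a writeback step that tolerates a dirty page without buffers:
 * it logs a warning, drops the dirty state, and lets writeback move on
 * instead of crashing.
 * Returns 1 if the page was "written", 0 if it was skipped.
 */
static int demo_writepage(struct demo_page *page)
{
	if (!page->has_buffers) {
		fprintf(stderr, "warning: dirty page without buffers, skipping\n");
		page->dirty = false;
		return 0;
	}

	/* Normal path: pretend the buffers were written out. */
	page->dirty = false;
	return 1;
}
```

In this model a page without buffers is only reported and skipped. Our
panic happens in a path that still goes through page_buffers() and so
hits the BUG_ON() unconditionally.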
Back to the problem! 99% of the difference between 4.14 and the latest
kernel for __ext4_journalled_writepage() in "fs/ext4/inode.c" comes from
the following commit:
https://github.com/torvalds/linux/commit/5c48a7df91499e371ef725895b2e2d21a126e227
Would it be safe for us to revert this patch on the latest 5.15 kernel,
so that we can confirm whether doing so resolves the issue for us?
Best regards.
--Ivan