[Bug 218011] ext4 root filesystem related hangs on 6.5 kernels

bugzilla-daemon@xxxxxxxxxx · Sun, 15 Oct 2023 19:06:31 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=218011

Theodore Tso (tytso@xxxxxxx) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tytso@xxxxxxx

--- Comment #6 from Theodore Tso (tytso@xxxxxxx) ---
It would be really nice to get a translation from the stack trace offsets to
line numbers, but what appears to be happening is that we're starting a journal
commit, and to complete the journal, since we are in the default data=ordered
mode, we call ext4_journalled_submit_inode_data_buffers(), which in turn calls
write_cache_pages() to flush out modified data blocks associated with an inode
which had newly allocated blocks (so that we don't accidentally expose stale
data blocks if there is a crash, which is a guarantee of data=ordered mode).

The write_cache_pages() function is then calling some function in mm/filemap.c
(this is where a line number translation would be handy), which calls
folio_wait_bit_common(), which presumably is waiting for some memory folio
which is undergoing writeback, or is otherwise busy, to become available. This
then calls io_schedule() because we're waiting for some I/O to complete, and
that I/O apparently never completes, thus stalling the jbd2 commit operation.
All of the other processes which are trying to make changes to the file system
are then waiting for the commit to complete, which explains all of the other
stack traces.
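
For the line number translation mentioned above, here is a minimal sketch of
one way to do it, assuming you have the kernel source tree and a vmlinux built
with debug info for the exact kernel that hung. It pulls the symbol+offset/size
entries out of a pasted stack trace and prints one scripts/faddr2line command
per frame. The helper names and the default "vmlinux" path are placeholders of
mine, not something from the bug report; frames that live in a module would
presumably need the module's object file instead of vmlinux.

```python
import re

# Matches stack trace entries of the form "symbol+0xOFFSET/0xSIZE",
# e.g. the "func+0x.../0x..." tokens printed in a hung task report.
FRAME_RE = re.compile(r"\b([A-Za-z_][\w.]*)\+(0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)")


def frames(trace_text):
    """Return (symbol, offset, size) tuples in the order they appear."""
    return [
        (m.group(1), int(m.group(2), 16), int(m.group(3), 16))
        for m in FRAME_RE.finditer(trace_text)
    ]


def faddr2line_commands(trace_text, vmlinux="vmlinux"):
    """Build one faddr2line invocation per frame (hypothetical helper).

    The object file path is an assumption; point it at the debug-info
    build that matches the running kernel.
    """
    return [
        f"scripts/faddr2line {vmlinux} {sym}+{off:#x}/{size:#x}"
        for sym, off, size in frames(trace_text)
    ]
```

Calling faddr2line_commands() on the text of the trace and running the
resulting commands from the top of the kernel tree should give file:line output
for each frame. The scripts/decode_stacktrace.sh helper in the kernel tree can
do much the same thing for a whole log in one pass.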

The question is why is this happening on your system? It could be because of
some kind of missed I/O completion interrupt, or some other problem in the
block device layer or NVMe driver, but normally if that were the case, there
should have been some kind of kernel log messages from those parts of the I/O
stack. Were there any that you could see (that perhaps were excerpted out of
the bug report, since "obviously" it was assumed this was an ext4 problem, as
opposed to ext4 simply being an innocent victim of problems lower down in the
storage stack)?
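
If it helps with digging through the full log, here is a rough sketch of a
filter for messages that might have come from lower in the storage stack. The
keyword list is purely my guess at what could be relevant (it is not an
authoritative list of what the block layer or NVMe driver prints), so treat a
clean result as "nothing matched these patterns" rather than "nothing went
wrong".

```python
import re

# Assumed keywords only; extend to taste. None of these are guaranteed
# to match what a given driver actually logs.
PATTERNS = [r"\bnvme\d*", r"I/O error", r"timeout", r"timed out"]
STORAGE_RE = re.compile("|".join(PATTERNS), re.IGNORECASE)


def storage_messages(log_lines):
    """Return the log lines that match any of the assumed keywords."""
    return [line.rstrip("\n") for line in log_lines if STORAGE_RE.search(line)]
```

Feeding it the lines of a saved dmesg or journal dump (for example
storage_messages(open("dmesg.txt")), where dmesg.txt is a file name I made up)
gives a short list to scan around the time of the hang.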

The other question that might be worth asking is what sort of workload does
your server run, and how might this be different from what other users might be
doing, or what we exercise with our regression tests?

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.