https://bugzilla.kernel.org/show_bug.cgi?id=218011

Theodore Tso (tytso@xxxxxxx) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tytso@xxxxxxx

--- Comment #6 from Theodore Tso (tytso@xxxxxxx) ---
It would be really nice to get a translation from the stack trace offsets to line numbers, but what appears to be happening is that we're starting a journal commit, and to complete the journal, since we are in the default data=ordered mode, we call ext4_journalled_submit_inode_data_buffers(), which in turn calls write_cache_pages() to flush out modified data blocks associated with an inode which had newly allocated blocks (so that we don't accidentally expose stale data blocks if there is a crash, which is a guarantee of data=ordered mode).

The write_cache_pages() function is then calling some function in mm/filemap.c (this is where a line number translation would be handy), which calls folio_wait_bit_common(), which presumably is waiting for some memory folio which is undergoing writeback, or is otherwise busy, to complete. This then calls io_schedule(), because we're waiting for some I/O to complete, and this apparently never completes. That stalls the jbd2 commit operation, and then all of the other processes which are trying to make changes to the file system are waiting for the commit to complete, leading to all of the other stack traces.

The question is why is this happening on your system? It could be because of some kind of missed I/O completion interrupt, or some other problem in the block device layer or NVMe driver, but normally if that were the case, there should have been some kind of kernel log messages from those parts of the I/O stack. Were there any that you could see (that perhaps were excerpted out in the bug report, since "obviously" it was assumed this was an ext4 problem, as opposed to ext4 simply being an innocent victim of problems lower down on the storage stack)?

The other question that might be worth asking is what sort of workload does your server run, and how might this be different from what other users might be doing, or what we exercise with our regression tests?
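
On the line number translation mentioned at the top: as a rough sketch, something like the following could pull the symbol+offset/size entries out of a pasted trace so they can be handed to the kernel's scripts/faddr2line along with the matching vmlinux (or the module's .ko for frames tagged with a module name). The helper names are made up for illustration, and the offsets in SAMPLE are invented placeholders, not the values from this bug report.

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE" form used in kernel stack traces,
# optionally followed by a "[module]" tag.
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+(?P<offset>0x[0-9a-fA-F]+)/(?P<size>0x[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[\w-]+)\])?"
)

# Placeholder trace: the function names come from the discussion above,
# but every timestamp, offset and size here is invented for the example.
SAMPLE = """\
[ 1234.567890]  io_schedule+0x46/0x70
[ 1234.567891]  folio_wait_bit_common+0x13d/0x350
[ 1234.567892]  ? filemap_get_folios_tag+0x1a/0x2b0
[ 1234.567893]  write_cache_pages+0x5a/0x3e0
[ 1234.567894]  jbd2_journal_commit_transaction+0x3c2/0x1a10 [jbd2]
"""


def parse_frames(trace_text):
    """Return one dict per stack frame found in trace_text."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue  # not a stack frame line
        frames.append({
            "symbol": m.group("symbol"),
            "offset": m.group("offset"),
            "size": m.group("size"),
            "module": m.group("module"),  # None for built-in code
            # A leading "? " marks a frame the unwinder is unsure about.
            "unreliable": "? " in line[:m.start()],
        })
    return frames


def faddr2line_args(frames):
    """Format frames as func+offset/size arguments for scripts/faddr2line."""
    return [f"{f['symbol']}+{f['offset']}/{f['size']}" for f in frames]


if __name__ == "__main__":
    for arg in faddr2line_args(parse_frames(SAMPLE)):
        print(arg)
```

Each printed argument could then be passed as ./scripts/faddr2line vmlinux <arg>, assuming a vmlinux with debug info from the exact kernel build that produced the trace.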