Ted Ts'o explains, in the linux-ext4 thread [1], that currently data=journal + journal_checksum + mmap is not handled correctly, and outlines the changes to fix it (+ items/breaks for clarity): """ The fix is going to have to involve - fixing __ext4_journalled_writepage() to call set_page_writeback() before it unlocks the page, - adding a list of pages under data=journalled writeback which is attached to the transaction handle, - have the jbd2 commit hook call end_page_writeback() on all of these pages, - and then in the places where ext4 calls wait_for_stable_page() or grab_cache_page_write_begin(), we need to add: if (ext4_should_journal_data(inode)) wait_on_page_writeback(page); It's all relatively straightforward except for the part where we have to attach a list of pages to the currently running transaction. That will require adding some plumbing into the jbd2 layer. """ This is my first adventure into ext4, and after enough struggle (and try harder!) with the first deadlock described in PATCH 02, and address it to find the other deadlock (also described there), and address it, I guess it wasn't that straighforward (for me :) but absolutely very good learning! (Granted, I now understand that the fix outlined is indeed the way it is supposed to work and done in general, weren't it for the need to unlock_page() before ext4_journal_start(), as that compromised the base of writeback patterns as far as I learned. Hope I didn't get that too wrong.) Summary: ------- The patchset is a bit long with 11 patches, but I tried to get changes tiny to help with review, and better document how each of them work, why and how this or that is done. It's RFC as I would like to ask for suggestions/feedback, if at all possible. Patch 01 and 02 implement the outlined fix, with a few changes (fix first deadlock; use existing plumbing in jbd2 as the list.) Patch 03 fix a seconds-delay on msync(). Patch 04 introduces helpers to handle the second deadlock. Patch 05-11 handle the second deadlock (three of these patches, namely 07, 09 and 10 are changes not specific for data=journal, affecting other journaling modes, so it's not on their subject.) The order of the patches intentionally allow the issues on 03 and 05-11 to occur (while putting the core patches first), so to allow issues to be reproduced/regression tested one by one, as needed. It can be changed, of course, so to enable actual writeback changes in the last patch (when issues are patched.) Testing: ------- This has been built and regression tested on next-20200417. (Also rebased and build tested on next-20200423 / "today"). On xfstests (commit b2faf204) quick group (and excluding generic 430/431/434 which always hung): no regressions w/ data=ordered (default) nor data=journal,journal_checksum. With data=ordered: (on both original and patched kernel) Failures: generic/223 generic/465 generic/553 generic/554 generic/565 generic/570 With data=journal,journal_checksum: (likewise) Failures: ext4/044 generic/223 generic/441 generic/553 generic/554 generic/565 generic/570 The test-case for the problem (and deadlocks) and further stress testing is stress-ng (with 512 workers on 16 vCPUs) $ sudo mount -o data=journal,journal_checksum $DEV $MNT $ cd $MNT $ sudo stress-ng --mmap 512 --mmap-file --timeout 1w To reproduce the problem (without patchset), run it a bit and crash the kernel (to cause unclean shutdown) w/ sysrq, and mount the device again (it should fail / need e2fsck): Original: [ 27.660063] JBD2: Invalid checksum recovering data block 79449 in log [ 27.792371] JBD2: recovery failed [ 27.792854] EXT4-fs (vdc): error loading journal mount: /tmp/ext4: can't read superblock on /dev/vdc. Patched: [ 33.111230] EXT4-fs (vdc): 512 orphan inodes deleted [ 33.111961] EXT4-fs (vdc): recovery complete [ 33.114590] EXT4-fs (vdc): mounted filesystem with journalled data mode. Opts: data=journal,journal_checksum RFC / Questions: --------------- 0) Usage of ext4_inode_info.i_datasync_tid for checks We rely on the struct ext4_inode_info.i_datasync_tid field (set by __ext4_journalled_writepage() and others) to check it against the running transaction. Of course, other sites set it too, and it might be that some of our checks return false positives then (should be fine, just less efficient.) To avoid such false positives, we could add another field to that structure, exclusively for this, but that is more 8 bytes (pointer) for inodes and even on non-data=journal cases.. so it didn't seem good enough reason, but if that is better/worth it for efficiency reasons (speed, in this case, vs. memory consumption) we could do it. Maybe there are other ideas/better ways to do it? 1) Usage of ext4_force_commit() in ext4_writepages() Patch 03 describes/fixes an issue where the underlying problem is, if __ext4_journalled_writepage() does set_page_writeback() but no journal commit is triggered, wait_on_page_writeback() may wait up to seconds until the periodic journal commit happens. The solution there, to fix the impact on msync(), is to just call ext4_force_commit() (as it's done anyway in ext4_sync_file()), on ext4_writepages(). Is that a good enough solution? Other ideas? 2) Similar issue (unhandled) in ext4_writepage() The other, related question is, what about direct callers of ext4_writepage() that obviously do not use ext4_writepages() ? (e.g., pageout() and writeout(); write_one_page() not used.) Those are memory-cleasing writeback, which should not wait, however, as mentioned in that patch, if its writeback goes on for seconds and an data-integrity writeback/system call comes in, it is delayed/wait_on_page_writeback() that long. So, ideally, we should be trying to kick a journal commit? It looks like ext4_handle_sync() is not the answer, since it waits for commit to finish, and pageout() is called on a list of pages by shrinking. So, not effective to block on each one of them. We might not want to start anything right now, actually, since the memory-cleasing writeback can be happening on memory pressure scenarios, right? But would need to do something, to ensure that future wait_on_page_writeback() do not wait too long. Maybe the answer is something similar to jbd2 sync transaction batching (used by ext4_handle_sync()), but in *async* fashion, say, possibly implemented/batching in the jbd2 worker thread. Is that reasonable? ... Any comments/feedback/reviews are very appreciated. Thank you in advance, Mauricio [1] https://lore.kernel.org/linux-ext4/20190830012236.GC10779@xxxxxxx/ Mauricio Faria de Oliveira (11): ext4: data=journal: introduce struct/kmem_cache ext4_journalled_wb_page/_cachep ext4: data=journal: handle page writeback in __ext4_journalled_writepage() ext4: data=journal: call ext4_force_commit() in ext4_writepages() for msync() ext4: data=journal: introduce helpers for journalled writeback deadlock ext4: data=journal: prevent journalled writeback deadlock in __ext4_journalled_writepage() ext4: data=journal: prevent journalled writeback deadlock in ext4_write_begin() ext4: grab page before starting transaction handle in ext4_convert_inline_data_to_extent() ext4: data=journal: prevent journalled writeback deadlock in ext4_convert_inline_data_to_extent() ext4: grab page before starting transaction handle in ext4_try_to_write_inline_data() ext4: deduplicate code with error legs in ext4_try_to_write_inline_data() ext4: data=journal: prevent journalled writeback deadlock in ext4_try_to_write_inline_data() fs/ext4/ext4_jbd2.h | 88 +++++++++++++++++++++++++ fs/ext4/inline.c | 153 +++++++++++++++++++++++++++++++------------- fs/ext4/inode.c | 137 +++++++++++++++++++++++++++++++++++++-- fs/ext4/page-io.c | 11 ++++ 4 files changed, 341 insertions(+), 48 deletions(-) -- 2.20.1