Hi, On Thu, Apr 23, 2020 at 8:37 PM Mauricio Faria de Oliveira <mfo@xxxxxxxxxxxxx> wrote: [snip] > Summary: > ------- > > The patchset is a bit long with 11 patches, but I tried to get > changes tiny to help with review, and better document how each > of them work, why and how this or that is done. It's RFC as I > would like to ask for suggestions/feedback, if at all possible. If at all possible, may this patchset have at least a cursory look? I'm aware it's been a busy period for some of you, so I just wanted to friendly ping on it, in case this got buried deep under other stuff. Thanks! > > Patch 01 and 02 implement the outlined fix, with a few changes > (fix first deadlock; use existing plumbing in jbd2 as the list.) > > Patch 03 fix a seconds-delay on msync(). > > Patch 04 introduces helpers to handle the second deadlock. > > Patch 05-11 handle the second deadlock (three of these patches, > namely 07, 09 and 10 are changes not specific for data=journal, > affecting other journaling modes, so it's not on their subject.) > > The order of the patches intentionally allow the issues on 03 > and 05-11 to occur (while putting the core patches first), so > to allow issues to be reproduced/regression tested one by one, > as needed. It can be changed, of course, so to enable actual > writeback changes in the last patch (when issues are patched.) > > > Testing: > ------- > > This has been built and regression tested on next-20200417. > (Also rebased and build tested on next-20200423 / "today"). > > On xfstests (commit b2faf204) quick group (and excluding > generic 430/431/434 which always hung): no regressions w/ > data=ordered (default) nor data=journal,journal_checksum. > > With data=ordered: (on both original and patched kernel) > > Failures: generic/223 generic/465 generic/553 generic/554 generic/565 generic/570 > > With data=journal,journal_checksum: (likewise) > > Failures: ext4/044 generic/223 generic/441 generic/553 generic/554 generic/565 generic/570 > > The test-case for the problem (and deadlocks) and further > stress testing is stress-ng (with 512 workers on 16 vCPUs) > > $ sudo mount -o data=journal,journal_checksum $DEV $MNT > $ cd $MNT > $ sudo stress-ng --mmap 512 --mmap-file --timeout 1w > > To reproduce the problem (without patchset), run it a bit > and crash the kernel (to cause unclean shutdown) w/ sysrq, > and mount the device again (it should fail / need e2fsck): > > Original: > > [ 27.660063] JBD2: Invalid checksum recovering data block 79449 in log > [ 27.792371] JBD2: recovery failed > [ 27.792854] EXT4-fs (vdc): error loading journal > mount: /tmp/ext4: can't read superblock on /dev/vdc. > > Patched: > > [ 33.111230] EXT4-fs (vdc): 512 orphan inodes deleted > [ 33.111961] EXT4-fs (vdc): recovery complete > [ 33.114590] EXT4-fs (vdc): mounted filesystem with journalled data mode. Opts: data=journal,journal_checksum > > > RFC / Questions: > --------------- > > 0) Usage of ext4_inode_info.i_datasync_tid for checks > > We rely on the struct ext4_inode_info.i_datasync_tid field > (set by __ext4_journalled_writepage() and others) to check > it against the running transaction. Of course, other sites > set it too, and it might be that some of our checks return > false positives then (should be fine, just less efficient.) > > To avoid such false positives, we could add another field > to that structure, exclusively for this, but that is more > 8 bytes (pointer) for inodes and even on non-data=journal > cases.. so it didn't seem good enough reason, but if that > is better/worth it for efficiency reasons (speed, in this > case, vs. memory consumption) we could do it. > > Maybe there are other ideas/better ways to do it? > > 1) Usage of ext4_force_commit() in ext4_writepages() > > Patch 03 describes/fixes an issue where the underlying problem is, > if __ext4_journalled_writepage() does set_page_writeback() but no > journal commit is triggered, wait_on_page_writeback() may wait up > to seconds until the periodic journal commit happens. > > The solution there, to fix the impact on msync(), is to just call > ext4_force_commit() (as it's done anyway in ext4_sync_file()), on > ext4_writepages(). > > Is that a good enough solution? Other ideas? > > 2) Similar issue (unhandled) in ext4_writepage() > > The other, related question is, what about direct callers of > ext4_writepage() that obviously do not use ext4_writepages() ? > (e.g., pageout() and writeout(); write_one_page() not used.) > > Those are memory-cleasing writeback, which should not wait, > however, as mentioned in that patch, if its writeback goes > on for seconds and an data-integrity writeback/system call > comes in, it is delayed/wait_on_page_writeback() that long. > > So, ideally, we should be trying to kick a journal commit? > > It looks like ext4_handle_sync() is not the answer, since > it waits for commit to finish, and pageout() is called on > a list of pages by shrinking. So, not effective to block > on each one of them. > > We might not want to start anything right now, actually, > since the memory-cleasing writeback can be happening on > memory pressure scenarios, right? But would need to do > something, to ensure that future wait_on_page_writeback() > do not wait too long. > > Maybe the answer is something similar to jbd2 sync transaction > batching (used by ext4_handle_sync()), but in *async* fashion, > say, possibly implemented/batching in the jbd2 worker thread. > Is that reasonable? > > ... > > Any comments/feedback/reviews are very appreciated. > > Thank you in advance, > Mauricio > > [1] https://lore.kernel.org/linux-ext4/20190830012236.GC10779@xxxxxxx/ > > Mauricio Faria de Oliveira (11): > ext4: data=journal: introduce struct/kmem_cache > ext4_journalled_wb_page/_cachep > ext4: data=journal: handle page writeback in > __ext4_journalled_writepage() > ext4: data=journal: call ext4_force_commit() in ext4_writepages() for > msync() > ext4: data=journal: introduce helpers for journalled writeback > deadlock > ext4: data=journal: prevent journalled writeback deadlock in > __ext4_journalled_writepage() > ext4: data=journal: prevent journalled writeback deadlock in > ext4_write_begin() > ext4: grab page before starting transaction handle in > ext4_convert_inline_data_to_extent() > ext4: data=journal: prevent journalled writeback deadlock in > ext4_convert_inline_data_to_extent() > ext4: grab page before starting transaction handle in > ext4_try_to_write_inline_data() > ext4: deduplicate code with error legs in > ext4_try_to_write_inline_data() > ext4: data=journal: prevent journalled writeback deadlock in > ext4_try_to_write_inline_data() > > fs/ext4/ext4_jbd2.h | 88 +++++++++++++++++++++++++ > fs/ext4/inline.c | 153 +++++++++++++++++++++++++++++++------------- > fs/ext4/inode.c | 137 +++++++++++++++++++++++++++++++++++++-- > fs/ext4/page-io.c | 11 ++++ > 4 files changed, 341 insertions(+), 48 deletions(-) > > -- > 2.20.1 > -- Mauricio Faria de Oliveira