Re: [RFC PATCH 00/11] ext4: data=journal: writeback mmap'ed pagecache

On May 15, 2020, at 12:39 PM, Mauricio Faria de Oliveira <mfo@xxxxxxxxxxxxx> wrote:
> 
> Hi,
> 
> On Thu, Apr 23, 2020 at 8:37 PM Mauricio Faria de Oliveira
> <mfo@xxxxxxxxxxxxx> wrote:
> [snip]
>> Summary:
>> -------
>> 
>> The patchset is a bit long with 11 patches, but I tried to get
>> changes tiny to help with review, and better document how each
>> of them work, why and how this or that is done.  It's RFC as I
>> would like to ask for suggestions/feedback, if at all possible.
> 
> If at all possible, could this patchset get at least a cursory look?
> 
> I'm aware it's been a busy period for some of you, so I just wanted
> to send a friendly ping, in case this got buried deep under other work.


Hi Mauricio,
thank you for submitting the patch.  I've always thought data=journal
is a useful feature, especially for the future with host-managed SMR
(which seems to be coming to drives whether we want it or not), and
for hybrid use cases (an NVMe journal plus a large HDD, which may
also be SMR).

That said, diving into the ext4/jbd2 code on your first adventure
here is certainly ambitious, and it is great that you have taken this
fix on.

I'd say that Jan is the resident expert on jbd2, so hopefully he will
have a chance to look at this patch series.

Cheers, Andreas

>> Patches 01 and 02 implement the fix outlined in [1], with a few
>> changes (fix the first deadlock; use existing plumbing in jbd2 as
>> the list.)
>> 
>> Patch 03 fixes a seconds-long delay on msync().
>> 
>> Patch 04 introduces helpers to handle the second deadlock.
>> 
>> Patches 05-11 handle the second deadlock (three of these patches,
>> namely 07, 09 and 10, are changes not specific to data=journal and
>> affect other journaling modes too, so that is not reflected in
>> their subject lines.)
>> 
>> The order of the patches intentionally allows the issues in 03
>> and 05-11 to occur (while putting the core patches first), so
>> that the issues can be reproduced/regression-tested one by one,
>> as needed.  It can be changed, of course, so that the actual
>> writeback changes are only enabled by the last patch (once the
>> issues are all patched.)
>> 
>> 
>> Testing:
>> -------
>> 
>> This has been built and regression tested on next-20200417.
>> (Also rebased and build tested on next-20200423 / "today").
>> 
>> On the xfstests (commit b2faf204) quick group (excluding
>> generic/430, generic/431 and generic/434, which always hung):
>> no regressions with either data=ordered (the default) or
>> data=journal,journal_checksum.
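>> 
>> For reference, with xfstests' usual ./check script, an invocation
>> of that shape would be the following (the exact command line is an
>> assumption on my part, not necessarily what was run):
>> 
>>    $ ./check -g quick -e generic/430 -e generic/431 -e generic/434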
>> 
>> With data=ordered: (on both original and patched kernel)
>> 
>>    Failures: generic/223 generic/465 generic/553 generic/554 generic/565 generic/570
>> 
>> With data=journal,journal_checksum: (likewise)
>> 
>>    Failures: ext4/044 generic/223 generic/441 generic/553 generic/554 generic/565 generic/570
>> 
>> The test case for the problem (and the deadlocks), also used for
>> further stress testing, is stress-ng (with 512 workers on 16 vCPUs):
>> 
>>    $ sudo mount -o data=journal,journal_checksum $DEV $MNT
>>    $ cd $MNT
>>    $ sudo stress-ng --mmap 512 --mmap-file --timeout 1w
>> 
>> To reproduce the problem (without the patchset), run it for a
>> bit, crash the kernel (to cause an unclean shutdown) with sysrq,
>> and mount the device again (it should fail / need e2fsck).
>> For example, the crash can be forced via the sysrq trigger:
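>> 
>>    $ echo c | sudo tee /proc/sysrq-trigger
>> 
>> The mount results are then: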
>> 
>> Original:
>> 
>>    [   27.660063] JBD2: Invalid checksum recovering data block 79449 in log
>>    [   27.792371] JBD2: recovery failed
>>    [   27.792854] EXT4-fs (vdc): error loading journal
>>    mount: /tmp/ext4: can't read superblock on /dev/vdc.
>> 
>> Patched:
>> 
>>    [  33.111230] EXT4-fs (vdc): 512 orphan inodes deleted
>>    [  33.111961] EXT4-fs (vdc): recovery complete
>>    [  33.114590] EXT4-fs (vdc): mounted filesystem with journalled data mode. Opts: data=journal,journal_checksum
>> 
>> 
>> RFC / Questions:
>> ---------------
>> 
>> 0) Usage of ext4_inode_info.i_datasync_tid for checks
>> 
>> We rely on the struct ext4_inode_info.i_datasync_tid field
>> (set by __ext4_journalled_writepage() and others) to check it
>> against the running transaction.  Of course, other sites set
>> it too, so some of our checks may then return false positives
>> (which should be fine, just less efficient.)
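>> 
>> As a rough illustration (a sketch only; the helper name and the
>> exact locking are assumptions, not the actual patch code), the
>> check is along these lines:
>> 
>>    /* Hypothetical helper: was this inode's journalled data
>>     * last touched by (or after) the running transaction?
>>     * Other paths also set i_datasync_tid, so this can return
>>     * false positives. */
>>    static bool ext4_data_tid_in_running_txn(struct inode *inode)
>>    {
>>            journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
>>            tid_t datasync_tid = EXT4_I(inode)->i_datasync_tid;
>>            bool ret = false;
>> 
>>            read_lock(&journal->j_state_lock);
>>            if (journal->j_running_transaction)
>>                    ret = tid_geq(datasync_tid,
>>                                  journal->j_running_transaction->t_tid);
>>            read_unlock(&journal->j_state_lock);
>>            return ret;
>>    }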
>> 
>> To avoid such false positives, we could add another field to
>> that structure, exclusively for this, but that is another 8
>> bytes (the size of a pointer) per inode, even in the
>> non-data=journal cases, so it didn't seem like a good enough
>> reason.  But if it is better/worth it for efficiency reasons
>> (speed, in this case, vs. memory consumption), we could do it.
>> 
>> Maybe there are other ideas/better ways to do it?
>> 
>> 1) Usage of ext4_force_commit() in ext4_writepages()
>> 
>> Patch 03 describes/fixes an issue whose underlying problem is
>> that if __ext4_journalled_writepage() does set_page_writeback()
>> but no journal commit is triggered, wait_on_page_writeback() may
>> wait for seconds until the periodic journal commit happens.
>> 
>> The solution there, to fix the impact on msync(), is to just
>> call ext4_force_commit() (as is done anyway in ext4_sync_file())
>> from ext4_writepages().
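>> 
>> In rough terms (a sketch under assumptions; the exact condition
>> in the patch may differ), the idea is a fragment like this at
>> the end of ext4_writepages():
>> 
>>    /* Sketch: force a journal commit for data=journal so that
>>     * msync()'s later wait_on_page_writeback() is not stalled
>>     * until the periodic (default 5s) commit. */
>>    if (ext4_should_journal_data(inode) &&
>>        wbc->sync_mode == WB_SYNC_ALL)
>>            ret = ext4_force_commit(inode->i_sb);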
>> 
>> Is that a good enough solution?  Other ideas?
>> 
>> 2) Similar issue (unhandled) in ext4_writepage()
>> 
>> The other, related question is: what about direct callers of
>> ext4_writepage() that obviously do not use ext4_writepages()?
>> (e.g., pageout() and writeout(); write_one_page() is not used.)
>> 
>> Those are memory-cleansing writeback, which should not wait;
>> however, as mentioned in that patch, if their writeback goes on
>> for seconds and a data-integrity writeback/system call comes in,
>> it is delayed in wait_on_page_writeback() for that long.
>> 
>> So, ideally, we should be trying to kick a journal commit?
>> 
>> It looks like ext4_handle_sync() is not the answer, since it
>> waits for the commit to finish, and pageout() is called on a
>> list of pages by the shrinker, so it is not effective to block
>> on each one of them.
>> 
>> We might not want to start anything right away, actually, since
>> memory-cleansing writeback can happen under memory pressure,
>> right?  But we would need to do something to ensure that future
>> wait_on_page_writeback() calls do not wait too long.
>> 
>> Maybe the answer is something similar to the jbd2 sync
>> transaction batching (used by ext4_handle_sync()), but in an
>> *async* fashion, say, implemented/batched in the jbd2 worker
>> thread.  Is that reasonable?
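>> 
>> One possible building block (a sketch only, not part of this
>> series) is jbd2_log_start_commit(), which requests a commit of
>> the running transaction without waiting for it to finish:
>> 
>>    /* Sketch: kick a commit of the running transaction without
>>     * blocking, so that a later data-integrity
>>     * wait_on_page_writeback() does not stall for seconds.
>>     * (jbd2_log_start_commit() takes j_state_lock itself, so
>>     * drop it before calling.) */
>>    journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
>>    tid_t tid = 0;
>>    bool kick = false;
>> 
>>    read_lock(&journal->j_state_lock);
>>    if (journal->j_running_transaction) {
>>            tid = journal->j_running_transaction->t_tid;
>>            kick = true;
>>    }
>>    read_unlock(&journal->j_state_lock);
>>    if (kick)
>>            jbd2_log_start_commit(journal, tid);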
>> 
>> ...
>> 
>> Any comments/feedback/reviews are very appreciated.
>> 
>> Thank you in advance,
>> Mauricio
>> 
>> [1] https://lore.kernel.org/linux-ext4/20190830012236.GC10779@xxxxxxx/
>> 
>> Mauricio Faria de Oliveira (11):
>>  ext4: data=journal: introduce struct/kmem_cache
>>    ext4_journalled_wb_page/_cachep
>>  ext4: data=journal: handle page writeback in
>>    __ext4_journalled_writepage()
>>  ext4: data=journal: call ext4_force_commit() in ext4_writepages() for
>>    msync()
>>  ext4: data=journal: introduce helpers for journalled writeback
>>    deadlock
>>  ext4: data=journal: prevent journalled writeback deadlock in
>>    __ext4_journalled_writepage()
>>  ext4: data=journal: prevent journalled writeback deadlock in
>>    ext4_write_begin()
>>  ext4: grab page before starting transaction handle in
>>    ext4_convert_inline_data_to_extent()
>>  ext4: data=journal: prevent journalled writeback deadlock in
>>    ext4_convert_inline_data_to_extent()
>>  ext4: grab page before starting transaction handle in
>>    ext4_try_to_write_inline_data()
>>  ext4: deduplicate code with error legs in
>>    ext4_try_to_write_inline_data()
>>  ext4: data=journal: prevent journalled writeback deadlock in
>>    ext4_try_to_write_inline_data()
>> 
>> fs/ext4/ext4_jbd2.h |  88 +++++++++++++++++++++++++
>> fs/ext4/inline.c    | 153 +++++++++++++++++++++++++++++++-------------
>> fs/ext4/inode.c     | 137 +++++++++++++++++++++++++++++++++++++--
>> fs/ext4/page-io.c   |  11 ++++
>> 4 files changed, 341 insertions(+), 48 deletions(-)
>> 
>> --
>> 2.20.1
>> 
> 
> 
> --
> Mauricio Faria de Oliveira

