Re: writeback completion soft lockup BUG in folio_wake_bit()

"Theodore Ts'o" <tytso@xxxxxxx> · Sat, 19 Mar 2022 12:23:04 -0400

On Fri, Mar 18, 2022 at 11:56:04AM -0700, Linus Torvalds wrote:
> On Fri, Mar 18, 2022 at 6:16 AM Jan Kara <jack@xxxxxxx> wrote:
> >
> > I agree with Dave that 'keep_towrite' thing is kind of self-inflicted
> > damage on the ext4 side (we need to write out some blocks underlying the
> > page but cannot write all from the transaction commit code, so we need to
> > keep xarray tags intact so that data integrity sync cannot miss the page).
> > Also it is no longer needed in the current default ext4 setup. But if you
> > have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered'
> > mount options, the hack is still needed AFAIK and we don't have a
> > reasonable way around it.
> 
> I assume you meant 'dioread_lock'.
> 
> Which seems to be the default (even if 'data=ordered' is not).

That's not quite right.  data=ordered is the default, and that has
been the case since ext3 was introduced.

"dioread_lock" was the default in the days of ext3; "dioread_nolock"
was added to allow parallel Direct I/O reads (hence "nolock").  A
while back, we tried to make dioread_nolock the default since it tends
improve performance for workloads that have a mix of writes that
should be affected by fsyncs, and those that shouldn't.

Howevver, we had to revert that change when blocksize < pagesize to
work around a problem found on Power machines where "echo 3 >
drop_caches" on the host appears to cause file system corruptions on
the guest.  (Commit 626b035b816b "ext4: don't set dioread_nolock by
default for blocksize < pagesize").

> IOW, we could simply warn about "data=ordered is no longer supported"
> and turn it into data=journal.
> 
> Obviously *only* do this for the case of "blocksize < PAGE_SIZE".

Actiavelly, we've discussed a number of times doing the reverse ---
removing data=journal entirely, since it's (a) rarely used, and (b)
data=journal disables a bunch of optimizations, including
preallocation, and so its performance is pretty awful for most
workloads.  The main reason why haven't until now is that we believe
there is a small number of people who do find data=journal useful for
their workload, and at least _so_ far it's not been that hard to keep
it limping along --- although in most cases, data=journal doesn't get
supported for new features or performance optimizations, and it
definitely does note get as much testing.

So the thing that I've been waiting to do for a while is to replace
the whole data=ordered vs data=writeback and dioread_nolock and
dioread_lock is a complete reworking of the ext4 buffered writeback
path, where we write the data blocks *first*, and only then update the
ext4 metadata.

Historically, going as far back as ext2, we've always allocated data
blocks and updatted the metadata blocks, and only then updated the
buffer or page cache for the data blocks.  All of the complexities
around data=ordered, data=writeback, dioread_nolock, etc., is because
we haven't done the fundamental work of reversing the order in which
we do buffered writeback.   What we *should* be doing is:

*) Determining where the new allocated data blockblocks should be, and
   preventing those blocks from being used for any other purposes, but
   *not* updating the file system metadata to reflect that change.

*) Submit the data block write

*) On write completion, update the metadata blocks in a kernel thread.

Over time, we've been finding more and more reasons why I need to do
this work, so it's something I'm going to have to prioritize in the
next few months.

Cheerse,

						- Ted