On Fri, Mar 18, 2022 at 11:56:04AM -0700, Linus Torvalds wrote: > On Fri, Mar 18, 2022 at 6:16 AM Jan Kara <jack@xxxxxxx> wrote: > > > > I agree with Dave that 'keep_towrite' thing is kind of self-inflicted > > damage on the ext4 side (we need to write out some blocks underlying the > > page but cannot write all from the transaction commit code, so we need to > > keep xarray tags intact so that data integrity sync cannot miss the page). > > Also it is no longer needed in the current default ext4 setup. But if you > > have blocksize < pagesize and mount the fs with 'dioreadlock,data=ordered' > > mount options, the hack is still needed AFAIK and we don't have a > > reasonable way around it. > > I assume you meant 'dioread_lock'. > > Which seems to be the default (even if 'data=ordered' is not). That's not quite right. data=ordered is the default, and that has been the case since ext3 was introduced. "dioread_lock" was the default in the days of ext3; "dioread_nolock" was added to allow parallel Direct I/O reads (hence "nolock"). A while back, we tried to make dioread_nolock the default since it tends improve performance for workloads that have a mix of writes that should be affected by fsyncs, and those that shouldn't. Howevver, we had to revert that change when blocksize < pagesize to work around a problem found on Power machines where "echo 3 > drop_caches" on the host appears to cause file system corruptions on the guest. (Commit 626b035b816b "ext4: don't set dioread_nolock by default for blocksize < pagesize"). > IOW, we could simply warn about "data=ordered is no longer supported" > and turn it into data=journal. > > Obviously *only* do this for the case of "blocksize < PAGE_SIZE". Actiavelly, we've discussed a number of times doing the reverse --- removing data=journal entirely, since it's (a) rarely used, and (b) data=journal disables a bunch of optimizations, including preallocation, and so its performance is pretty awful for most workloads. The main reason why haven't until now is that we believe there is a small number of people who do find data=journal useful for their workload, and at least _so_ far it's not been that hard to keep it limping along --- although in most cases, data=journal doesn't get supported for new features or performance optimizations, and it definitely does note get as much testing. So the thing that I've been waiting to do for a while is to replace the whole data=ordered vs data=writeback and dioread_nolock and dioread_lock is a complete reworking of the ext4 buffered writeback path, where we write the data blocks *first*, and only then update the ext4 metadata. Historically, going as far back as ext2, we've always allocated data blocks and updatted the metadata blocks, and only then updated the buffer or page cache for the data blocks. All of the complexities around data=ordered, data=writeback, dioread_nolock, etc., is because we haven't done the fundamental work of reversing the order in which we do buffered writeback. What we *should* be doing is: *) Determining where the new allocated data blockblocks should be, and preventing those blocks from being used for any other purposes, but *not* updating the file system metadata to reflect that change. *) Submit the data block write *) On write completion, update the metadata blocks in a kernel thread. Over time, we've been finding more and more reasons why I need to do this work, so it's something I'm going to have to prioritize in the next few months. Cheerse, - Ted