tl;dr: I may have found all the cache flush problems that I'm reproducing with generic/482. The last two failures I've RCA'd have been repair complaining about invalid attribute because attr fork setup is non-atomic (xfs_attr_shortform_to_leaf() on an empty attr fork). I'm still running tests to see if another cache flush related failure occurs, but I'm more confident than I was yesterday that the light I see isn't another train... The first seven patches in this series fix regressions introduced by commit eef983ffeae7 ("xfs: journal IO cache flush reductions") and exposed by generic/482. I have one test machine that reproduces the failures, and in general a failure occurs 1 in every 30 test iterations. The first two patches are related to external logs. These were found by code inspection while looking for the cause of the generic/482 failures. The third and fourth patch addresses a race condition between the new unconditional async data device cache flush run by the CIL push and the log tail changing between that flush completing and the commit iclog being synced. In this situation, if the commit_iclog doesn't issue a REQ_PREFLUSH, we fail to guarantee that the metadata IO completions that moved the log tail are on stable storage. The fix for this issue is to sample the log tail before we issue the async cache flush, and if the log tail has changed when we release the commit_iclog we set the XLOG_ICL_NEED_FLUSH flag on the iclog to guarantee a cache flush is issued before the commit record that moves the tail of the log forward is written to the log. The fifth and sixth patches address an oversight about log forces. Log forces imply a cache flush if they result in iclog IO being issued, allowing the caller to elide cache flushes that they might require for data integrity. The change to requiring the iclog owner to signal that a cache flush is required completely missed the log force code. This patch ensures that all ACTIVE and WANT_SYNC iclogs that we either flush or wait on issue cache flushes and so guarantee that they are on stable storage at IO completion before they signal completion to log force waiters. The seventh patch is not a regression, but a fix for another log force iclog related flush problem I noticed and found while triaging the above problems. Occasionally the tests would sit idle doing nothing for up to 30s waiting on a xfs_log_force_seq() call to complete. The trigger for them to wake up was the background log worker issuing a new log force. This appears to be another "incorrectly wait on future iclog state changes" situation, and it's notable that xfs_log_force() specifically handles this situation while xfs_log_force_seq() does not. Changing xfs_log_force_seq() to use the same "iclog IO already completed" detection as xfs_log_force() made those random delays go away. The eighth patch is a fix for inode logging being able to make the on-disk LSN in the inode go backwards in time. This is a regression from 2016 when we shrunk the inode by not tracking the on-disk inode LSN in memory any more. This can cause inodes to be replayed incorrectly, leading to silent corruption during log recovery. The ninth patch is a fix for a zero-day bug in the on-disk LSN log recovery ordering for attribute leaf blocks. We fail to recognise them and extract their LSN, so they are always recovered immediately. This can result in recovery taking them backwards in time, leading to silent corruption during log recovery. This has been around since recovery ordering was introduced for v5 filesystems way back in 2013.... The tenth patch is an addition to the iclog state tracing that dumps the iclog flags which tells me whether the NEED_FLUSH/NEED_FUA flags are set on the iclog as it is released for IO. This helps validate whether the iclog flushing is doing the right thing when a g/482 failure occurs with a suspected cache flushing issue. The last patch is a fix for the new race condition found. It's tricky and complex and requires a commit record in the iclog left in ACTIVE state with NEED_FUA set on it, then to have a log tail update occur after the next CIL push flush is done but before the iclog is flushed when the checkpoint overruns the remaining iclog space. Please consider this patchset for a 5.14-rc4 merge, because we need to get the obvious failures and regressions fixed sooner rather later. Version 3: - fixed typos in commits - Updated tail lsn vs flush comments on xlog_state_release_iclog() - Added new patch to fix another tail lsn vs iclog write race Version 2: - rebased on 5.14-rc3 - https://lore.kernel.org/linux-xfs/20210726060716.3295008-1-david@xxxxxxxxxxxxx/T/#t - split __xlog_state_release_iclog() folding into separate patch (Christoph) - Reinstated lost trace_xlog_iclog_syncing tracepoint. - split xlog_force_iclog() helper into separate patch. - In xlog_force_lsn(), don't set log_flushed for the WANT_SYNC case as there is no guarantee that the log has been flushed when we return if we are not waiting. - appended inode LSN logging fix to series from https://lore.kernel.org/linux-xfs/20210722015335.3063274-1-david@xxxxxxxxxxxxx/T/#t - Fixed inode recovery failure when two checkpoints have the same recovery LSN (start records in the same iclog) and an inode is logged in both checkpoints. The second transaction would see an equal LSN to the inode on disk and skip replay, even though it was required. - added more comments to indicate the LSN in the xfs_log_dinode is never valid for recovery or replay purposes - added new patch for ATTR3_LEAF magic number detection and recovery (yet another zero day!). Version 1: - based on 5.14-rc1 - https://lore.kernel.org/linux-xfs/20210722015335.3063274-1-david@xxxxxxxxxxxxx/T/#t