Re: [RFC PATCH] xfs: Prevent umount from indefinitely waiting on XFS_IFLUSHING flag on stale inodes

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 18 Sep 2024 11:20:12 +1000

On Mon, Sep 16, 2024 at 11:14:32AM +0530, Chandan Babu R wrote:
> On Thu, Sep 05, 2024 at 06:12:29 PM +0530, Chandan Babu R wrote:
> >>> To overcome this bug, this commit removes the check for log shutdown during
> >>> high level transaction commit operation. The log items in the high level
> >>> transaction will now be committed to the CIL despite the log being
> >>> shutdown. This will allow the CIL processing logic (i.e. xlog_cil_push_work())
> >>> to invoke xlog_cil_committed() as part of error handling. This will cause
> >>> xfs_buf log item to to be unpinned and the corresponding inodes to be aborted
> >>> and have their XFS_IFLUSHING flag cleared.
> >>
> >> I don't know exactly how the problem arose, but I can say for
> >> certain that the proposed fix is not valid.  Removing that specific
> >> log shutdown check re-opens a race condition which can causes on
> >> disk corruption. The shutdown was specifically placed to close that
> >> race - See commit 3c4cb76bce43 ("xfs: xfs_trans_commit() path must
> >> check for log shutdown") for details.
> >>
> >> I have no idea what the right way to fix this is yet, but removing
> >> the shutdown check isn't it...
> >>
> 
> Commit 3c4cb76bce43 describes the following scenario,
> 
> 1. Filesystem is shutdown but the log remains operational.
> 2. High-level transaction commit (i.e. xfs_trans_commit()) notices the fs
>    shutdown. Hence it aborts the dirty log items. One of the log items being
>    aborted is an inode log item.
> 3. An inode cluster writeback is executed. Here, we come across the previously
>    aborted inode log item. The inode log item is currently unpinned and
>    dirty. Hence, the inode is included in the cluster buffer writeback.
> 4. Cluster buffer IO completion tries to remove the inode log item from the
>    AIL and hence trips over an assert statement since the log item was never
>    on the AIL. This indicates that the inode was never written to the journal.
> 
> Hence the commit 3c4cb76bce43 will abort the transaction commit only when the
> log has been shutdown.
> 
> With the log shutdown check removed, we can end up with the following cases
> during high-level transaction commit operation,
> 1. The filesystem is shutdown while the log remains operational.
>    In this case, the log items are committed to the CIL where they are pinned
>    before unlocking them.
>    This should prevent the inode cluster writeback code
>    from including such an inode for writeback since the corresponding log item
>    is pinned. From here onwards, the normal flow of log items from the CIL to
>    the AIL occurs after the contents of the log items are written to the
>    journal and then later unpinned.
>    The above logic holds true even without applying the "RFC patch" presented
>    in the mail.
>  
> 2. The log is shutdown.
>    As in the previous case, the log items are moved to the CIL where they are
>    pinned before unlocking them. The pinning of the log items prevents
>    the inode cluster writeback code from including the pinned inode in its
>    writeback operation. These log items are then processed by
>    xlog_cil_committed() which gets invoked as part of error handling by
>    xlog_cil_push_work().

It's more complex than that.

3. We get an error returned from xfs_defer_finish_noroll() or
xfs_trans_run_precommits().

In these cases we skip the insertion into the CIL altogether and
cancel the dirty transaction with dirty log items exactly as we
currently do right now.

This is why, in that series of commits, I changed all the shutdown
checks for metadata operations to use xlog_is_shutdown() as opposed
to xfs_is_shutdown() - to ensure that shutdown checks are consistent
with what __xfs_trans_commit() required. Hence these paths would all
behave the same way when dirty transactions are canceled, and that
leads to the next point:

4. Random high level code that does this:

	if (xlog_is_shutdown(log))
		xfs_trans_cancel(tp);

has the same problem - we abort dirty items and release them when
the log has been shut down. In these cases, we have no possible
mechanism to insert them into the CIL to avoid the same shutdown
race condition.

IOWs, removing the shutdown check from __xfs_trans_commit() doesn't
resolve the problem entirely as there are other paths that lead to
the same situation where we cancel transactions due to the log being
shut down.

5.  An the architectural issue with the fix: inserting the items
into the CIL instead of aborting them requires something or someone
to push the CIL to error out the items and release them.

Who is responsible for invoking that push?

It's not xlog_cil_commit(). That only pushes on the CIL if the
amount of queued work is over the push threshold, otherwise it will
not queue xlog_cil_push_work() to run again.

Hence every call to xfs_trans_commit() is now absolutely reliant on
some future code pushing the CIL to free the items that were
committed to the CIL after the log is shut down.

I think that is eventually caught by the xfs_log_unmount() path
doing a log force. i.e.

xfs_log_unmount
  xfs_log_clean
    xfs_log_quiesce
      xfs_log_force
        xlog_cil_force
	  xlog_cil_force_seq
	    xlog_cil_push_now

However, this is reliant on there being no shutdown checks in this
path, and xlog_cil_push_work() avoiding any shutdown checks until
it's done all the work needed to be able to cancle the pending log
items, right?

IOWs, there's lots of things that have to be done just right for
this "future cleanup" mechanism to work, and we have to be very
careful not to place a shutdown check in just the wrong spot
otherwise we can break the shutdown processing.

We know how problematic this can be - this "cleanup is somebody
else's future problem" model is how we used to handle shutdown and
it was an -utter mess-.  Christoph and I spent a lot of time years
ago fixing the shutdown mess by moving to a model where the current
holder of an item is responsible for releasing that item when a
shutdown is detected. That included xfs_trans_commit().

This change of shutdown processing model was one of the reasons that
tests like generic/388 (and other fstress+shutdown tests) are
largely reliable now - they don't crash, hang, leak or trigger UAF
errors all the time like they used to. Shutdown reliability used to
be -horrible-.

Hence I really don't want to see us moving back towards a "cleanup is
somebody else's future problem" model, regardless of whether it
works in this case or not, because it's proven to be a broken model.

It's probably obvious by now that racing shutdowns with cancelling
dirty transaction safely isn't a simple problem.  That's why I said
"I have no idea what the right way to fix this is yet" - I didn't
have time to write a long, long email explaining all this.

I -think- that the ->iop_shutdown() mechanism I described to
Christoph recently can be applied here, too. That provided a
mechanism for the AIL push to be able to process items safely once a
shutdown had been detected, and I think it applies equally here.
i.e. we know the log has been shut down, and so we need to error out
the dirty log items via IO error mechanisms rather than simply
aborting the log items and hoping everyone notices that it's been
aborted....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx