Re: [RFC PATCH] xfs: Prevent umount from indefinitely waiting on XFS_IFLUSHING flag on stale inodes

Chandan Babu R <chandanbabu@xxxxxxxxxx> · Sun, 22 Sep 2024 18:53:33 +0530

On Wed, Sep 18, 2024 at 11:20:12 AM +1000, Dave Chinner wrote:
> On Mon, Sep 16, 2024 at 11:14:32AM +0530, Chandan Babu R wrote:
>> On Thu, Sep 05, 2024 at 06:12:29 PM +0530, Chandan Babu R wrote:
>> >>> To overcome this bug, this commit removes the check for log shutdown during
>> >>> high level transaction commit operation. The log items in the high level
>> >>> transaction will now be committed to the CIL despite the log being
>> >>> shutdown. This will allow the CIL processing logic (i.e. xlog_cil_push_work())
>> >>> to invoke xlog_cil_committed() as part of error handling. This will cause
>> >>> xfs_buf log item to to be unpinned and the corresponding inodes to be aborted
>> >>> and have their XFS_IFLUSHING flag cleared.
>> >>
>> >> I don't know exactly how the problem arose, but I can say for
>> >> certain that the proposed fix is not valid.  Removing that specific
>> >> log shutdown check re-opens a race condition which can causes on
>> >> disk corruption. The shutdown was specifically placed to close that
>> >> race - See commit 3c4cb76bce43 ("xfs: xfs_trans_commit() path must
>> >> check for log shutdown") for details.
>> >>
>> >> I have no idea what the right way to fix this is yet, but removing
>> >> the shutdown check isn't it...
>> >>
>> 
>> Commit 3c4cb76bce43 describes the following scenario,
>> 
>> 1. Filesystem is shutdown but the log remains operational.
>> 2. High-level transaction commit (i.e. xfs_trans_commit()) notices the fs
>>    shutdown. Hence it aborts the dirty log items. One of the log items being
>>    aborted is an inode log item.
>> 3. An inode cluster writeback is executed. Here, we come across the previously
>>    aborted inode log item. The inode log item is currently unpinned and
>>    dirty. Hence, the inode is included in the cluster buffer writeback.
>> 4. Cluster buffer IO completion tries to remove the inode log item from the
>>    AIL and hence trips over an assert statement since the log item was never
>>    on the AIL. This indicates that the inode was never written to the journal.
>> 
>> Hence the commit 3c4cb76bce43 will abort the transaction commit only when the
>> log has been shutdown.
>> 
>> With the log shutdown check removed, we can end up with the following cases
>> during high-level transaction commit operation,
>> 1. The filesystem is shutdown while the log remains operational.
>>    In this case, the log items are committed to the CIL where they are pinned
>>    before unlocking them.
>>    This should prevent the inode cluster writeback code
>>    from including such an inode for writeback since the corresponding log item
>>    is pinned. From here onwards, the normal flow of log items from the CIL to
>>    the AIL occurs after the contents of the log items are written to the
>>    journal and then later unpinned.
>>    The above logic holds true even without applying the "RFC patch" presented
>>    in the mail.
>>  
>> 2. The log is shutdown.
>>    As in the previous case, the log items are moved to the CIL where they are
>>    pinned before unlocking them. The pinning of the log items prevents
>>    the inode cluster writeback code from including the pinned inode in its
>>    writeback operation. These log items are then processed by
>>    xlog_cil_committed() which gets invoked as part of error handling by
>>    xlog_cil_push_work().
>
> It's more complex than that.
>
> 3. We get an error returned from xfs_defer_finish_noroll() or
> xfs_trans_run_precommits().
>

Commit b5f17bec1213a3ed2f4d79ad4c566e00cabe2a9b has the following,

   2. xfs_force_shutdown() is used in places to cause the current
   modification to be aborted via xfs_trans_commit() because it may be
   impractical or impossible to cancel the transaction directly, and
   hence xfs_trans_commit() must cancel transactions when
   xfs_is_shutdown() is true in this situation. But we can't do that
   because of #1.

I see that xfs_trans_run_precommits() and xfs_defer_finish_noroll() both call
xfs_force_shutdown() when they encounter an error. But these functions could
have percolated the error code to __xfs_trans_commit() and let
__xfs_trans_commit() deal with the scenario more gracefully rather than
checking for log shutdown. Can you please explain why this is
impractical or impossible?

-- 
Chandan