On Fri, Jan 12, 2024 at 01:42:32PM -0500, Brian Foster wrote:
> On Fri, Jan 12, 2024 at 08:55:47PM +0800, Long Li wrote:
> > On Thu, Jan 11, 2024 at 11:47:47AM +1100, Dave Chinner wrote:
> > > On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote:
> > > > In order to make sure that buffers are submitted on LSN boundaries
> > > > in the abnormal paths, we need to check the error status before
> > > > submitting the buffers that have been added from the last record
> > > > processed. If an error status exists, the buffers in the buffer_list
> > > > should be canceled.
> > > >
> > > > Canceling the buffers in the buffer_list directly isn't correct on
> > > > its own: unlike any other place where the write list is canceled,
> > > > these buffers have been initialized by xfs_buf_item_init() during
> > > > recovery and are held by their buf items, and the buf items will not
> > > > be released by xfs_buf_delwri_cancel(). If these buffers are
> > > > submitted successfully, the buf items associated with them are
> > > > released at I/O completion. So the buf items also need to be
> > > > released when the write list is canceled.
> > >
> > > I still don't think this is correct.
> > >
> > > > Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery")
> > > > Signed-off-by: Long Li <leo.lilong@xxxxxxxxxx>
> > > > ---
> > > >  fs/xfs/xfs_buf.c         |  2 ++
> > > >  fs/xfs/xfs_log_recover.c | 22 +++++++++++++---------
> > > >  2 files changed, 15 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index 8e5bd50d29fe..6a1b26aaf97e 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -2075,6 +2075,8 @@ xfs_buf_delwri_cancel(
> > > >  		xfs_buf_lock(bp);
> > > >  		bp->b_flags &= ~_XBF_DELWRI_Q;
> > > >  		xfs_buf_list_del(bp);
> > > > +		if (bp->b_log_item)
> > > > +			xfs_buf_item_relse(bp);
> > > >  		xfs_buf_relse(bp);
> > >
> > > I still don't think this is safe. The buffer log item might still be
> > > tracked in the AIL when the delwri list is cancelled, so the delwri
> > > list cancelling cannot release the BLI without removing the item
> > > from the AIL, too. The delwri cancelling walk really shouldn't be
> > > screwing with AIL state, which means it can't touch the BLIs here.
> > >
> > > At minimum, it's a landmine for future users of
> > > xfs_buf_delwri_cancel(). A quick look at the quotacheck code
> > > indicates that it can cancel delwri lists that have BLIs in the AIL
> > > (for newly allocated dquot chunks), so I think this is a real concern.
> > >
> > > This is one of the reasons for submitting the delwri list on error;
> > > the IO completion code does all the correct cleanup of log items
> > > including removing them from the AIL because the buffer is now
> > > either clean or stale and no longer needs to be tracked by the AIL.
> >
> > Yes, it's not a safe solution.
> >
> > >
> > > If the filesystem has been shut down, then delwri list submission
> > > will error out all buffers on the list via IO submission/completion
> > > and do all the correct cleanup automatically.
> > >
> > > I note that write IO errors during log recovery will cause immediate
> > > shutdown of the filesystem via xfs_buf_ioend_handle_error():
> > >
> > > 	/*
> > > 	 * We're not going to bother about retrying this during recovery.
> > > 	 * One strike!
> > > 	 */
> > > 	if (bp->b_flags & _XBF_LOGRECOVERY) {
> > > 		xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> > > 		return false;
> > > 	}
> > >
> > > So I'm guessing that the IO error injection error that caused this
> > > failure was on a buffer read part way through recovering items.
> > >
> > > Can you confirm that the failure is only seen after read IO error
> > > injection and that write IO error injection causes immediate
> > > shutdown and so avoids the problem altogether?
> >
> > This problem is very hard to reproduce; we have reproduced it only
> > three times. There may be several mounts between writing buffers that
> > are not on LSN boundaries and the report of free space btree
> > corruption, so I can't tell in which mount during the test the
> > violation happened. Judging by the messages I've reproduced, I can't
> > confirm that the failure is only seen after read IO error injection.
> > Looking at one of the kernel logs I've reproduced, there are several
> > mount failures before the free space btree corruption is reported,
> > and the reasons for the mount failures include both read IO errors
> > and write IO errors.
> >
> > [51555.801349] XFS (dm-3): Mounting V5 Filesystem
> > [51555.982130] XFS (dm-3): Starting recovery (logdev: internal)
> > [51558.153638] FAULT_INJECTION: forcing a failure.
> > name fail_make_request, interval 20, probability 1, space 0, times -1
> > [51558.153723] XFS (dm-3): log recovery read I/O error at daddr 0x3972 len 1 error -5
> > [51558.165996] XFS (dm-3): log mount/recovery failed: error -5
> > [51558.166880] XFS (dm-3): log mount failed
> > [51558.410963] XFS (dm-3): EXPERIMENTAL big timestamp feature in use. Use at your own risk!
> > [51558.410981] XFS (dm-3): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
> > [51558.413074] XFS (dm-3): Mounting V5 Filesystem
> > [51558.595739] XFS (dm-3): Starting recovery (logdev: internal)
> > [51559.592552] FAULT_INJECTION: forcing a failure.
> > name fail_make_request, interval 20, probability 1, space 0, times -1
> > [51559.593008] XFS (dm-3): metadata I/O error in "xfs_buf_ioend_handle_error+0x170/0x760 [xfs]" at daddr 0x1879e0 len 32 error 5
> > [51559.593335] XFS (dm-3): Metadata I/O Error (0x1) detected at xfs_buf_ioend_handle_error+0x63c/0x760 [xfs] (fs/xfs/xfs_buf.c:1272). Shutting down filesystem.
> > [51559.593346] XFS (dm-3): Please unmount the filesystem and rectify the problem(s)
> > [51559.602833] XFS (dm-3): log mount/recovery failed: error -5
> > [51559.603772] XFS (dm-3): log mount failed
> > [51559.835690] XFS (dm-3): EXPERIMENTAL big timestamp feature in use. Use at your own risk!
> > [51559.835708] XFS (dm-3): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
> > [51559.837829] XFS (dm-3): Mounting V5 Filesystem
> > [51560.024083] XFS (dm-3): Starting recovery (logdev: internal)
> > [51562.155545] FAULT_INJECTION: forcing a failure.
> > name fail_make_request, interval 20, probability 1, space 0, times -1
> > [51562.445074] XFS (dm-3): Ending recovery (logdev: internal)
> > [51563.553960] XFS (dm-3): Internal error ltbno + ltlen > bno at line 1976 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_ag_extent+0x558/0xd80 [xfs]
> > [51563.558629] XFS (dm-3): Corruption detected. Unmount and run xfs_repair
> >
> > >
> > > If so, then all we need to do to handle instantiation side errors
> > > (EIO, ENOMEM, etc) is this:
> > >
> > > 	/*
> > > 	 * Submit buffers that have been dirtied by the last record recovered.
> > > 	 */
> > > 	if (!list_empty(&buffer_list)) {
> > > 		if (error) {
> > > 			/*
> > > 			 * If there has been an item recovery error then we
> > > 			 * cannot allow partial checkpoint writeback to
> > > 			 * occur. We might have multiple checkpoints with the
> > > 			 * same start LSN in this buffer list, and partial
> > > 			 * writeback of a checkpoint in this situation can
> > > 			 * prevent future recovery of all the changes in the
> > > 			 * checkpoints at this start LSN.
> > > 			 *
> > > 			 * Note: Shutting down the filesystem will result in
> > > 			 * the delwri submission marking all the buffers
> > > 			 * stale, completing them and cleaning up
> > > 			 * _XBF_LOGRECOVERY state without doing any IO.
> > > 			 */
> > > 			xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
> > > 		}
> > > 		error2 = xfs_buf_delwri_submit(&buffer_list);
> > > 	}
> >
> > This solution is also used in our internally maintained linux branch,
> > and after several months of testing the problem no longer arises. It
> > seems safe and reasonable enough.
> >
>
> I assume you're referring to the xfs_buf_delwri_cancel() change..?

No, I'm not referring to the xfs_buf_delwri_cancel() solution, but to the
use of the xlog_force_shutdown() solution. We believe the shutdown
solution is less risky because the buffer list is cleaned up automatically
via IO submission/completion and it doesn't change any other logic, so
that is what we've used in our internally maintained linux branch. On the
other hand, I thought it would be better to explicitly cancel the buffer
list, so I sent this patch out, but I overlooked the potential issues...

> If so, this is a valid data point but doesn't necessarily help explain
> whether the change is correct in any other context. I/O cancel probably
> doesn't happen often for one, and even if it does, it's not totally
> clear if you're reproducing a situation where the item might be AIL
> resident or not at the time (or it is and you have a use after free that
> goes undetected). And even if none of that is relevant, that still
> doesn't protect against future code changes if this code doesn't respect
> the established bli lifecycle rules.

Yes, I agree with you.

> IIRC, the bli_refcount (sampled via xfs_buf_item_relse()) is not
> necessarily an object lifecycle refcount. It simply reflects whether the
> item exists in a transaction where it might eventually be dirtied. This
> is somewhat tricky, but can also be surmised from some of the logic in
> xfs_buf_item_put(), for example.
>
> IOW, I think in the simple (non-recovery) case of a buffer being read,
> modified and committed by a transaction, the bli would eventually end up
> in a state where bli_refcount == 0 but is still resident in the AIL
> until eventually written back by xfsaild. That metadata writeback
> completion is what eventually frees the bli via xfs_buf_item_done().
>
> So if I'm not mistaken wrt the above example sequence, the interesting
> question is: if we suppose a buffer is in that intermediate state of
> waiting for writeback, and then somebody were to hypothetically execute
> a bit of code that simply added the associated buffer to a delwri queue
> and immediately cancelled it, what would happen with this change in
> place? ISTM the removal would prematurely free the buffer/bli while it's
> still resident in the AIL and pending writeback, thus resulting in a
> use-after-free or potential memory/list corruption, etc. Is that not the
> case?

Yes, if the buf item is still in the AIL, it is obviously not right to
release it.

Thanks,
Long Li
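
To make Brian's hypothetical concrete, here is a minimal sketch (not code
from any real caller; "bp" and "tmp_list" are made up for illustration) of
the sequence he describes, assuming bp is a buffer whose committed changes
left its BLI with bli_refcount == 0 but still resident in the AIL, waiting
for xfsaild writeback:

	/*
	 * Hypothetical caller, for illustration only. bp is assumed to be
	 * locked and referenced by the caller, and its BLI is still tracked
	 * in the AIL, pending writeback by xfsaild.
	 */
	LIST_HEAD(tmp_list);

	xfs_buf_delwri_queue(bp, &tmp_list);	/* takes a hold, sets _XBF_DELWRI_Q */
	xfs_buf_relse(bp);			/* drop the caller's lock and ref;
						 * the delwri queue holds its own */
	xfs_buf_delwri_cancel(&tmp_list);	/* change our mind and cancel */

	/*
	 * With the proposed hunk applied, the cancel path would call
	 * xfs_buf_item_relse() and free the BLI while it is still linked
	 * into the AIL, so the next AIL traversal or the eventual writeback
	 * completion would dereference freed memory. That is why the cancel
	 * path must not touch BLIs, and why shutting down the log and then
	 * calling xfs_buf_delwri_submit(), letting IO completion do the AIL
	 * removal, is the safer way to discard the recovery buffer list.
	 */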