On Wed, 2022-08-10 at 11:58 +1000, Dave Chinner wrote:
> On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote:
> > > Recent parent pointer testing has exposed a bug in the underlying
> > > attr replay. A multi transaction replay currently performs a
> > > single step of the replay, then defers the rest if there is more
> > > to do.
> >
> > Yup.
> >
> > > This causes race conditions with other attr replays that
> > > might be recovered before the remaining deferred work has had a
> > > chance to finish.
>
> What other attr replays are we racing against? There can only be
> one incomplete attr item intent/done chain per inode present in log
> recovery, right?

No, a rename queues up a set and a remove before committing the
transaction: one to add the new parent pointer, and another to remove
the old one. It can't be an attr replace because technically the names
are different. So the recovered set grows the leaf and returns -EAGAIN,
then the rest gets capture-committed. Next up is the recovered remove,
which pulls out the fork, and that causes problems when the rest of the
set operation resumes as a deferred operation.

Here is the link to the original discussion; it was quite a while ago:
https://lore.kernel.org/all/Yrzw9F5aGsaldrmR@magnolia/

I hope that helps?
Allison

>
> > > This can lead to interleaved set and remove
> > > operations that may clobber the attribute fork. Fix this by
> > > deferring all work for any attribute operation.
>
> Which means this should be an impossible situation.
>
> That is, if we crash before the final attrd DONE intent is written
> to the log, it means that new attr intents for modifications made
> *after* the current attr modification was completed will not be
> present in the log.
> We have strict ordering of committed operations
> in the journal, hence an operation on an inode that has an incomplete
> intent *must* be the last operation and the *only* incomplete intent
> that is found in the journal for that inode.
>
> Hence from an operational ordering perspective, this explanation
> for the issue being seen doesn't make any sense to me. If there are
> multiple incomplete attri intents then we've either got a runtime
> journalling problem (a white-out issue? failing to relog the inode
> in each new intent?) or a log recovery problem (failing to match
> intent-done pairs correctly?), not a recovery deferral issue.
>
> Hence I think we're still looking for the root cause of this
> problem...
>
> > > Signed-off-by: Allison Henderson <allison.henderson@xxxxxxxxxx>
> > > ---
> > >  fs/xfs/xfs_attr_item.c | 35 ++++++++---------------------------
> > >  1 file changed, 8 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> > > index 5077a7ad5646..c13d724a3e13 100644
> > > --- a/fs/xfs/xfs_attr_item.c
> > > +++ b/fs/xfs/xfs_attr_item.c
> > > @@ -635,52 +635,33 @@ xfs_attri_item_recover(
> > >  		break;
> > >  	case XFS_ATTRI_OP_FLAGS_REMOVE:
> > >  		if (!xfs_inode_hasattr(args->dp))
> > > -			goto out;
> > > +			return 0;
> > >  		attr->xattri_dela_state = xfs_attr_init_remove_state(args);
> > >  		break;
> > >  	default:
> > >  		ASSERT(0);
> > > -		error = -EFSCORRUPTED;
> > > -		goto out;
> > > +		return -EFSCORRUPTED;
> > >  	}
> > >
> > >  	xfs_init_attr_trans(args, &tres, &total);
> > >  	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp);
> > >  	if (error)
> > > -		goto out;
> > > +		return error;
> > >
> > >  	args->trans = tp;
> > >  	done_item = xfs_trans_get_attrd(tp, attrip);
> > > +	args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE;
> > > +	set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags);
> > >
> > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > >  	xfs_trans_ijoin(tp, ip, 0);
> > >
> > > -	error = xfs_xattri_finish_update(attr, done_item);
> > > -	if (error == -EAGAIN) {
> > > -		/*
> > > -		 * There's more work to do, so add the intent item to this
> > > -		 * transaction so that we can continue it later.
> > > -		 */
> > > -		xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> > > -		error = xfs_defer_ops_capture_and_commit(tp, capture_list);
> > > -		if (error)
> > > -			goto out_unlock;
> > > -
> > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > -		xfs_irele(ip);
> > > -		return 0;
> > > -	}
> > > -	if (error) {
> > > -		xfs_trans_cancel(tp);
> > > -		goto out_unlock;
> > > -	}
> > > -
> > > +	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> >
> > This seems a little convoluted to me. Maybe? Maybe not?
> >
> > 1. Log recovery recreates an incore xfs_attri_log_item from what it
> > finds in the log.
> >
> > 2. This function then logs an xattrd for the recovered xattri item.
> >
> > 3. Then it creates a new xfs_attr_intent to complete the operation.
> >
> > 4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a
> > new xattri for the intent created in step 3 and also commits the
> > xattrd for the first xattri.
> >
> > IOWs, the only difference between before and after is that we're
> > not advancing one more step through the state machine as part of
> > log recovery. From the perspective of the log, the recovery
> > function merely replaces the recovered xattri log item with a new
> > one.
> >
> > Why can't we just attach the recovered xattri to the
> > xfs_defer_pending that is created to point to the xfs_attr_intent
> > that's created in step 3, and skip the xattrd?
>
> Remember that attribute intents are different to all other intent
> types that we have. The existing extent based intents define a
> single independent operation that needs to be performed, and each
> step of the intent chain is completely independent of the previous
> step in the chain. e.g. removing the extent from the rmap btree is
> completely independent of removing it from the inode bmap btree -
> all that matters is that the removal from the bmbt happens first.
> The rmapbt removal can happen at any time after that, and is
> completely independent of any other bmbt or rmapbt operation.
> Similarly, the EFI can be processed independently of all bmapbt and
> rmapbt modifications, it just has to happen after those
> modifications are done.
>
> Hence if we crash during recovery, we can just restart from
> wherever we got to in the middle of the intent chains and not have
> to care at all. IOWs, eventual consistency works with these chains
> because there are no dependencies between each step of the intent
> chain and each step is completely independent of the other steps.
>
> Attribute intent chains are completely different. They link steps in
> a state machine together in a non-trivial, highly dependent chain.
> We can't just restart the chain in the middle like we can for the
> BUI->RUI->CUI->EFI chain because the on-disk attribute is in an
> unknown state and recovering that exact state is .... complex.
>
> Hence the first step of recovery is to return the attribute we
> are trying to modify back to a known state. That means we have to
> perform a removal of any existing attribute under that name first.
> Hence this first step should be replacing the existing attr intent
> with the intent that defines the recovery operation we are going to
> perform.
>
> That means we need to translate set to replace so that cleanup is
> run first, replace needs to clean up the attr under that name
> regardless of whether it has the incomplete bit set on it or not.
> Remove is the only operation that runs the same as at runtime, as
> cleanup for remove is just repeating the remove operation from
> scratch.
>
> > I /think/ the answer to that question is that we might need to move
> > the log tail forward to free enough log space to finish the intent
> > items, so creating the extra xattrd/xattri (a) avoids the
> > complexity of submitting an incore intent item *and* a log intent
> > item to the defer ops machinery; and (b) avoids livelocks in log
> > recovery. Therefore, we actually need to do it this way.
>
> We really need the initial operation to rewrite the intent to match
> the recovery operation we are going to perform. Everything else is
> secondary.
>
> Cheers,
>
> Dave.
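
For reference, the mapping Dave describes above (set becomes replace, replace stays replace, remove stays remove) can be sketched roughly as below. This is only an illustration of the idea, not code from the patch or from XFS: the enum, its values, and recovery_op_for() are hypothetical names made up for this sketch and do not correspond to the kernel's XFS_ATTRI_OP_FLAGS_* definitions.

```c
#include <assert.h>

/*
 * Hypothetical operation codes for this sketch only; these are NOT the
 * kernel's XFS_ATTRI_OP_FLAGS_* values.
 */
enum attr_op {
	ATTR_OP_SET,
	ATTR_OP_REPLACE,
	ATTR_OP_REMOVE,
};

/*
 * Pick the operation that recovery should run for a logged intent.
 * (Hypothetical helper, assumed for illustration.)
 *
 * Set and replace both become replace, so that any attr already present
 * under that name is cleaned up first. Remove is unchanged, since
 * repeating the remove from scratch is its own cleanup.
 */
enum attr_op recovery_op_for(enum attr_op logged)
{
	switch (logged) {
	case ATTR_OP_SET:
	case ATTR_OP_REPLACE:
		return ATTR_OP_REPLACE;
	case ATTR_OP_REMOVE:
		return ATTR_OP_REMOVE;
	}

	/* Unknown op: a corrupt intent in this toy model. */
	assert(0 && "unknown attr op");
	return logged;
}
```

In this model the first recovery step would rewrite the recovered intent with recovery_op_for(logged_op) before any attr work is done, which is the "rewrite the intent to match the recovery operation" step Dave refers to.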