On Wed, Sep 09, 2020 at 06:19:10PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > Recovering an EFI currently uses a itruncate reservation, which is > designed for a rolling transaction that modifies the BMBT and > logs the EFI in one commit, then frees the space and logs the EFD in > the second commit. > > Recovering the EFI only requires the second transaction in this > pair, and hence has a smaller log space requirement than a truncate > operation. Hence when the extent free is being processed at runtime, > the log reservation that is held by the filesystem is only enough to > complete the extent free, not the entire truncate operation. > > Hence if the EFI pins the tail of the log and the log fills up while > the extent is being freed, the amount of reserved free space in the > log is not enough to start another entire truncate operation. Hence > if we crash at this point, log recovery will deadlock with the EFI > pinning the tail of the log and the log not having enough free space > to reserve an itruncate transaction. > > As such, EFI recovery needs it's own log space reservation separate > to the itruncate reservation. We only need what is required free the > extent, and this matches the space we have reserved at runtime for > this operation and hence should prevent the recovery deadlock from > occurring. > > This patch adds the new reservation in a way that minimises the > change such that it should be back-portable to older kernels easily. > Follow up patches will factor and rework the reservations to be more > correct and more tightly defined. > > Note: this would appear to be a generic problem with intent > recovery; we use the entire operation reservation for recovery, > not the reservation that was held at runtime after the intent was > logged. I suspect all intents are going to require their own > reservation as a result. > It might be worth explicitly pointing out that support for EFI/EFD intents goes farther back than the various intents associated with newer features, hence the value of a targeted fix. Otherwise the problem description makes sense and approach seems reasonable to me. Thanks for the writeup. Some questions... > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/libxfs/xfs_trans_resv.c | 10 ++++++++++ > fs/xfs/libxfs/xfs_trans_resv.h | 2 ++ > fs/xfs/xfs_extfree_item.c | 2 +- > 3 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c > index d1a0848cb52e..da2ec052ac0a 100644 > --- a/fs/xfs/libxfs/xfs_trans_resv.c > +++ b/fs/xfs/libxfs/xfs_trans_resv.c > @@ -916,6 +916,16 @@ xfs_trans_resv_calc( > resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT; > resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES; > > + /* > + * Log recovery reservations for intent replay > + * > + * EFI recovery is itruncate minus the initial transaction that logs > + * logs the EFI. > + */ > + resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres; > + resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1; tr_itruncate.tr_logcount looks like it's either 2 or 8 depending on whether reflink is enabled. On one hand this seems conservative enough, but do we know exactly what those extra unit counts are accounted for in the reflink case? It looks like extents are only freed when the last reference is dropped (otherwise we log a refcount intent), which makes me wonder whether we really need 7 log count units if recovery encounters an EFI. Also, while looking through the code I noticed that truncate does the following: ... error = xfs_defer_finish(&tp); if (error) goto out; error = xfs_trans_roll_inode(&tp, ip); if (error) goto out; ... ... which looks like it rolls the transaction an extra time per-extent. I don't think that contributes to this problem vs just being a runtime inefficiency, so maybe I'll fling a patch up for that separately. > + resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES; > + > /* > * The following transactions are logged in logical format with > * a default log count. > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h > index 7241ab28cf84..13173b3eaac9 100644 > --- a/fs/xfs/libxfs/xfs_trans_resv.h > +++ b/fs/xfs/libxfs/xfs_trans_resv.h > @@ -50,6 +50,8 @@ struct xfs_trans_resv { > struct xfs_trans_res tr_qm_equotaoff;/* end of turn quota off */ > struct xfs_trans_res tr_sb; /* modify superblock */ > struct xfs_trans_res tr_fsyncts; /* update timestamps on fsync */ > + struct xfs_trans_res tr_efi; /* EFI log item recovery */ > + Extra whitespace line. Brian > }; > > /* shorthand way of accessing reservation structure */ > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c > index 6cb8cd11072a..1ea9ab4cd44e 100644 > --- a/fs/xfs/xfs_extfree_item.c > +++ b/fs/xfs/xfs_extfree_item.c > @@ -618,7 +618,7 @@ xfs_efi_item_recover( > } > } > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_efi, 0, 0, 0, &tp); > if (error) > return error; > efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); > -- > 2.28.0 >