On Thu, Aug 22, 2024 at 05:25:36PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > Teach the EFI mechanism how to free realtime extents. We're going to > need this to enforce proper ordering of operations when we enable > realtime rmap. > > Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer > ops for rt extents. This keeps the ondisk artifacts and processing code > completely separate between the rt and non-rt cases. Hopefully this > will make it easier to debug filesystem problems. Doesn't this now require busy extent tracking for rt extents that are being freed? i.e. they get marked as free with the EFD, but cannot be reallocated (or discarded) until the EFD is committed to disk. we don't allow user data allocation on the data device to reuse busy ranges because the freeing of the extent has not yet been committed to the journal. Because we use async transaction commits, that means we can return to userspace without even the EFI in the journal - it can still be in memory in the CIL. Hence we cannot allow userspace to reallocate that range and write to it, even though it is marked free in the in-memory metadata. If userspace then does a write and then we crash without the original EFI on disk, then we've just violated metadata vs data update ordering because recovery will not replay the extent free nor the new allocation, yet the data in that extent will have been changed. Hence I think that if we are moving to intent based freeing of real time extents, we absolutely need to add support for busy extent tracking to realtime groups before we enable EFIs on realtime groups..... Also .... > @@ -447,6 +467,17 @@ xfs_extent_free_defer_add( > > trace_xfs_extent_free_defer(mp, xefi); > > + if (xfs_efi_is_realtime(xefi)) { > + xfs_rgnumber_t rgno; > + > + rgno = xfs_rtb_to_rgno(mp, xefi->xefi_startblock); > + xefi->xefi_rtg = xfs_rtgroup_get(mp, rgno); > + > + *dfpp = xfs_defer_add(tp, &xefi->xefi_list, > + &xfs_rtextent_free_defer_type); > + return; > + } > + > xefi->xefi_pag = xfs_perag_intent_get(mp, xefi->xefi_startblock); > if (xefi->xefi_agresv == XFS_AG_RESV_AGFL) > *dfpp = xfs_defer_add(tp, &xefi->xefi_list, Hmmmm. Isn't this also missing the xfs_drain intent interlocks that allow online repair to wait until all the intents outstanding on a group complete? > @@ -687,6 +735,106 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = { > .relog_intent = xfs_extent_free_relog_intent, > }; > > +#ifdef CONFIG_XFS_RT > +/* Sort realtime efi items by rtgroup for efficiency. */ > +static int > +xfs_rtextent_free_diff_items( > + void *priv, > + const struct list_head *a, > + const struct list_head *b) > +{ > + struct xfs_extent_free_item *ra = xefi_entry(a); > + struct xfs_extent_free_item *rb = xefi_entry(b); > + > + return ra->xefi_rtg->rtg_rgno - rb->xefi_rtg->rtg_rgno; > +} > + > +/* Create a realtime extent freeing */ > +static struct xfs_log_item * > +xfs_rtextent_free_create_intent( > + struct xfs_trans *tp, > + struct list_head *items, > + unsigned int count, > + bool sort) > +{ > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_efi_log_item *efip; > + struct xfs_extent_free_item *xefi; > + > + ASSERT(count > 0); > + > + efip = xfs_efi_init(mp, XFS_LI_EFI_RT, count); > + if (sort) > + list_sort(mp, items, xfs_rtextent_free_diff_items); > + list_for_each_entry(xefi, items, xefi_list) > + xfs_extent_free_log_item(tp, efip, xefi); > + return &efip->efi_item; > +} Hmmmm - when would we get an XFS_LI_EFI_RT with multiple extents in it? We only ever free a single user data extent per transaction at a time, right? There will be no metadata blocks being freed on the rt device - all the BMBT, refcountbt and rmapbt blocks that get freed as a result of freeing the user data extent will be in the data device and so will use EFIs, not EFI_RTs.... > + > +/* Cancel a realtime extent freeing. */ > +STATIC void > +xfs_rtextent_free_cancel_item( > + struct list_head *item) > +{ > + struct xfs_extent_free_item *xefi = xefi_entry(item); > + > + xfs_rtgroup_put(xefi->xefi_rtg); > + kmem_cache_free(xfs_extfree_item_cache, xefi); > +} > + > +/* Process a free realtime extent. */ > +STATIC int > +xfs_rtextent_free_finish_item( > + struct xfs_trans *tp, > + struct xfs_log_item *done, > + struct list_head *item, > + struct xfs_btree_cur **state) btree cursor .... > +{ > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_extent_free_item *xefi = xefi_entry(item); > + struct xfs_efd_log_item *efdp = EFD_ITEM(done); > + struct xfs_rtgroup **rtgp = (struct xfs_rtgroup **)state; ... but is apparently holding a xfs_rtgroup. that's kinda nasty, and the rtg the xefi is supposed to be associated with is already held by the xefi, so.... > + int error = 0; > + > + trace_xfs_extent_free_deferred(mp, xefi); > + > + if (!(xefi->xefi_flags & XFS_EFI_CANCELLED)) { > + if (*rtgp != xefi->xefi_rtg) { > + xfs_rtgroup_lock(xefi->xefi_rtg, XFS_RTGLOCK_BITMAP); > + xfs_rtgroup_trans_join(tp, xefi->xefi_rtg, > + XFS_RTGLOCK_BITMAP); > + *rtgp = xefi->xefi_rtg; How does this case happen? Why is it safe to lock the xefi rtg here, and why are we returning the xefi rtg to the caller without taking extra references or dropping the rtg the caller passed in? At least a comment explaining what is happening is necessary here... > + } > + error = xfs_rtfree_blocks(tp, xefi->xefi_rtg, > + xefi->xefi_startblock, xefi->xefi_blockcount); > + } > + if (error == -EAGAIN) { > + xfs_efd_from_efi(efdp); > + return error; > + } > + > + xfs_efd_add_extent(efdp, xefi); > + xfs_rtextent_free_cancel_item(item); > + return error; > +} > + > +const struct xfs_defer_op_type xfs_rtextent_free_defer_type = { > + .name = "rtextent_free", > + .max_items = XFS_EFI_MAX_FAST_EXTENTS, > + .create_intent = xfs_rtextent_free_create_intent, > + .abort_intent = xfs_extent_free_abort_intent, > + .create_done = xfs_extent_free_create_done, > + .finish_item = xfs_rtextent_free_finish_item, > + .cancel_item = xfs_rtextent_free_cancel_item, > + .recover_work = xfs_extent_free_recover_work, > + .relog_intent = xfs_extent_free_relog_intent, > +}; > +#else > +const struct xfs_defer_op_type xfs_rtextent_free_defer_type = { > + .name = "rtextent_free", > +}; > +#endif /* CONFIG_XFS_RT */ > + > STATIC bool > xfs_efi_item_match( > struct xfs_log_item *lip, > @@ -731,7 +879,7 @@ xlog_recover_efi_commit_pass2( > return -EFSCORRUPTED; > } > > - efip = xfs_efi_init(mp, efi_formatp->efi_nextents); > + efip = xfs_efi_init(mp, ITEM_TYPE(item), efi_formatp->efi_nextents); > error = xfs_efi_copy_format(&item->ri_buf[0], &efip->efi_format); > if (error) { > xfs_efi_item_free(efip); > @@ -749,6 +897,58 @@ const struct xlog_recover_item_ops xlog_efi_item_ops = { > .commit_pass2 = xlog_recover_efi_commit_pass2, > }; > > +#ifdef CONFIG_XFS_RT > +STATIC int > +xlog_recover_rtefi_commit_pass2( > + struct xlog *log, > + struct list_head *buffer_list, > + struct xlog_recover_item *item, > + xfs_lsn_t lsn) > +{ > + struct xfs_mount *mp = log->l_mp; > + struct xfs_efi_log_item *efip; > + struct xfs_efi_log_format *efi_formatp; > + int error; > + > + efi_formatp = item->ri_buf[0].i_addr; > + > + if (item->ri_buf[0].i_len < xfs_efi_log_format_sizeof(0)) { > + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, > + item->ri_buf[0].i_addr, item->ri_buf[0].i_len); > + return -EFSCORRUPTED; > + } > + > + efip = xfs_efi_init(mp, ITEM_TYPE(item), efi_formatp->efi_nextents); > + error = xfs_efi_copy_format(&item->ri_buf[0], &efip->efi_format); > + if (error) { > + xfs_efi_item_free(efip); > + return error; > + } > + atomic_set(&efip->efi_next_extent, efi_formatp->efi_nextents); > + > + xlog_recover_intent_item(log, &efip->efi_item, lsn, > + &xfs_rtextent_free_defer_type); > + return 0; > +} > +#else > +STATIC int > +xlog_recover_rtefi_commit_pass2( > + struct xlog *log, > + struct list_head *buffer_list, > + struct xlog_recover_item *item, > + xfs_lsn_t lsn) > +{ > + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, log->l_mp, > + item->ri_buf[0].i_addr, item->ri_buf[0].i_len); > + return -EFSCORRUPTED; This needs to be a more meaningful error. It's not technically a corruption - we recognised that an RTEFI is needing to be recovered, but this kernel does not have RTEFI support compiled in. Hence the error should be something along the lines of "RTEFI found in journal, but kernel not compiled with CONFIG_XFS_RT enabled. Cannot recover journal, please remount using a kernel with RT device support enabled." -Dave. -- Dave Chinner david@xxxxxxxxxxxxx