On Mon, Aug 26, 2024 at 02:33:08PM +1000, Dave Chinner wrote: > On Thu, Aug 22, 2024 at 05:25:36PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > Teach the EFI mechanism how to free realtime extents. We're going to > > need this to enforce proper ordering of operations when we enable > > realtime rmap. > > > > Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer > > ops for rt extents. This keeps the ondisk artifacts and processing code > > completely separate between the rt and non-rt cases. Hopefully this > > will make it easier to debug filesystem problems. > > Doesn't this now require busy extent tracking for rt extents that > are being freed? i.e. they get marked as free with the EFD, but > cannot be reallocated (or discarded) until the EFD is committed to > disk. > > we don't allow user data allocation on the data device to reuse busy > ranges because the freeing of the extent has not yet been committed > to the journal. Because we use async transaction commits, that means > we can return to userspace without even the EFI in the journal - it > can still be in memory in the CIL. Hence we cannot allow userspace > to reallocate that range and write to it, even though it is marked free in the > in-memory metadata. Ah, that's a good point -- in memory the bunmapi -> RTEFI -> RTEFD -> rtalloc -> bmapi transactions succeed, userspace writes to the file blocks, then the log goes down without completing /any/ of those transactions, and now a read of the old file gets new contents. > If userspace then does a write and then we crash without the > original EFI on disk, then we've just violated metadata vs data > update ordering because recovery will not replay the extent free nor > the new allocation, yet the data in that extent will have been > changed. > > Hence I think that if we are moving to intent based freeing of real > time extents, we absolutely need to add support for busy extent > tracking to realtime groups before we enable EFIs on realtime > groups..... Yep. As a fringe benefit, we'd be able to support issuing discards from FITRIM without holding the rtbitmap lock, and -o discard on rt extents too. > Also .... > > > @@ -447,6 +467,17 @@ xfs_extent_free_defer_add( > > > > trace_xfs_extent_free_defer(mp, xefi); > > > > + if (xfs_efi_is_realtime(xefi)) { > > + xfs_rgnumber_t rgno; > > + > > + rgno = xfs_rtb_to_rgno(mp, xefi->xefi_startblock); > > + xefi->xefi_rtg = xfs_rtgroup_get(mp, rgno); > > + > > + *dfpp = xfs_defer_add(tp, &xefi->xefi_list, > > + &xfs_rtextent_free_defer_type); > > + return; > > + } > > + > > xefi->xefi_pag = xfs_perag_intent_get(mp, xefi->xefi_startblock); > > if (xefi->xefi_agresv == XFS_AG_RESV_AGFL) > > *dfpp = xfs_defer_add(tp, &xefi->xefi_list, > > Hmmmm. Isn't this also missing the xfs_drain intent interlocks that > allow online repair to wait until all the intents outstanding on a > group complete? Yep. I forgot about that. > > @@ -687,6 +735,106 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = { > > .relog_intent = xfs_extent_free_relog_intent, > > }; > > > > +#ifdef CONFIG_XFS_RT > > +/* Sort realtime efi items by rtgroup for efficiency. */ > > +static int > > +xfs_rtextent_free_diff_items( > > + void *priv, > > + const struct list_head *a, > > + const struct list_head *b) > > +{ > > + struct xfs_extent_free_item *ra = xefi_entry(a); > > + struct xfs_extent_free_item *rb = xefi_entry(b); > > + > > + return ra->xefi_rtg->rtg_rgno - rb->xefi_rtg->rtg_rgno; > > +} > > + > > +/* Create a realtime extent freeing */ > > +static struct xfs_log_item * > > +xfs_rtextent_free_create_intent( > > + struct xfs_trans *tp, > > + struct list_head *items, > > + unsigned int count, > > + bool sort) > > +{ > > + struct xfs_mount *mp = tp->t_mountp; > > + struct xfs_efi_log_item *efip; > > + struct xfs_extent_free_item *xefi; > > + > > + ASSERT(count > 0); > > + > > + efip = xfs_efi_init(mp, XFS_LI_EFI_RT, count); > > + if (sort) > > + list_sort(mp, items, xfs_rtextent_free_diff_items); > > + list_for_each_entry(xefi, items, xefi_list) > > + xfs_extent_free_log_item(tp, efip, xefi); > > + return &efip->efi_item; > > +} > > Hmmmm - when would we get an XFS_LI_EFI_RT with multiple extents in > it? We only ever free a single user data extent per transaction at a > time, right? There will be no metadata blocks being freed on the rt > device - all the BMBT, refcountbt and rmapbt blocks that get freed > as a result of freeing the user data extent will be in the data > device and so will use EFIs, not EFI_RTs.... Later on when we get to reflink, a refcount decrement operation on an extent that has a mix of single and multiple-owned blocks can generate RTEFIs with multiple extents. > > + > > +/* Cancel a realtime extent freeing. */ > > +STATIC void > > +xfs_rtextent_free_cancel_item( > > + struct list_head *item) > > +{ > > + struct xfs_extent_free_item *xefi = xefi_entry(item); > > + > > + xfs_rtgroup_put(xefi->xefi_rtg); > > + kmem_cache_free(xfs_extfree_item_cache, xefi); > > +} > > + > > +/* Process a free realtime extent. */ > > +STATIC int > > +xfs_rtextent_free_finish_item( > > + struct xfs_trans *tp, > > + struct xfs_log_item *done, > > + struct list_head *item, > > + struct xfs_btree_cur **state) > > btree cursor .... > > > +{ > > + struct xfs_mount *mp = tp->t_mountp; > > + struct xfs_extent_free_item *xefi = xefi_entry(item); > > + struct xfs_efd_log_item *efdp = EFD_ITEM(done); > > + struct xfs_rtgroup **rtgp = (struct xfs_rtgroup **)state; > > ... but is apparently holding a xfs_rtgroup. that's kinda nasty, and > the rtg the xefi is supposed to be associated with is already held > by the xefi, so.... It's very nasty, and I preferred when it was just a void**. Maybe we should just change that to a: struct xfs_intent_item_state { struct xfs_btree_cur *cur; struct xfs_rtgroup *rtg; }; and pass that around? At least then the compiler can typecheck that for us. > > + int error = 0; > > + > > + trace_xfs_extent_free_deferred(mp, xefi); > > + > > + if (!(xefi->xefi_flags & XFS_EFI_CANCELLED)) { > > + if (*rtgp != xefi->xefi_rtg) { > > + xfs_rtgroup_lock(xefi->xefi_rtg, XFS_RTGLOCK_BITMAP); > > + xfs_rtgroup_trans_join(tp, xefi->xefi_rtg, > > + XFS_RTGLOCK_BITMAP); > > + *rtgp = xefi->xefi_rtg; > > How does this case happen? Why is it safe to lock the xefi rtg > here, and why are we returning the xefi rtg to the caller without > taking extra references or dropping the rtg the caller passed in? > > At least a comment explaining what is happening is necessary here... Hmm, I wonder when /is/ this possible? I don't think it can actually happen ... except maybe in the case of a bunmapi where we pass in a large bmbt_irec array? Let me investigate... The locks and ijoins will be dropped at transaction commit. > > + } > > + error = xfs_rtfree_blocks(tp, xefi->xefi_rtg, > > + xefi->xefi_startblock, xefi->xefi_blockcount); > > + } > > + if (error == -EAGAIN) { > > + xfs_efd_from_efi(efdp); > > + return error; > > + } > > + > > + xfs_efd_add_extent(efdp, xefi); > > + xfs_rtextent_free_cancel_item(item); > > + return error; > > +} > > + > > +const struct xfs_defer_op_type xfs_rtextent_free_defer_type = { > > + .name = "rtextent_free", > > + .max_items = XFS_EFI_MAX_FAST_EXTENTS, > > + .create_intent = xfs_rtextent_free_create_intent, > > + .abort_intent = xfs_extent_free_abort_intent, > > + .create_done = xfs_extent_free_create_done, > > + .finish_item = xfs_rtextent_free_finish_item, > > + .cancel_item = xfs_rtextent_free_cancel_item, > > + .recover_work = xfs_extent_free_recover_work, > > + .relog_intent = xfs_extent_free_relog_intent, > > +}; > > +#else > > +const struct xfs_defer_op_type xfs_rtextent_free_defer_type = { > > + .name = "rtextent_free", > > +}; > > +#endif /* CONFIG_XFS_RT */ > > + > > STATIC bool > > xfs_efi_item_match( > > struct xfs_log_item *lip, > > @@ -731,7 +879,7 @@ xlog_recover_efi_commit_pass2( > > return -EFSCORRUPTED; > > } > > > > - efip = xfs_efi_init(mp, efi_formatp->efi_nextents); > > + efip = xfs_efi_init(mp, ITEM_TYPE(item), efi_formatp->efi_nextents); > > error = xfs_efi_copy_format(&item->ri_buf[0], &efip->efi_format); > > if (error) { > > xfs_efi_item_free(efip); > > @@ -749,6 +897,58 @@ const struct xlog_recover_item_ops xlog_efi_item_ops = { > > .commit_pass2 = xlog_recover_efi_commit_pass2, > > }; > > > > +#ifdef CONFIG_XFS_RT > > +STATIC int > > +xlog_recover_rtefi_commit_pass2( > > + struct xlog *log, > > + struct list_head *buffer_list, > > + struct xlog_recover_item *item, > > + xfs_lsn_t lsn) > > +{ > > + struct xfs_mount *mp = log->l_mp; > > + struct xfs_efi_log_item *efip; > > + struct xfs_efi_log_format *efi_formatp; > > + int error; > > + > > + efi_formatp = item->ri_buf[0].i_addr; > > + > > + if (item->ri_buf[0].i_len < xfs_efi_log_format_sizeof(0)) { > > + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, > > + item->ri_buf[0].i_addr, item->ri_buf[0].i_len); > > + return -EFSCORRUPTED; > > + } > > + > > + efip = xfs_efi_init(mp, ITEM_TYPE(item), efi_formatp->efi_nextents); > > + error = xfs_efi_copy_format(&item->ri_buf[0], &efip->efi_format); > > + if (error) { > > + xfs_efi_item_free(efip); > > + return error; > > + } > > + atomic_set(&efip->efi_next_extent, efi_formatp->efi_nextents); > > + > > + xlog_recover_intent_item(log, &efip->efi_item, lsn, > > + &xfs_rtextent_free_defer_type); > > + return 0; > > +} > > +#else > > +STATIC int > > +xlog_recover_rtefi_commit_pass2( > > + struct xlog *log, > > + struct list_head *buffer_list, > > + struct xlog_recover_item *item, > > + xfs_lsn_t lsn) > > +{ > > + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, log->l_mp, > > + item->ri_buf[0].i_addr, item->ri_buf[0].i_len); > > + return -EFSCORRUPTED; > > This needs to be a more meaningful error. It's not technically a > corruption - we recognised that an RTEFI is needing to be recovered, > but this kernel does not have RTEFI support compiled in. Hence the > error should be something along the lines of > > "RTEFI found in journal, but kernel not compiled with CONFIG_XFS_RT enabled. > Cannot recover journal, please remount using a kernel with RT device > support enabled." Ok. That should probably get applied to the RTRUI and RTCUI recovery stubs too. --D > -Dave. > > -- > Dave Chinner > david@xxxxxxxxxxxxx >