On Mon, Aug 26, 2024 at 12:38:35PM -0700, Darrick J. Wong wrote: > On Mon, Aug 26, 2024 at 02:33:08PM +1000, Dave Chinner wrote: > > On Thu, Aug 22, 2024 at 05:25:36PM -0700, Darrick J. Wong wrote: > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > Teach the EFI mechanism how to free realtime extents. We're going to > > > need this to enforce proper ordering of operations when we enable > > > realtime rmap. > > > > > > Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer > > > ops for rt extents. This keeps the ondisk artifacts and processing code > > > completely separate between the rt and non-rt cases. Hopefully this > > > will make it easier to debug filesystem problems. > > > > Doesn't this now require busy extent tracking for rt extents that > > are being freed? i.e. they get marked as free with the EFD, but > > cannot be reallocated (or discarded) until the EFD is committed to > > disk. > > > > we don't allow user data allocation on the data device to reuse busy > > ranges because the freeing of the extent has not yet been committed > > to the journal. Because we use async transaction commits, that means > > we can return to userspace without even the EFI in the journal - it > > can still be in memory in the CIL. Hence we cannot allow userspace > > to reallocate that range and write to it, even though it is marked free in the > > in-memory metadata. > > Ah, that's a good point -- in memory the bunmapi -> RTEFI -> RTEFD -> > rtalloc -> bmapi transactions succeed, userspace writes to the file > blocks, then the log goes down without completing /any/ of those > transactions, and now a read of the old file gets new contents. *nod* > > If userspace then does a write and then we crash without the > > original EFI on disk, then we've just violated metadata vs data > > update ordering because recovery will not replay the extent free nor > > the new allocation, yet the data in that extent will have been > > changed. > > > > Hence I think that if we are moving to intent based freeing of real > > time extents, we absolutely need to add support for busy extent > > tracking to realtime groups before we enable EFIs on realtime > > groups..... > > Yep. As a fringe benefit, we'd be able to support issuing discards from > FITRIM without holding the rtbitmap lock, and -o discard on rt extents > too. Yes. And I suspect that if we unify the perag and rtg into a single group abstraction, the busy extent tracking will work for both allocators without much functional change being needed at all... > > Also .... > > > > > @@ -447,6 +467,17 @@ xfs_extent_free_defer_add( > > > > > > trace_xfs_extent_free_defer(mp, xefi); > > > > > > + if (xfs_efi_is_realtime(xefi)) { > > > + xfs_rgnumber_t rgno; > > > + > > > + rgno = xfs_rtb_to_rgno(mp, xefi->xefi_startblock); > > > + xefi->xefi_rtg = xfs_rtgroup_get(mp, rgno); > > > + > > > + *dfpp = xfs_defer_add(tp, &xefi->xefi_list, > > > + &xfs_rtextent_free_defer_type); > > > + return; > > > + } > > > + > > > xefi->xefi_pag = xfs_perag_intent_get(mp, xefi->xefi_startblock); > > > if (xefi->xefi_agresv == XFS_AG_RESV_AGFL) > > > *dfpp = xfs_defer_add(tp, &xefi->xefi_list, > > > > Hmmmm. Isn't this also missing the xfs_drain intent interlocks that > > allow online repair to wait until all the intents outstanding on a > > group complete? > > Yep. I forgot about that. Same comment about unified group infrastructure ;) > > > + > > > +/* Cancel a realtime extent freeing. */ > > > +STATIC void > > > +xfs_rtextent_free_cancel_item( > > > + struct list_head *item) > > > +{ > > > + struct xfs_extent_free_item *xefi = xefi_entry(item); > > > + > > > + xfs_rtgroup_put(xefi->xefi_rtg); > > > + kmem_cache_free(xfs_extfree_item_cache, xefi); > > > +} > > > + > > > +/* Process a free realtime extent. */ > > > +STATIC int > > > +xfs_rtextent_free_finish_item( > > > + struct xfs_trans *tp, > > > + struct xfs_log_item *done, > > > + struct list_head *item, > > > + struct xfs_btree_cur **state) > > > > btree cursor .... > > > > > +{ > > > + struct xfs_mount *mp = tp->t_mountp; > > > + struct xfs_extent_free_item *xefi = xefi_entry(item); > > > + struct xfs_efd_log_item *efdp = EFD_ITEM(done); > > > + struct xfs_rtgroup **rtgp = (struct xfs_rtgroup **)state; > > > > ... but is apparently holding a xfs_rtgroup. that's kinda nasty, and > > the rtg the xefi is supposed to be associated with is already held > > by the xefi, so.... > > It's very nasty, and I preferred when it was just a void**. Maybe we > should just change that to a: > > struct xfs_intent_item_state { > struct xfs_btree_cur *cur; > struct xfs_rtgroup *rtg; > }; > > and pass that around? At least then the compiler can typecheck that for > us. Sounds good to me. :) -Dave. -- Dave Chinner david@xxxxxxxxxxxxx