Re: [PATCH 17/26] xfs: support logging EFIs for realtime extents

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 27 Aug 2024 11:36:57 +1000

On Mon, Aug 26, 2024 at 12:38:35PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 26, 2024 at 02:33:08PM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2024 at 05:25:36PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > 
> > > Teach the EFI mechanism how to free realtime extents.  We're going to
> > > need this to enforce proper ordering of operations when we enable
> > > realtime rmap.
> > > 
> > > Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer
> > > ops for rt extents.  This keeps the ondisk artifacts and processing code
> > > completely separate between the rt and non-rt cases.  Hopefully this
> > > will make it easier to debug filesystem problems.
> > 
> > Doesn't this now require busy extent tracking for rt extents that
> > are being freed?  i.e. they get marked as free with the EFD, but
> > cannot be reallocated (or discarded) until the EFD is committed to
> > disk.
> > 
> > we don't allow user data allocation on the data device to reuse busy
> > ranges because the freeing of the extent has not yet been committed
> > to the journal. Because we use async transaction commits, that means
> > we can return to userspace without even the EFI in the journal - it
> > can still be in memory in the CIL. Hence we cannot allow userspace
> > to reallocate that range and write to it, even though it is marked free in the
> > in-memory metadata.
> 
> Ah, that's a good point -- in memory the bunmapi -> RTEFI -> RTEFD ->
> rtalloc -> bmapi transactions succeed, userspace writes to the file
> blocks, then the log goes down without completing /any/ of those
> transactions, and now a read of the old file gets new contents.

*nod*

> > If userspace then does a write and then we crash without the
> > original EFI on disk, then we've just violated metadata vs data
> > update ordering because recovery will not replay the extent free nor
> > the new allocation, yet the data in that extent will have been
> > changed.
> > 
> > Hence I think that if we are moving to intent based freeing of real
> > time extents, we absolutely need to add support for busy extent
> > tracking to realtime groups before we enable EFIs on realtime
> > groups.....
> 
> Yep.  As a fringe benefit, we'd be able to support issuing discards from
> FITRIM without holding the rtbitmap lock, and -o discard on rt extents
> too.

Yes. And I suspect that if we unify the perag and rtg into a single
group abstraction, the busy extent tracking will work for both
allocators without much functional change being needed at all...

> > Also ....
> > 
> > > @@ -447,6 +467,17 @@ xfs_extent_free_defer_add(
> > >  
> > >  	trace_xfs_extent_free_defer(mp, xefi);
> > >  
> > > +	if (xfs_efi_is_realtime(xefi)) {
> > > +		xfs_rgnumber_t		rgno;
> > > +
> > > +		rgno = xfs_rtb_to_rgno(mp, xefi->xefi_startblock);
> > > +		xefi->xefi_rtg = xfs_rtgroup_get(mp, rgno);
> > > +
> > > +		*dfpp = xfs_defer_add(tp, &xefi->xefi_list,
> > > +				&xfs_rtextent_free_defer_type);
> > > +		return;
> > > +	}
> > > +
> > >  	xefi->xefi_pag = xfs_perag_intent_get(mp, xefi->xefi_startblock);
> > >  	if (xefi->xefi_agresv == XFS_AG_RESV_AGFL)
> > >  		*dfpp = xfs_defer_add(tp, &xefi->xefi_list,
> > 
> > Hmmmm. Isn't this also missing the xfs_drain intent interlocks that
> > allow online repair to wait until all the intents outstanding on a
> > group complete?
> 
> Yep.  I forgot about that.

Same comment about unified group infrastructure ;)

> > > +
> > > +/* Cancel a realtime extent freeing. */
> > > +STATIC void
> > > +xfs_rtextent_free_cancel_item(
> > > +	struct list_head		*item)
> > > +{
> > > +	struct xfs_extent_free_item	*xefi = xefi_entry(item);
> > > +
> > > +	xfs_rtgroup_put(xefi->xefi_rtg);
> > > +	kmem_cache_free(xfs_extfree_item_cache, xefi);
> > > +}
> > > +
> > > +/* Process a free realtime extent. */
> > > +STATIC int
> > > +xfs_rtextent_free_finish_item(
> > > +	struct xfs_trans		*tp,
> > > +	struct xfs_log_item		*done,
> > > +	struct list_head		*item,
> > > +	struct xfs_btree_cur		**state)
> > 
> > btree cursor ....
> > 
> > > +{
> > > +	struct xfs_mount		*mp = tp->t_mountp;
> > > +	struct xfs_extent_free_item	*xefi = xefi_entry(item);
> > > +	struct xfs_efd_log_item		*efdp = EFD_ITEM(done);
> > > +	struct xfs_rtgroup		**rtgp = (struct xfs_rtgroup **)state;
> > 
> > ... but is apparently holding a xfs_rtgroup. that's kinda nasty, and
> > the rtg the xefi is supposed to be associated with is already held
> > by the xefi, so....
> 
> It's very nasty, and I preferred when it was just a void**.  Maybe we
> should just change that to a:
> 
> struct xfs_intent_item_state {
> 	struct xfs_btree_cur	*cur;
> 	struct xfs_rtgroup	*rtg;
> };
> 
> and pass that around?  At least then the compiler can typecheck that for
> us.

Sounds good to me. :)

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx