Re: [QUESTION] Upgrade xfs filesystem to reflink support?

On Wed, May 11, 2022 at 08:05:23AM +1000, Dave Chinner wrote:
> On Tue, May 10, 2022 at 12:02:12PM -0700, Darrick J. Wong wrote:
> > On Tue, May 10, 2022 at 09:21:03AM +0300, Amir Goldstein wrote:
> > > On Mon, May 9, 2022 at 9:20 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> > > > I think the upcoming nrext64 xfsprogs patches took in the first patch in
> > > > that series.
> > > >
> > > > Question: Now that mkfs has a min logsize of 64MB, should we refuse
> > > > upgrades for any filesystem with logsize < 64MB?
> > > 
> > > I think that would make a lot of sense. We do need to reduce the upgrade
> > > test matrix as much as we can, at least as a starting point.
> > > Our customers would have started with at least a 1TB fs, so they
> > > should not have a problem with the minimum logsize on upgrade.
> > > 
> > > BTW, at LSFMM, Ted had a session on "Resize patterns" regarding the
> > > practice of users starting with a small fs and growing it, which is
> > > encouraged by cloud providers' pricing models.
> > > 
> > > I had asked Ted about the option of resizing the ext4 journal and he
> > > replied that in theory it could be done, because the ext4 journal
> > > does not need to be contiguous. He thought that was not the case for
> > > XFS, though.
> > 
> > It's theoretically possible, but I'd bet that making it work reliably
> > will be difficult for an infrequent operation.  The old log would probably
> > have to clean itself, and then write a single transaction containing
> > both the bnobt update to allocate the new log as well as an EFI to erase
> > it.  Then you write to the new log a single transaction containing the
> > superblock and an EFI to free the old log.  Then you update the primary
> > super and force it out to disk, un-quiesce the log, and finish that EFI
> > so that the old log gets freed.
> > 
> > And then you have to go back and find the necessary parts that I missed.
> 
> The new log transaction to say "the new log is over there" so log
> recovery knows that the old log is being replaced and can go find
> the new log and recover it to free the old log.
> 
> IOWs, there's a heap of log recovery work needed, a new
> intent/transaction type, futzing with feature bits because old
> kernels won't be able to recover such an operation, etc.
> 
> Then there's interesting issues that haven't ever been considered,
> like having a discontiguity in the LSN as we physically switch logs.
> What cycle number does the new log start at? What happens to all the
> head and tail tracking fields when we switch to the new log? What
> about all the log items in the AIL which is ordered by LSN? What
> about all the active log items that track a specific LSN for
> recovery integrity purposes (e.g. inode allocation buffers)? What
> about updating the reservation grant heads that track log space
> usage? And all the static size calculations used by the log code
> have to be updated before the new log can be written to via iclogs.
> 

If XFS were going to support an online switchover of the physical log,
why not do so across a quiesce? Trying to do such a thing with active
records, log items, etc. that are unrelated to the operation seems
unnecessarily complex to me.

> The allocation of the new log extent and the freeing of the old log
> extent is the easy bit. Handling the failure cases to provide an
> atomic, always recoverable switch and managing all the runtime state
> and accounting changes that are necessary is the hard part...
> 

That suggests the "hard part" of the problem is primarily the online
switchover, but is that necessarily a strict requirement for a
reasonably useful/viable feature? ISTM that even being able to increase
the size of the log offline could be quite helpful for a filesystem that
has been grown via the cloudy very-small -> very-large antipattern. It
not only provides a recovery path for regular end users, but at least
gives the cloudy dev guys a step to run during image deployment to avoid
the problem.

TBH, if one were to go to the trouble of making the log resizeable, I
start to wonder whether it's worth starting with a format change that
better accommodates future flexibility. For example, the internal log is
already AG-allocated space... why not do something like assign it to an
internal log inode attached to the sb? Then the log inode has the
obvious capability to allocate or free (non-active log) extents at
runtime through all the usual codepaths, without disruption, because the
log itself only cares about a target device, block offset and size. We
already know a bump of the log cycle count is sufficient for consistency
across a clean mount cycle, because repair has been zapping clean logs
that way by default for pretty much forever.

That potentially reduces log reallocation to a switchover algorithm that
could run at mount time. I.e., a new prospective log extent is allocated
at runtime (and maybe flagged with an xattr or something). The next
mount identifies the new/prospective log, requires/verifies that the old
log is clean, selects the new log extent (based on some currently
undefined selection algorithm) and seeds it with the appropriate cycle
count via synchronous transactions that release any currently inactive
extent(s) from the log inode. Any failure along the way sticks with the
old log and releases the still-inactive new extent, if it happens to
exist. We already do this sort of stale resource cleanup for other
things like unlinked inodes and stale COW blocks, so the general premise
exists... hm?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 



