On Wed, Sep 18, 2024 at 08:11:05AM +0200, Christoph Hellwig wrote:
> On Mon, Sep 16, 2024 at 11:28:26AM +1000, Dave Chinner wrote:
> > I'm missing something - the intents aren't processed until the log
> > has been recovered - queuing an intent to be processed does
> > not require the per-ag to be present. We don't take per-ag
> > references until we are recovering the intent. i.e. we've completed
> > journal recovery and haven't found the corresponding EFD.
> >
> > That leaves the EFI in the log->r_dfops, and we then run
> > ->recover_work in the second phase of recovery. It is
> > xfs_extent_free_recover_work() that creates the
> > new transaction and runs the EFI processing that requires the
> > perag references, isn't it?
> >
> > IOWs, I don't see where the initial EFI/EFD recovery during the
> > checkpoint processing requires the newly created perags to be
> > present in memory for processing incomplete EFIs before the journal
> > recovery phase has completed.
>
> So my new test actually blows up before creating intents:
>
> [   81.695529] XFS (nvme1n1): Mounting V5 Filesystem 07057234-4bec-4f17-97c5-420c71c83292
> [   81.704541] XFS (nvme1n1): Starting recovery (logdev: internal)
> [   81.707260] XFS (nvme1n1): xfs_buf_map_verify: daddr 0x40003 out of range, EOFS 0x40000
> [   81.707974] ------------[ cut here ]------------
> [   81.708376] WARNING: CPU: 1 PID: 5004 at fs/xfs/xfs_buf.c:553 xfs_buf_get_map+0x8b4/0xb70
>
> Because sb_dblocks hasn't been updated yet.

Hmmmmm. Ok, I can see how this would be an issue, but it's not the
issue the commit message describes :)

....

Oh, this is a -much worse- problem than I thought.

This is a replay of a modification to a new AGFL, and that *must*
only be replayed after the superblock modifications have been
replayed.

Ideally, we should not be using the new AGs until *after* the growfs
transaction has hit stable storage (i.e. the journal has fully
committed the growfs transaction), not just committed to the CIL.

If anything can access the new AGs beyond the old EOFS *before* the
growfs transaction is stable in the journal, then we have a nasty set
of race conditions where we can be making modifications to metadata
that is beyond EOFS in the journal *before* we've replayed the
superblock growfs modification. For example:

growfs task                     allocating task                 log worker
xfs_trans_set_sync
xfs_trans_commit
  xlog_cil_commit
    <items added to CIL>
    xfs_trans_unreserve_and_mod_sb
      mp->m_sb.sb_agcount += tp->t_agcount_delta;
    ->iop_committing()
      xfs_buf_item_committing
        xfs_buf_item_release
        <superblock buffer unlocked>
  <preempt>
                                xfs_bmap_btalloc
                                  xfs_bmap_btalloc_best_length
                                    for_each_perag_wrap(...)
                                    <sees new, uncommitted mp->m_sb.sb_agcount>
                                    <selects new AG beyond EOFS in journal>
                                <does allocation in new AG beyond EOFS in journal>
                                xfs_trans_commit()
                                  xlog_cil_commit()
                                    <items added to CIL>
                                ....
                                                                <log state in XLOG_STATE_COVER_NEED>
                                                                xfs_sync_sb(wait)
                                                                  locks sb buffer
                                                                  xfs_trans_set_sync
                                                                  xfs_trans_commit
                                                                    xlog_cil_commit
                                                                      <sb already in CIL>
                                                                      <updates sb item order id>
                                                                  xfs_log_force(SYNC)
                                                                    <pushes CIL>
  <runs again>
  xfs_log_force(SYNC)
    <pushes CIL>

This growfs vs allocating task race results in a checkpoint in the
journal where the new AGs are modified in the same checkpoint as the
growfs transaction. This will work if the superblock item is placed in
the journal before the items for the new AGs with your new code.

However...

....

when we add in the log worker (or some other transaction) updating the
superblock again before the CIL is pushed, we now have a reordering
problem. The CIL push will order all the items in the CIL according to
the order id attached to the log item, and this causes the superblock
item to be placed -after- the new AG items.

That will result in the exact recovery error you quoted above,
regardless of the fixes in this series.
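To make the reordering concrete, here's a trivial stand-alone model of
the push-time ordering rule described above. This is not the CIL code -
the item names, the numeric order ids and the relog step are all made
up for illustration:

/* Toy model of CIL push ordering - NOT the kernel code. */
#include <stdio.h>
#include <stdlib.h>

struct toy_log_item {
	const char	*name;
	unsigned long	order_id;	/* assigned when the item is (re)committed */
};

static int cmp_order_id(const void *a, const void *b)
{
	const struct toy_log_item *x = a, *y = b;

	return (x->order_id > y->order_id) - (x->order_id < y->order_id);
}

int main(void)
{
	struct toy_log_item cil[] = {
		{ "superblock",               1 },	/* logged by the growfs commit */
		{ "new AGF/AGFL/AGI buffers", 2 },	/* logged by the allocating task */
	};

	/*
	 * The log worker relogs the superblock before the CIL is pushed,
	 * so the sb item already sitting in the CIL picks up a newer
	 * order id...
	 */
	cil[0].order_id = 3;

	/* ...and the push-time sort now places it after the new AG items. */
	qsort(cil, 2, sizeof(cil[0]), cmp_order_id);
	printf("checkpoint item order: %s, then %s\n", cil[0].name, cil[1].name);
	return 0;
}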
<thinks about how to fix it>

I think the first step in avoiding this is ensuring we can't relog the
superblock until the growfs transaction is on disk. We can do that by
explicitly grabbing the superblock buffer and holding it before we
commit the transaction so the superblock buffer remains locked until
xfs_trans_commit() returns. That will prevent races with other sb
updates by blocking them on the sb buffer lock.

The second step is preventing allocations that are running from seeing
the mp->m_sb.sb_agcount update until after the transaction is stable.

Ok, xfs_trans_mod_sb(XFS_TRANS_SB_AGCOUNT) is only used by the growfs
code and the first step above means we have explicitly locked the sb
buffer. Hence the whole xfs_trans_mod_sb() -> delta in trans ->
xfs_trans_commit -> xfs_trans_apply_sb_deltas() -> lock sb buffer,
modify sb and log it -> xlog_cil_commit() song and dance routine can
go away.

i.e. We can modify the superblock buffer directly in the growfs code,
and then when xfs_trans_commit() returns we directly update the
in-memory superblock before unlocking the superblock buffer.
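Something like this, very roughly. This is a sketch only, not a tested
patch: the helper name is invented, I'm going from memory on the
existing helpers, only the two geometry fields are shown, and error
handling and the secondary superblocks are ignored:

static int
xfs_growfs_commit_sketch(
	struct xfs_mount	*mp,
	struct xfs_trans	*tp,
	xfs_agnumber_t		new_agcount,
	xfs_rfsblock_t		new_dblocks)
{
	struct xfs_dsb		*dsb;
	struct xfs_buf		*bp;
	int			error;

	/* Grab the sb buffer and keep it locked across the commit. */
	bp = xfs_trans_getsb(tp);
	xfs_trans_bhold(tp, bp);
	dsb = bp->b_addr;

	/*
	 * Write the new geometry straight into the on-disk superblock
	 * image and log the buffer - no xfs_trans_mod_sb() deltas, no
	 * xfs_trans_apply_sb_deltas() at commit time.
	 */
	dsb->sb_agcount = cpu_to_be32(new_agcount);
	dsb->sb_dblocks = cpu_to_be64(new_dblocks);
	xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);

	xfs_trans_set_sync(tp);
	error = xfs_trans_commit(tp);

	/*
	 * Only now, with the growfs checkpoint stable in the journal,
	 * expose the new geometry in the in-core superblock, then drop
	 * the held sb buffer so it can be relogged again.
	 */
	if (!error) {
		mp->m_sb.sb_agcount = new_agcount;
		mp->m_sb.sb_dblocks = new_dblocks;
	}
	xfs_buf_relse(bp);
	return error;
}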
This means that nothing can update the superblock before the growfs
transaction is stable in the journal, and nothing at runtime will see
that there are new AGs or space available until after the on disk
state has changed.

This then allows recovery to be able to update the superblock and
perag state after checkpoint processing is complete. All future
checkpoint recoveries will now be guaranteed to see the correct
superblock state, whilst the checkpoint the growfs update is in is
guaranteed to only be dependent on the old superblock state...

I suspect that we might have to move all superblock updates to this
model - we don't update the in-memory state until after the on-disk
update is on stable storage in the journal - that will avoid the
possibility of a racing/reordered feature bit update being replayed,
too.

The only issue with this is that I have a nagging memory that holding
the superblock buffer locked across a synchronous transaction could
deadlock the log force in some way. I can't find a reference to that
situation anywhere - maybe it wasn't a real issue, or someone else
remembers that situation better than I do....

> I'd kinda assume we run
> into the intents next, but maybe we don't.

I don't think intents are the problem because they are run after the
superblock and perags are updated.

> > > +	xfs_sb_from_disk(&mp->m_sb, dsb);
> > > +	if (mp->m_sb.sb_agcount < old_agcount) {
> > > +		xfs_alert(mp, "Shrinking AG count in log recovery");
> > > +		return -EFSCORRUPTED;
> > > +	}
> > > +	mp->m_features |= xfs_sb_version_to_features(&mp->m_sb);
> >
> > I'm not sure this is safe. The item order in the checkpoint recovery
> > isn't guaranteed to be exactly the same as when feature bits are
> > modified at runtime. Hence there could be items in the checkpoint
> > that haven't yet been recovered that are dependent on the original
> > sb feature mask being present. It may be OK to do this at the end
> > of the checkpoint being recovered.
> >
> > I'm also not sure why this feature update code is being changed
> > because it's not mentioned at all in the commit message.
>
> Mostly to keep the features in sync with the in-memory sb fields
> updated above. I'll switch to keep this as-is, but I fail to see how
> updating features only after the entire recovery is done will be safe
> for all cases either.
>
> Where would we depend on the old feature setting?

One possibility is a similar reordering race to what I describe above;
another is simply that the superblock feature update is not serialised
at runtime with anything that checks that feature. Hence the order of
modifications in the journal may not reflect the order in which the
checks and modifications were done at runtime....

Hence I'm not convinced that it is safe to update superblock state in
the middle of a checkpoint - between checkpoints (i.e. after
checkpoint commit record processing) can be made safe (as per above),
but the relogging of items in the CIL makes mid-checkpoint updates
somewhat ... problematic.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx