On Wed, Sep 18, 2024 at 08:11:05AM +0200, Christoph Hellwig wrote:
> On Mon, Sep 16, 2024 at 11:28:26AM +1000, Dave Chinner wrote:
> > I'm missing something - the intents aren't processed until the log
> > has been recovered - queuing an intent to be processed does
> > not require the per-ag to be present. We don't take per-ag
> > references until we are recovering the intent. i.e. we've completed
> > journal recovery and haven't found the corresponding EFD.
> >
> > That leaves the EFI in the log->r_dfops, and we then run
> > ->recover_work in the second phase of recovery. It is
> > xfs_extent_free_recover_work() that creates the
> > new transaction and runs the EFI processing that requires the
> > perag references, isn't it?
> >
> > IOWs, I don't see where the initial EFI/EFD recovery during the
> > checkpoint processing requires the newly created perags to be
> > present in memory for processing incomplete EFIs before the journal
> > recovery phase has completed.
>
> So my new test actually blows up before creating intents:
>
> [   81.695529] XFS (nvme1n1): Mounting V5 Filesystem 07057234-4bec-4f17-97c5-420c71c83292
> [   81.704541] XFS (nvme1n1): Starting recovery (logdev: internal)
> [   81.707260] XFS (nvme1n1): xfs_buf_map_verify: daddr 0x40003 out of range, EOFS 0x40000
> [   81.707974] ------------[ cut here ]------------
> [   81.708376] WARNING: CPU: 1 PID: 5004 at fs/xfs/xfs_buf.c:553 xfs_buf_get_map+0x8b4/0xb70
>
> Because sb_dblocks hasn't been updated yet.

Hmmmmm. Ok, I can see how this would be an issue, but it's not the
issue the commit message describes :)

....

Oh, this is a -much worse- problem than I thought.

This is a replay of a modification to a new AGFL, and that *must*
only be replayed after the superblock modifications have been
replayed.

Ideally, we should not be using the new AGs until *after* the growfs
transaction has hit stable storage (i.e. the journal has fully
committed the growfs transaction), not just committed to the CIL.

If anything can access the new AGs beyond the old EOFS *before* the
growfs transaction is stable in the journal, then we have a nasty set
of race conditions where we can be making modifications to metadata
that is beyond EOFS in the journal *before* we've replayed the
superblock growfs modification. For example:

growfs task                     allocating task                 log worker
xfs_trans_set_sync
xfs_trans_commit
  xlog_cil_commit
    <items added to CIL>
    xfs_trans_unreserve_and_mod_sb
      mp->m_sb.sb_agcount += tp->t_agcount_delta;
    ->iop_committing()
      xfs_buf_item_committing
        xfs_buf_item_release
        <superblock buffer unlocked>
  <preempt>
                                xfs_bmap_btalloc
                                  xfs_bmap_btalloc_best_length
                                    for_each_perag_wrap(...)
                                    <sees new, uncommitted mp->m_sb.sb_agcount>
                                    <selects new AG beyond EOFS in journal>
                                <does allocation in new AG beyond EOFS in journal>
                                xfs_trans_commit()
                                  xlog_cil_commit()
                                    <items added to CIL>
                                ....
                                                                <log state in XLOG_STATE_COVER_NEED>
                                                                xfs_sync_sb(wait)
                                                                  locks sb buffer
                                                                  xfs_trans_set_sync
                                                                  xfs_trans_commit
                                                                    xlog_cil_commit
                                                                      <sb already in CIL>
                                                                      <updates sb item order id>
                                                                  xfs_log_force(SYNC)
                                                                    <pushes CIL>
  <runs again>
  xfs_log_force(SYNC)
    <pushes CIL>

This growfs vs allocating task race results in a checkpoint in the
journal where the new AGs are modified in the same checkpoint as the
growfs transaction. This will work if the superblock item is placed in
the journal before the items for the new AGs with your new code.

However...

....

when we add in the log worker (or some other transaction) updating the
superblock again before the CIL is pushed, we now have a reordering
problem. The CIL push will order all the items in the CIL according to
the order id attached to the log item, and this causes the superblock
item to be placed -after- the new AG items.

That will result in the exact recovery error you quoted above,
regardless of the fixes in this series.
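To make the reordering concrete, here's a trivial stand-alone model of
the push-time ordering rule described above. This is not the CIL code -
the item names, the numeric order ids and the relog step are all made
up for illustration:

/* Toy model of CIL push ordering - NOT the kernel code. */
#include <stdio.h>
#include <stdlib.h>

struct toy_log_item {
	const char	*name;
	unsigned long	order_id;	/* assigned when the item is (re)committed */
};

static int cmp_order_id(const void *a, const void *b)
{
	const struct toy_log_item *x = a, *y = b;

	return (x->order_id > y->order_id) - (x->order_id < y->order_id);
}

int main(void)
{
	struct toy_log_item cil[] = {
		{ "superblock",               1 },	/* logged by the growfs commit */
		{ "new AGF/AGFL/AGI buffers", 2 },	/* logged by the allocating task */
	};

	/*
	 * The log worker relogs the superblock before the CIL is pushed,
	 * so the sb item already sitting in the CIL picks up a newer
	 * order id...
	 */
	cil[0].order_id = 3;

	/* ...and the push-time sort now places it after the new AG items. */
	qsort(cil, 2, sizeof(cil[0]), cmp_order_id);
	printf("checkpoint item order: %s, then %s\n", cil[0].name, cil[1].name);
	return 0;
}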
<thinks about how to fix it>

I think the first step in avoiding this is ensuring we can't relog the
superblock until the growfs transaction is on disk. We can do that by
explicitly grabbing the superblock buffer and holding it before we
commit the transaction so the superblock buffer remains locked until
xfs_trans_commit() returns. That will prevent races with other sb
updates by blocking them on the sb buffer lock.

The second step is preventing allocations that are running from seeing
the mp->m_sb.sb_agcount update until after the transaction is stable.

Ok, xfs_trans_mod_sb(XFS_TRANS_SB_AGCOUNT) is only used by the growfs
code and the first step above means we have explicitly locked the sb
buffer. Hence the whole xfs_trans_mod_sb() -> delta in trans ->
xfs_trans_commit -> xfs_trans_apply_sb_deltas() -> lock sb buffer,
modify sb and log it -> xlog_cil_commit() song and dance routine can
go away.

i.e. We can modify the superblock buffer directly in the growfs code,
and then when xfs_trans_commit() returns we directly update the
in-memory superblock before unlocking the superblock buffer.
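Something like this, very roughly. This is a sketch only, not a tested
patch: the helper name is invented, I'm going from memory on the
existing helpers, only the two geometry fields are shown, and error
handling and the secondary superblocks are ignored:

static int
xfs_growfs_commit_sketch(
	struct xfs_mount	*mp,
	struct xfs_trans	*tp,
	xfs_agnumber_t		new_agcount,
	xfs_rfsblock_t		new_dblocks)
{
	struct xfs_dsb		*dsb;
	struct xfs_buf		*bp;
	int			error;

	/* Grab the sb buffer and keep it locked across the commit. */
	bp = xfs_trans_getsb(tp);
	xfs_trans_bhold(tp, bp);
	dsb = bp->b_addr;

	/*
	 * Write the new geometry straight into the on-disk superblock
	 * image and log the buffer - no xfs_trans_mod_sb() deltas, no
	 * xfs_trans_apply_sb_deltas() at commit time.
	 */
	dsb->sb_agcount = cpu_to_be32(new_agcount);
	dsb->sb_dblocks = cpu_to_be64(new_dblocks);
	xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);

	xfs_trans_set_sync(tp);
	error = xfs_trans_commit(tp);

	/*
	 * Only now, with the growfs checkpoint stable in the journal,
	 * expose the new geometry in the in-core superblock, then drop
	 * the held sb buffer so it can be relogged again.
	 */
	if (!error) {
		mp->m_sb.sb_agcount = new_agcount;
		mp->m_sb.sb_dblocks = new_dblocks;
	}
	xfs_buf_relse(bp);
	return error;
}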
This means that nothing can update the superblock before the growfs
transaction is stable in the journal, and nothing at runtime will see
that there are new AGs or space available until after the on disk
state has changed.

This then allows recovery to be able to update the superblock and
perag state after checkpoint processing is complete. All future
checkpoint recoveries will now be guaranteed to see the correct
superblock state, whilst the checkpoint the growfs update is in is
guaranteed to only be dependent on the old superblock state...

I suspect that we might have to move all superblock updates to this
model - we don't update the in-memory state until after the on-disk
update is on stable storage in the journal - that will avoid the
possibility of a racing/reordered feature bit update being replayed,
too.

The only issue with this is that I have a nagging memory that holding
the superblock buffer locked across a synchronous transaction could
deadlock the log force in some way. I can't find a reference to that
situation anywhere - maybe it wasn't a real issue, or someone else
remembers that situation better than I do....

> I'd kinda assume we run
> into the intents next, but maybe we don't.

I don't think intents are the problem because they are run after the
superblock and perags are updated.

> > > +	xfs_sb_from_disk(&mp->m_sb, dsb);
> > > +	if (mp->m_sb.sb_agcount < old_agcount) {
> > > +		xfs_alert(mp, "Shrinking AG count in log recovery");
> > > +		return -EFSCORRUPTED;
> > > +	}
> > > +	mp->m_features |= xfs_sb_version_to_features(&mp->m_sb);
> >
> > I'm not sure this is safe. The item order in the checkpoint recovery
> > isn't guaranteed to be exactly the same as when feature bits are
> > modified at runtime. Hence there could be items in the checkpoint
> > that haven't yet been recovered that are dependent on the original
> > sb feature mask being present. It may be OK to do this at the end
> > of the checkpoint being recovered.
> >
> > I'm also not sure why this feature update code is being changed
> > because it's not mentioned at all in the commit message.
>
> Mostly to keep the features in sync with the in-memory sb fields
> updated above. I'll switch to keep this as-is, but I fail to see how
> updating features only after the entire recovery is done will be safe
> for all cases either.
>
> Where would we depend on the old feature setting?

One possibility is a similar reordering race to what I describe above;
another is simply that the superblock feature update is not serialised
at runtime with anything that checks that feature. Hence the order of
modifications in the journal may not reflect the order in which the
checks and modifications were done at runtime....

Hence I'm not convinced that it is safe to update superblock state in
the middle of a checkpoint - between checkpoints (i.e. after
checkpoint commit record processing) can be made safe (as per above),
but the relogging of items in the CIL makes mid-checkpoint updates
somewhat ... problematic.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx