On Tue, Feb 23, 2021 at 07:31:06AM -0500, Brian Foster wrote: > On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote: > > On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote: > > > Freed extents are marked busy from the point the freeing transaction > > > commits until the associated CIL context is checkpointed to the log. > > > This prevents reuse and overwrite of recently freed blocks before > > > the changes are committed to disk, which can lead to corruption > > > after a crash. The exception to this rule is that metadata > > > allocation is allowed to reuse busy extents because metadata changes > > > are also logged. > > > > > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS > > > has allowed modification or complete invalidation of outstanding > > > busy extents for metadata allocations. This implementation assumes > > > that use of the associated extent is imminent, which is not always > > > the case. For example, the trimmed extent might not satisfy the > > > minimum length of the allocation request, or the allocation > > > algorithm might be involved in a search for the optimal result based > > > on locality. > > > > > > generic/019 reproduces a corruption caused by this scenario. First, > > > a metadata block (usually a bmbt or symlink block) is freed from an > > > inode. A subsequent bmbt split on an unrelated inode attempts a near > > > mode allocation request that invalidates the busy block during the > > > search, but does not ultimately allocate it. Due to the busy state > > > invalidation, the block is no longer considered busy to subsequent > > > allocation. A direct I/O write request immediately allocates the > > > block and writes to it. > > > > I really hope there's a fstest case coming for this... :) > > > > generic/019? :) I'm not sure of a good way to reproduce on demand given > the conditions required to reproduce. <nod> I guess you'd have to have a fs where extents take a long time to exit the busy tree, and then set up the allocations and frees just right. FWIW I've never hit this in generic/019. > > > Finally, the filesystem crashes while in a > > > state where the initial metadata block free had not committed to the > > > on-disk log. After recovery, the original metadata block is in its > > > original location as expected, but has been corrupted by the > > > aforementioned dio. > > > > Wheee! > > > > Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go > > find a freespace record, call xfs_extent_busy_trim (which could erase > > the busy extent entry), decide that it's not interested after all, and > > bail out without restoring the busy entry. > > > > Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as > > we wander around the free space btrees looking for a good chunk of > > space... and doesn't restore the busy record if it decides to consider a > > different extent. > > > > Yep. I was originally curious whether the more recent allocator rework > introduced this problem somehow, but AFAICT that just refactored the > relevant allocator code and this bug has been latent in the existing > code for quite some time. That's not hugely surprising given the rare > combination of conditions required to reproduce. > > > So I guess this "speculatively remove busy records and forget to restore > > them" behavior opens the door to the write allocating blocks that aren't > > yet free and nonbusy, right? And the solution presented here is to > > avoid letting go of the busy record for the bmbt allocation, and if the > > btree split caller decides it really /must/ have that block for the bmbt > > it can force the log and try again, just like we do for a file data > > allocation? > > > > Yes, pretty much. The metadata allocation that is allowed to safely > reuse busy extents ends up invalidating a set of blocks during a NEAR > mode search (i.e. bmbt allocation), but ends up only using one of those > blocks. A data allocation immediately comes along next, finds one of the > other invalidated blocks and writes to it. A crash/recovery leaves the > invalidated busy block in its original metadata location having already > been written to by the dio. > > > Another solution could have been to restore the record if we decide not > > to go ahead with the allocation, but as we haven't yet committed to > > using the space, there's no sense in thrashing the busy records? > > > > That was my original thought as well. Then after looking through the > code a bit I thought that something like allowing the allocator to > "track" a reusable, but still busy extent until allocation is imminent > might be a bit more straightforward of an implementation given the > layering between the allocator and busy extent tracking code. IOW, we'd > split the busy trim/available and busy invalidate logic into two steps > instead of doing it immediately in the busy trim path. That would allow > the allocator to consider the same set of reusable busy blocks but not > commit to any of them until the allocation search is complete. > > However, either of those options require a bit of thought and rework > (and perhaps some value proposition justification for the complexity) > while the current trim reuse code is pretty much bolted on and broken. > Therefore, I think it's appropriate to fix the bug in one step and > follow up with a different implementation separately. <nod> I'm not sure that's even worth the effort... :) --D > > Assuming I got all that right, > > Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > Thanks. > > Brian > > > --D > > > > > > > This demonstrates that it is fundamentally unsafe to modify busy > > > extent state for extents that are not guaranteed to be allocated. > > > This applies to pretty much all of the code paths that currently > > > trim busy extents for one reason or another. Therefore to address > > > this problem, drop the reuse mechanism from the busy extent trim > > > path. This code already knows how to return partial non-busy ranges > > > of the targeted free extent and higher level code tracks the busy > > > state of the allocation attempt. If a block allocation fails where > > > one or more candidate extents is busy, we force the log and retry > > > the allocation. > > > > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> > > > --- > > > fs/xfs/xfs_extent_busy.c | 14 -------------- > > > 1 file changed, 14 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > > > index 3991e59cfd18..ef17c1f6db32 100644 > > > --- a/fs/xfs/xfs_extent_busy.c > > > +++ b/fs/xfs/xfs_extent_busy.c > > > @@ -344,7 +344,6 @@ xfs_extent_busy_trim( > > > ASSERT(*len > 0); > > > > > > spin_lock(&args->pag->pagb_lock); > > > -restart: > > > fbno = *bno; > > > flen = *len; > > > rbp = args->pag->pagb_tree.rb_node; > > > @@ -363,19 +362,6 @@ xfs_extent_busy_trim( > > > continue; > > > } > > > > > > - /* > > > - * If this is a metadata allocation, try to reuse the busy > > > - * extent instead of trimming the allocation. > > > - */ > > > - if (!(args->datatype & XFS_ALLOC_USERDATA) && > > > - !(busyp->flags & XFS_EXTENT_BUSY_DISCARDED)) { > > > - if (!xfs_extent_busy_update_extent(args->mp, args->pag, > > > - busyp, fbno, flen, > > > - false)) > > > - goto restart; > > > - continue; > > > - } > > > - > > > if (bbno <= fbno) { > > > /* start overlap */ > > > > > > -- > > > 2.26.2 > > > > > >