Re: [PATCH] xfs: don't reuse busy extents on extent trim

Brian Foster <bfoster@xxxxxxxxxx> · Tue, 31 May 2022 11:55:49 -0400

On Sat, May 28, 2022 at 05:23:19PM +0300, Amir Goldstein wrote:
> On Thu, May 26, 2022 at 10:47 PM Brian Foster <bfoster@xxxxxxxxxx> wrote:
> >
> > On Thu, May 26, 2022 at 06:28:23PM +0300, Amir Goldstein wrote:
> > > > > Hi Brian,
> > > > >
> > > > > This patch was one of my selected fixes to backport for v5.10.y.
> > > > > It has a very scary looking commit message and the change seems
> > > > > to be independent of any infrastructure changes(?).
> > > > >
> > > > > The problem is that applying this patch to v5.10.y reliably reproduces
> > > > > this buffer corruption assertion [*] with test xfs/076.
> > > > >
> > > > > This happens on the kdevops system that is using loop devices over
> > > > > sparse files inside qemu images. It does not reproduce on my small
> > > > > VM at home.
> > > > >
> > > > > Normally, I would just drop this patch from the stable candidates queue
> > > > > and move on, but I thought you might be interested to investigate this
> > > > > reliable reproducer, because maybe this system exercises an error
> > > > > that is otherwise rare to hit.
> > > > >
> > > > > It seemed weird to me that NOT reusing the extent would result in
> > > > > data corruption, but it could indicate that reusing the extent was masking
> > > > > the assertion and hiding another bug(?).
> > > > >
> > > >
> > > > Indeed, this does seem like an odd failure. The shutdown on transaction
> > > > cancel implies cancellation of a dirty transaction. This is not
> > > > necessarily corruption as much as just being the generic
> > > > naming/messaging related to shutdowns due to unexpected in-core state.
> > > > The patch in question removes some modifications to in-core busy extent
> > > > state during extent allocation that are fundamentally unsafe in
> > > > combination with how allocation works. This change doesn't appear to
> > > > affect any transaction directly, so the correlation may be indirect.
> > > >
> > > > xfs/076 looks like it's a sparse inode allocation test, which certainly
> > > > seems relevant in that it is stressing the ability to allocate inode
> > > > chunks under free space fragmentation. If this patch further restricts
> > > > extent allocation by removing availability of some set of (recently
> > > > freed, busy) extents, then perhaps there is some allocation failure
> > > > sequence that was previously unlikely enough to mask some poor error
> > > > handling logic or transaction handling (like an agfl fixup dirtying a
> > > > transaction followed by an allocation failure, for example) that we're
> > > > now running into.
> > > >
> > > > > Can you think of another reason to explain the regression this fix
> > > > > introduces to 5.10.y?
> > > > >
> > > >
> > > > Not off the top of my head. Something along the lines of the above seems
> > > > plausible, but that's just speculation at this point.
> > > >
> > > > > Do you care to investigate this failure or shall I just move on?
> > > > >
> > > >
> > > > I think it would be good to understand whether there's a regression
> > > > introduced by this patch, a bug somewhere else or just some impedence
> > > > mismatch in logic between the combination of this change and whatever
> > > > else happens to be in v5.10.y. Unfortunately I'm not able to reproduce
> > > > if I pull just this commit back into latest 5.10.y (5.10.118). I've
> > > > tried with a traditional bdev as well as a preallocated and sparse
> > > > loopback scratch dev.
> > >
> > > I also failed to reproduce it on another VM, but it reproduces reliably
> > > on this system. That's why I thought we'd better use this opportunity.
> > > This system has lots of RAM and disk to spare so I have no problem
> > > running this test in a VM in parallel to my work.
> > >
> > > It is not actually my system, it's a system that Luis has setup for
> > > stable XFS testing and gave me access to, so if the need arises
> > > you could get direct access to the system, but for now, I have no
> > > problem running the test for you.
> > >
> > > > Have you tested this patch (backport) in isolation
> > > > in your reproducer env or only in combination with other pending
> > > > backports?
> > > >
> > >
> > > I tested it on top of 5.10.109 + these 5 patches:
> > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1
> > >
> > > I can test it in isolation if you like. Let me know if there are
> > > other forensics that you would like me to collect.
> > >
> >
> > Hm. Still no luck if I move to .109 and pull in those few patches. I
> > assume there's nothing else potentially interesting about the test env
> > other than the sparse file scratch dev (i.e., default mkfs options,
> > etc.)? If so and you can reliably reproduce, I suppose it couldn't hurt
> > to try and grab a tracepoint dump of the test when it fails (feel free
> > to send directly or upload somewhere as the list may punt it, and please
> > also include the dmesg output that goes along with it) and I can see if
> > that shows anything helpful.
> >
> > I think what we want to know initially is what error code we're
> > producing (-ENOSPC?) and where it originates, and from there we can
> > probably work out how the transaction might be dirty. I'm not sure a
> > trace dump will express that conclusively. If you wanted to increase the
> > odds of getting some useful information it might be helpful to stick a
> > few trace_printk() calls in the various trans cancel error paths out of
> > xfs_create() to determine whether it's the inode allocation attempt that
> > fails or the subsequent attempt to create the directory entry..
> >
> 
> The error (-ENOSPC) comes from this v5.10 code in xfs_dir_ialloc():
> 
>         if (!ialloc_context && !ip) {
>                 *ipp = NULL;
>                 return -ENOSPC;
>         }
> 
> Which theoretically might trip after xfs_ialloc() has marked the transaction
> dirty(?).
> 

Yeah, I think the first part that might have dirtied the transaction at
this point is fixing up the AGFL on a block allocation attempt. Ideally
the AG selection code would prevent taking this step for an AG that
can't satisfy an allocation, but realistically I'm not sure this
approach will ever work perfectly unless it separates out the extent
search algorithm from the prospect of dirtying the transaction and thus
committing to the allocation. This may not be trivial because the AGFL
fixup can require extent allocation itself.

It's not terribly surprising that limiting reuse of busy extents could
increase the likelihood of allocation failure, though I think most cases
should flush busy extents and retry. I wonder a bit whether changes to
the near mode allocation algorithm may have introduced a potential
regression in that regard.

> This specific code is gone with this cleanup series in v5.11:
> 
> https://lore.kernel.org/linux-xfs/20201209112820.114863-1-hsiangkao@xxxxxxxxxx/
> 
> When the $SUBJECT patch is applied to v5.11.16 the test xfs/076 does not fail.
> 
> So either the $SUBJECT patch (from 5.12) is incompatible with v5.10 code
> or the cleanup series somehow managed to make my system not reproduce
> the bug anymore.
> 

I thought this was mostly refactoring and reworking the ugly retry
pattern in the former code vs. major functional changes. I think the
fundamental prospect of allocation failure still exists in current code,
but clearly your tests demonstrate some practical difference. Perhaps
there are some subtle logic changes in that rework that help prevent
this problem.

> I will assume the former and drop this patch from v5.10.y candidates.
> If you want me to continue to research the bug on v5.10 let me know
> what else you want me to check.
> 

I think that's reasonable. Realistically the bug fixed by this patch is
so old and long standing (v3.0?) I don't think it's terribly important
to pull it from v5.12 to v5.10.y unless real users started hitting it.
My curiosity around the cause is moreso to identify whether there's a
bug in mainline that needs fixing..

Brian

> Thanks,
> Amir.
>