On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > Ok, so we don't want geometry change transactions in the same CIL
> > > checkpoint as alloc related transactions that might depend on the
> > > geometry changes. That seems reasonable and on a first pass I have an
> > > idea of what this is doing, but the description is kind of vague.
> > > Obviously this fixes an issue on the recovery side (since I've tested
> > > it), but it's not quite clear to me from the description and/or logic
> > > changes how that issue manifests.
> > >
> > > Could you elaborate please? For example, is this some kind of race
> > > situation between an allocation request and a growfs transaction, where
> > > the former perhaps sees a newly added AG between the time the growfs
> > > transaction commits (applying the sb deltas) and it actually commits to
> > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > AG into the same checkpoint that adds the AG?
> >
> > This is based on the feedback by Dave on the previous version:
> >
> > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@xxxxxxxxxxxxxxxxxxx/
> >
>
> Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> not sure I'd go straight to this change given the situation...
>
> > Just doing the perag/in-core sb updates earlier fixes all the issues
> > with my test case, so I'm not actually sure how to get more updates
> > into the check checkpoint. I'll try your exercisers if it could hit
> > that.
> >
>
> Ok, that explains things a bit. My observation is that the first 5
> patches or so address the mount failure problem, but from there I'm not
> reproducing much difference with or without the final patch.

Does this change to flush the log after committing the new sb fix the
recovery problems on older kernels? I /think/ that's the point of this
patch. (There's a sketch of what I mean at the end of this mail.)

> Either way,
> I see aborts and splats all over the place, which implies at minimum
> this isn't the only issue here.

Ugh. I've recently noticed the long soak logrecovery test vm has seen
a slight tick up in failure rates -- random blocks that have clearly
had garbage written to them such that recovery tries to read the block
to recover a buffer log item and kaboom. At this point it's unclear if
that's a problem with xfs or somewhere else. :(

> So given that 1. growfs recovery seems pretty much broken, 2. this
> particular patch has no straightforward way to test that it fixes
> something and at the same time doesn't break anything else, and 3. we do

I'm curious, what might break? Was that merely a general comment, or do
you have something specific in mind? (iows: do you see more string to
pull? :))

> have at least one fairly straightforward growfs/recovery test in the
> works that reliably explodes, personally I'd suggest to split this work
> off into separate series.
>
> It seems reasonable enough to me to get patches 1-5 in asap once they're
> fully cleaned up, and then leave the next two as part of a followon
> series pending further investigation into these other issues. As part of
> that I'd like to know whether the recovery test reproduces (or can be
> made to reproduce) the issue this patch presumably fixes, but I'd also
> settle for "the grow recovery test now passes reliably and this doesn't
> regress it." But once again, just my .02.
Yeah, it's too bad there's no good way to test recovery with older
kernels either. :(

--D

> Brian
>
>
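P.S. To make "flush the log after committing the new sb" concrete, here
is a minimal sketch of the shape of change I'm picturing -- hypothetical
code, not the actual patch under discussion; mp and tp are assumed names
for the mount and the growfs transaction:

	/* commit the growfs transaction that carries the sb geometry update */
	error = xfs_trans_commit(tp);
	if (error)
		return error;

	/*
	 * Force the CIL out to the on-disk log so the geometry change
	 * gets its own checkpoint, ahead of any alloc transactions that
	 * might touch the newly added AGs.
	 */
	error = xfs_log_force(mp, XFS_LOG_SYNC);
	if (error)
		return error;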