On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > Ok, so we don't want geometry change transactions in the same
> > > > > > CIL checkpoint as alloc related transactions that might depend
> > > > > > on the geometry changes. That seems reasonable, and on a first
> > > > > > pass I have an idea of what this is doing, but the description
> > > > > > is kind of vague. Obviously this fixes an issue on the recovery
> > > > > > side (since I've tested it), but it's not quite clear to me from
> > > > > > the description and/or logic changes how that issue manifests.
> > > > > >
> > > > > > Could you elaborate please? For example, is this some kind of
> > > > > > race between an allocation request and a growfs transaction,
> > > > > > where the former perhaps sees a newly added AG between the time
> > > > > > the growfs transaction commits (applying the sb deltas) and the
> > > > > > time it actually commits to the log due to being a sync
> > > > > > transaction, thus allowing an alloc on a new AG into the same
> > > > > > checkpoint that adds the AG?
> > > > >
> > > > > This is based on the feedback by Dave on the previous version:
> > > > >
> > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@xxxxxxxxxxxxxxxxxxx/
> > > >
> > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but
> > > > I'm not sure I'd go straight to this change given the situation...
> > > >
> > > > > Just doing the perag/in-core sb updates earlier fixes all the
> > > > > issues with my test case, so I'm not actually sure how to get
> > > > > more updates into the same checkpoint. I'll try your exercisers
> > > > > to see if they can hit that.
> > > >
> > > > Ok, that explains things a bit. My observation is that the first 5
> > > > patches or so address the mount failure problem, but from there I'm
> > > > not reproducing much difference with or without the final patch.
> > >
> > > Does this change to flush the log after committing the new sb fix the
> > > recovery problems on older kernels? I /think/ that's the point of
> > > this patch.
> >
> > I don't follow.. growfs always forced the log via the sync
> > transaction, right? Or do you mean something else by "change to flush
> > the log?"
>
> I guess I was typing a bit too fast this morning -- "change to flush the
> log to disk before anyone else can get their hands on the superblock".
> You're right that xfs_log_sb and data-device growfs already do that.
>
> That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> that this patch fixes, right?
>

Ah, Ok.. that makes sense. Sounds like it could be..

> > I thought the main functional change of this patch was to hold the
> > superblock buffer locked across the force so nothing else can relog
> > the new geometry superblock buffer in the same CIL checkpoint.
> > Presumably, the theory is that this prevents recovery from seeing
> > updates to other buffers that depend on the geometry update before the
> > actual sb geometry update is recovered (because the latter might have
> > been relogged).
> >
> > Maybe we're saying the same thing..? Or maybe I just misunderstand.
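
FWIW, the pattern I have in mind there is roughly the following. Just a
sketch, not the actual patch; the function name is made up, the
signatures are from memory and error handling is trimmed:

static int
xfs_growfs_log_sb_sync(
        struct xfs_mount        *mp)
{
        struct xfs_trans        *tp;
        struct xfs_buf          *bp;
        int                     error;

        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata, 0, 0, 0, &tp);
        if (error)
                return error;

        /* ... apply the geometry deltas to the in-core sb here ... */

        /* log the superblock from the updated in-core values */
        xfs_log_sb(tp);

        /*
         * Hold the sb buffer locked across the synchronous commit so
         * nothing else can relog it into the same CIL checkpoint before
         * the geometry change is on disk.
         */
        bp = xfs_trans_getsb(tp);
        xfs_trans_bhold(tp, bp);

        xfs_trans_set_sync(tp);
        error = xfs_trans_commit(tp);   /* waits for the log force */

        xfs_buf_relse(bp);              /* only now can others relog the sb */
        return error;
}
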
> > Either way I think the patch could use a more detailed commit log...
>
> <nod> The commit message should point out that we're fixing a real bug
> here, which is that growfsrt doesn't force the log to disk when it
> commits the new rt geometry.
>

Maybe even make it a separate patch to pull apart some of these cleanups
from fixes. I was also wondering if the whole locking change is the moral
equivalent of locking the sb across the growfs trans (i.e. trans_getsb()
+ trans_bhold()), at which point maybe that would be a reasonable
incremental patch too.

> > > > Either way, I see aborts and splats all over the place, which
> > > > implies at minimum this isn't the only issue here.
> > >
> > > Ugh. I've recently noticed the long soak logrecovery test vm has
> > > seen a slight uptick in failure rates -- random blocks that have
> > > clearly had garbage written to them such that recovery tries to read
> > > the block to recover a buffer log item and kaboom. At this point
> > > it's unclear if that's a problem with xfs or somewhere else. :(
> > >
> > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > particular patch has no straightforward way to test that it fixes
> > > > something and at the same time doesn't break anything else, and
> > > > 3. we do
> > >
> > > I'm curious, what might break? Was that merely a general comment, or
> > > do you have something specific in mind? (iows: do you see more
> > > string to pull? :))
> >
> > Just a general comment..
> >
> > Something related that isn't totally clear to me is what about the
> > inverse shrink situation where dblocks is reduced. I.e., is there some
> > similar scenario where, for example, instead of the sb buffer being
> > relogged past some other buffer update that depends on it, some other
> > change is relogged past a sb update that invalidates/removes blocks
> > referenced by the relogged buffer..? If so, does that imply a shrink
> > should flush the log before the shrink transaction commits to ensure
> > it lands in a new checkpoint (as opposed to ensuring followon updates
> > land in a new checkpoint)..?
>
> I think so. Might we want to do that before and after to be careful?
>

Yeah, maybe. I'm not quite sure if even that's enough. I.e., assuming we
had a log preflush to flush out already committed changes before the
grow, I don't think anything really prevents another "problematic"
transaction from committing after that preflush.

I dunno.. on one hand it does seem like an unlikely thing due to the
nature of needing space to be free in order to shrink in the first place,
but OTOH if you have something like grow that is rare, not performance
sensitive, has a history of not being well tested, and has these subtle
ordering requirements that can be affected indirectly by changes to other
transactions, ISTM it could be a wise engineering decision to simplify to
the degree possible and find the most basic model that enforces
predictable ordering.

So as a hacky thought/example, suppose we defined a transaction mode that
basically implemented an exclusive checkpoint requirement (i.e. this
transaction owns an entire checkpoint, nothing else is allowed in the CIL
concurrently). Presumably that would ensure everything before the grow
would flush out to disk in one checkpoint, everything concurrent would
block on synchronous commit of the grow trans (before the new geometry is
exposed), and then after that point everything pending would drain into
another checkpoint.
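
To make that handwaving slightly more concrete, something along these
lines is what I'm picturing at the commit level. Strawman only: the
xfs_cil_excl_begin/end() helpers don't exist and the real CIL plumbing
would obviously be more involved:

static int
xfs_trans_commit_excl(
        struct xfs_trans        *tp)
{
        struct xfs_mount        *mp = tp->t_mountp;
        int                     error;

        /*
         * Hypothetical gate: block new CIL insertions (while still
         * admitting this transaction).
         */
        xfs_cil_excl_begin(mp);

        /* push already-committed items out in their own checkpoint(s) */
        xfs_log_force(mp, XFS_LOG_SYNC);

        /* this transaction gets a checkpoint to itself, on disk */
        xfs_trans_set_sync(tp);
        error = xfs_trans_commit(tp);

        /* reopen the CIL; blocked transactions drain into later checkpoints */
        xfs_cil_excl_end(mp);
        return error;
}
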
It kind of sounds like overkill, but really if it could be implemented
simply enough then we wouldn't have to think too hard about auditing all
the other relog scenarios. I'd probably want to see at least some
reproducer for this sort of problem to prove the theory, though, even if
it required debug instrumentation or something. Hm?

> > Anyways, my point is just that if it were me I wouldn't get too deep
> > into this until some of the reproducible growfs recovery issues are at
> > least characterized and testing is more sorted out.
> >
> > The context for testing is here [1]. The TLDR is basically that
> > Christoph has a targeted test that reproduces the initial mount
> > failure and I hacked up a more general test that also reproduces it
> > plus additional growfs recovery problems. This test does seem to
> > confirm that the previous patches address the mount failure issue, but
> > this patch doesn't seem to prevent any of the other problems produced
> > by the generic test. That might just mean the test doesn't reproduce
> > what this fixes, but it's kind of hard to even regression test
> > something like this when basic growfs crash-recovery seems pretty much
> > broken.
>
> Hmm, if you make a variant of that test which formats with an rt device
> and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> similar blowups? Let's see what happens if I do that...
>

Heh, sounds like so from your followup. Fun times.

I guess that test should probably work its way upstream. I made some
tweaks locally since it was last posted to try and make it a little more
aggressive, but it didn't repro anything new, so I'm not sure how much
difference it really makes. Do we want a separate version like yours for
the rt case, or would you expect to cover both cases in a single test?

Brian

> --D
>
> > Brian
> >
> > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> >
> > > > have at least one fairly straightforward growfs/recovery test in
> > > > the works that reliably explodes, personally I'd suggest splitting
> > > > this work off into separate series.
> > > >
> > > > It seems reasonable enough to me to get patches 1-5 in asap once
> > > > they're fully cleaned up, and then leave the next two as part of a
> > > > followon series pending further investigation into these other
> > > > issues. As part of that I'd like to know whether the recovery test
> > > > reproduces (or can be made to reproduce) the issue this patch
> > > > presumably fixes, but I'd also settle for "the grow recovery test
> > > > now passes reliably and this doesn't regress it." But once again,
> > > > just my .02.
> > >
> > > Yeah, it's too bad there's no good way to test recovery with older
> > > kernels either. :(
> > >
> > > --D
> > >
> > > > Brian