On Sat, Jun 19, 2021 at 01:22:49PM -0700, Darrick J. Wong wrote: > On Sat, Jun 19, 2021 at 08:48:30AM +1000, Dave Chinner wrote: > > On Thu, Jun 17, 2021 at 06:26:09PM +1000, Dave Chinner wrote: > > > Hi folks, > > > > > > This is followup from the first set of log fixes for for-next that > > > were posted here: > > > > > > https://lore.kernel.org/linux-xfs/20210615175719.GD158209@locust/T/#mde2cf0bb7d2ac369815a7e9371f0303efc89f51b > > > > > > The first two patches of this series are updates for those patches, > > > change log below. The rest is the fix for the bigger issue we > > > uncovered in investigating the generic/019 failures, being that > > > we're triggering a zero-day bug in the way log recovery assigns LSNs > > > to checkpoints. > > > > > > The "simple" fix of using the same ordering code as the commit > > > record for the start records in the CIL push turned into a lot of > > > patches once I started cleaning it up, separating out all the > > > different bits and finally realising all the things I needed to > > > change to avoid unintentional logic/behavioural changes. Hence > > > there's some code movement, some factoring, API changes to > > > xlog_write(), changing where we attach callbacks to commit iclogs so > > > they remain correctly ordered if there are multiple commit records > > > in the one iclog and then, finally, strictly ordering the start > > > records.... > > > > > > The original "simple fix" I tested last night ran almost a thousand > > > cycles of generic/019 without a log hang or recovery failure of any > > > kind. The refactored patchset has run a couple hundred cycles of > > > g/019 and g/475 over the last few hours without a failure, so I'm > > > posting this so we can get a review iteration done while I sleep so > > > we can - hopefully - get this sorted out before the end of the week. > > > > Update on this so people know what's happening. > > > > Yesterday I found another zero-day bug in the CIL code that triggers > > when a shutdown occurs. > > > > The shutdown processing runs asynchronously and without caring about > > the current state or users of the iclogs. SO when it runs > > xlog_state_do_callbacks() after changing the state of all iclogs to > > XLOG_STATE_IOERROR, it runs the callbacks on all the iclogs and > > frees everything associated with them. > > > > That includes the CIL context structure that xlog_cil_push_now() is > > still working on because it has a referenced iclog that it hasn't > > yet released. > > > > Hence the initial CIL commit that stamps the CIL context with the > > commit lsn -after- it has attached the context to the commit_iclog > > callback list can race with shutdown. This results in a UAF > > situation and an 8 byte memory corruption when we stamp the LSN into > > the context. > > > > The current for-next tree does *much more* with the context after > > the callbacks are attached, which opens up this UAF to both reads > > and writes of free memory. The fix in patch 2, which adds a sleep on > > the previous iclog after attaching the callbacks to the commit iclog > > opens this window even futher. > > > > ANd then the start record ordering patch set moves the commit iclog > > into CIL context structure which we dereference after waiting on the > > previous iclog means we are dereferencing pointers freed memory. > > > > So, basically, before any of these fixes can go forwards, I first > > need to fix the pre-existing CIL push/shutdown race. > > > > And then, after I've rebased all these fixes on that fix and we're > > back to square one and before we do anything else in the log, we > > need to fix the mess that is caused by unco-ordinated shutdown > > changing iclog state and running completions while we still have > > active references to the iclogs and are preparing the iclog for IO. > > XLOG_STATE_IOERROR must be considered harmful at this point in time. > > This puts me in a difficult spot. We're past -rc6, which means that > Linus could tag 5.13.0 tomorrow, and if he does that, whatever's in > for-next needs to have had at least a few days to soak before Linus will > want to pull it upstream. > > Or this could be yet another one of those crazy kernels that goes all > the way to -rc8, in which case there's still time to make small > adjustments. But who knows, I have no schedule visibility. > > However, this doesn't sound like small adjustments. I think it's best > that I withdraw the CIL changes from for-next until we have more time to > fix these issues and make sure that there aren't any bugs that are > easily found by developers. I feel confident enough about everything > between "xfs: log stripe roundoff is a property of the log" and > "xfs: xfs_log_force_lsn isn't passed a LSN" to keep them in for-next. Yup, that's a fair call. I was going to ask you to do this anyway this morning (Monday) because I haven't been able to come up with a magic bullet that fixes everything and makes it all better over the weekend. I'll start a new branch that fixes the UAF bug and the start record ordering, and then rebase the CIL/log scalability patchset on top of that. I'll also pull Christoph's cleanups for the new xlog_write() code on top of that, too. Oh, well, good thing I hadn't deleted the merged branches yet.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx