On Wed, Mar 18, 2020 at 05:34:29PM +0100, Christoph Hellwig wrote: > On Wed, Mar 18, 2020 at 10:48:25AM -0400, Brian Foster wrote: > > > do { > > > - if (xlog_state_iodone_process_iclog(log, iclog, > > > - &ioerror)) > > > + if (XLOG_FORCED_SHUTDOWN(log)) { > > > + xlog_state_do_iclog_callbacks(log, iclog); > > > + wake_up_all(&iclog->ic_force_wait); > > > + continue; > > > + } > > > + > > > > Why do we need to change the flow here? The current code looks like it > > proceeds with the _do_iclog_callbacks() call below.. > > > > As it is, I also don't think this reflects the comment above because if > > we catch the shutdown partway through a loop, the outer loop will > > execute one more time through. That doesn't look like a problem at a > > glance, but I think we should try to retain closer to existing behavior > > by folding the shutdown check into the ic_state check below as well as > > the outer loop conditional. > > True. I think we just need to clear cycled_icloglock in the > shutdown branch. I prefer that flow over falling through to the > main loop body as that clearly separates out the shutdown case. > Sure, but a shutdown can still happen at any point so this is just a duplicate branch to maintain. > > This (and the next patch) also raises the issue of whether to maintain > > state validity once the iclog ioerror state goes away. Currently we see > > the IOERROR state and kind of have free reign on busting through the > > normal runtime logic to clear out callbacks, etc. on the iclog > > regardless of what the pre-error state was. It certainly makes sense to > > continue to do that based on XLOG_FORCED_SHUTDOWN(), but the iclog state > > sort of provides a platform that allows us to do that because any > > particular context can see it and handle an iclog with care. With > > IOERROR replaced with the (potentially racy) log flag check, I think we > > should try to maintain the coherence of other states wherever possible. > > IOW, XLOG_FORCED_SHUTDOWN() means we can run callbacks and abort and > > whatnot, but we should probably still consider and update the iclog > > state as we progress (as opposed to leaving it in the DONE_SYNC state, > > for example) because there's no guarantee some other context will > > (always) behave just as it did with IOERROR. > > I actually very much disagree with that, and this series moves into > the other direction. We only really changes the states when > writing to iclogs, syncing them to disk, and I/O completion. And > all the paths just error out at a high level when the log is shut > down, so there is no need to move the state along. Faking state > changes when they don't correspond to the actual I/O just seems like > a really bad idea. > I think you're misreading me. I'm not suggesting to fake state changes. I'd argue that's actually what the special case shutdown branch does. And to the contrary, this patch already implements what I'm suggesting, it's just not consistent behavior.. First, we basically already go from whatever state we're in to "logical CALLBACK" during shutdown. This is just forcibly implemented via the IOERROR state. With IOERROR eventually removed, this highlights things like whether it's actually safe to make some of those arbitrary transitions. It's actually not, because going from WANT_SYNC -> CALLBACK is a potential use after free vector of the CIL ctx (as soon as the ctx is added to the callback list in the CIL push code). This is yet another functional problem that should be fixed before removing IOERROR, IMO (and is reproducible via kasan splat, btw). At this point I think some of these shutdown checks associated with CALLBACK are simply to ensure IOERROR remains persistent once it's set on an iclog. We don't need to carry that logic around if IOERROR is going away. SYNCING -> CALLBACK is another hokey transition in the existing code, even if it doesn't currently manifest in a bug that I can see, because we should probably still expect (wait for) an I/O completion despite that the filesystem had shutdown in the meantime. Fixing that one might require tweaks to how the shutdown code actually works (i.e. waiting on an I/O vs. running callbacks while in-flight). It's not immediately clear to me what the best solution is for that, but I suspect it could tie in with fixing the problem noted above. With regard to this patch, consider that shutdown can happen at any point and xlog_state_do_iclog_callbacks() cycles icloglock. That means that as of this patch, we actually can go from IOERROR -> DIRTY and possibly from DIRTY -> ACTIVE depending on where the iclog lies in the list. Removing IOERROR will subtley change that behavior yet again to make the latter transition potentially more likely. Note that I think that's probably fine. What I'm suggesting is to just drop the duplicate shutdown branch and instead lets evaluate whether some of the code these checks intend to avoid is really problematic. I don't think it is in the completion path since we're just resetting in-core headers and such. That means we could probably just let the iclogs fall through those state transitions naturally, reduce the number of shutdown checks splattered throughout the code and simplify the overall error handling logic in the process by handling all iclogs consistently during shutdown. > Also if you look at what state checks are left, the are all (except > for the debug check in xfs_log_unmount_verify_iclog) under > l_icloglock and guarded by a shutdown check. > That's not quite enough IMO. I think the whole IOERROR problem is not primarily a matter of mechanically factoring it out. It's fragile functionality that should be fixed/simplified first before there's any real value to removing IOERROR. Brian