Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL

On Thu, Jun 17, 2021 at 08:20:54AM +1000, Dave Chinner wrote:
> On Wed, Jun 16, 2021 at 10:57:29AM -0400, Brian Foster wrote:
> > On Wed, Jun 16, 2021 at 08:09:44AM +1000, Dave Chinner wrote:
> > > On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote:
> > > > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> > > So I went and looked at all the cases where we check
> > > ic_prev->ic_state to make decisions. There's only one place that
> > > does that: log forces. The sync transaction/log force aggregation
> > > code is the only thing that checks the state of the previous iclog,
> > > and it appears to avoid the "future iclog" problem in two ways.
> > > First, it only checks the previous iclog state when the current head
> > > iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if
> > > ic_prev is a future iclog, it only waits on the ic_write_wait queue,
> > > which means the log force gets woken when IO completion occurs and
> > > the iclog transitions to DONE_SYNC, hence it never waits for
> > > ordering of callbacks (other log force code does that).
> > > 
> > > IOWs, there's only one place that actually tries to use the previous
> > > iclog state for making decisions, and it doesn't really handle
> > > the case we need to handle here, which is determining if the previous
> > > iclog is a past or future iclog. It does, however, handle the case of
> > > incorrect waiting safely.
> > > 
> > > That is, what we need to wait for in the CIL push is the completion
> > > of the physical IO to be signalled, not for the callbacks associated
> > > with that iclog to run. IOWs, neat as it may be,
> > > xlog_wait_on_iclog() is not the right function to be using here. We
> > > should be waiting on the ic_write_wait queue to be woken when iclog
> > > IO completion occurs, not the ic_force_wait queue that waits on
> > > ordered completion of entire journal checkpoints. We just want to
> > > ensure that all previous iclogs are physically completed before we
> > > do the pre-flush on the commit record IO, we don't need software
> > > processing of those IOs to be complete.
> > > 
> > 
> > Ok, but doesn't that leave a gap in scenarios where iclog completion
> > occurs out of order? I thought the reason for waiting on the force of
> > the previous iclog was to ensure that the entire content of the
> > checkpoint has been written to disk, not just the content in the
> > immediately previous iclog.
> 
> Yes. We either have to wait for all the iclogs to be written or
> we have to use ic_force_wait for sequencing.
> 

Right, Ok. That wasn't stated clearly in your earlier comment.
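
For my own reference, here's a rough sketch of the distinction as I
understand it, using the existing xlog_wait() helper against the two
wait queues (not proposing actual code, just making sure I have the
semantics straight):

        /* both called with l_icloglock held; xlog_wait() drops the lock */

        /*
         * ic_write_wait: woken from iclog IO completion once the iclog has
         * physically hit the disk (DONE_SYNC), before any callbacks run, so
         * this only orders against the physical write.
         */
        xlog_wait(&iclog->ic_write_wait, &log->l_icloglock);

        /*
         * ic_force_wait: only woken once ordered callback processing has run
         * for this iclog and all older ones, which is what
         * xlog_wait_on_iclog() ends up waiting on.
         */
        xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);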

> > If the issue is past/future tense, is there any reason we can't filter
> > against the record header lsn of the previous iclog? E.g., assuming the
> > iclog hasn't been cleaned/reactivated (i.e. h_lsn != 0), then only block
> > on the previous iclog if XFS_LSN_CMP(prev_lsn, commit_lsn) <= 0?
> 
> That's what the patch I left running overnight does:
> 

Ok.

>         /*
>          * If the checkpoint spans multiple iclogs, wait for all previous iclogs
>          * to complete before we submit the commit_iclog. We can't use state
>          * checks for this - ACTIVE can be either a past completed iclog or a
>          * future iclog being filled, while WANT_SYNC through DONE_SYNC can be a
>          * past or future iclog awaiting IO or ordered IO completion to be run.
>          * In the latter case, if it's a future iclog and we wait on it, then we
>          * will hang because it won't get processed through to ic_force_wait
>          * wakeup until this commit_iclog is written to disk.  Hence we use the
>          * iclog header lsn and compare it to the commit lsn to determine if we
>          * need to wait on iclogs or not.
>          */
>         spin_lock(&log->l_icloglock);
>         if (ctx->start_lsn != ctx->commit_lsn) {
>                 struct xlog_in_core     *iclog;
> 
>                 for (iclog = commit_iclog->ic_prev;
>                      iclog != commit_iclog;
>                      iclog = iclog->ic_prev) {
>                         xfs_lsn_t       hlsn;
> 
>                         /*
>                          * If the LSN of the iclog is zero or in the future it
>                          * means it has passed through IO completion and
>                          * activation and hence all previous iclogs have also
>                          * done so. We do not need to wait at all in this case.
>                          */
>                         hlsn = be64_to_cpu(iclog->ic_header.h_lsn);
>                         if (!hlsn || XFS_LSN_CMP(hlsn, ctx->commit_lsn) > 0)
>                                 break;
> 
>                         /*
>                          * If the LSN of the iclog is older than the commit lsn,
>                          * we have to wait on it. Waiting on this via the
>                          * ic_force_wait should also order the completion of all
>                          * older iclogs, too, but we leave checking that to the
>                          * next loop iteration.
>                          */

What is the point of the loop? ISTM the functional change only requires
some minimal tweaking from the first patch you posted.
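
Something along these lines is roughly what I had in mind (an untested
sketch derived from your snippet above, with the same lock handling; it
only checks ic_prev and relies on ic_force_wait ordering to cover
anything older):

        spin_lock(&log->l_icloglock);
        if (ctx->start_lsn != ctx->commit_lsn) {
                struct xlog_in_core     *iclog = commit_iclog->ic_prev;
                xfs_lsn_t               hlsn;

                /*
                 * Only wait if ic_prev is a past iclog of this checkpoint. A
                 * zero or future header lsn means it (and everything older)
                 * has already passed through IO completion and activation.
                 */
                hlsn = be64_to_cpu(iclog->ic_header.h_lsn);
                if (hlsn && XFS_LSN_CMP(hlsn, ctx->commit_lsn) < 0) {
                        /* xlog_wait_on_iclog() drops l_icloglock */
                        xlog_wait_on_iclog(iclog);
                        spin_lock(&log->l_icloglock);
                }

                commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
        }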

>                         ASSERT(XFS_LSN_CMP(hlsn, ctx->commit_lsn) < 0);
>                         xlog_wait_on_iclog(iclog);
>                         spin_lock(&log->l_icloglock);
>                 }
> 
>                 /*
>                  * Regardless of whether we need to wait or not, the
>                  * commit_iclog write needs to issue a pre-flush so that the
>                  * ordering for this checkpoint is correctly preserved down to
>                  * stable storage.
>                  */
>                 commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>         }
> 
> 
> > Hmm.. that also has me wondering if it's possible that the previous
> > iclog might be some other future checkpoint, but then skipping the wait
> > entirely means the checkpoint might not actually be on-disk [1]. I
> 
> If the previous iclog is a future iclog, then all iclogs are future
> iclogs and this iclog is the oldest. The iclog ring only ages in one
> direction....
> 

Hm, yes that makes sense. I might have been confusing the new push
concurrency with the fact that all log writers still must contend over
the head iclog.

> > wonder if we need to do something like track the last lsn and/or iclog
> > touched by the current checkpoint and make sure we wait on that,
> > regardless of where it is in the ring (but as above, I'm still wrapping
> > my head around all of this).
> 
> We have to guarantee the log is completely stable up to the commit
> record, regardless of which checkpoint the iclogs belong to.  If we
> don't do that, then log recovery will consider that the checkpoint
> in the log is not complete before it finds the commit record. i.e.
> there will be a hole in the log where the previous cycle numbers
> show through between the end of the checkpoint and the commit
> record, and hence the head of the log will be pointed at the last
> fully completed checkpoint, not this one.
> 

Yet another thus far undocumented side effect of concurrent pushes..

Yes, I know that recovery has always stamped over incomplete record
blocks out beyond the prospective head block. I'm referring to the fact
that recovery clearly has some historical warts that conflict with this
behavior change in the runtime logging code, and suggesting that the
upstream merge cycle should not be the avenue for identifying and
resolving these problems.

It's rather evident to me at this point that this change should be
deferred, tested much more thoroughly (not because it wasn't tested or
anything of that nature, but because our log/crash recovery testing is
historically lacking and crude) and reposted with these issues clearly
documented and addressed, at the very least in a manner that allows for
proper consideration of potential alternatives, etc., as opposed to
racing with an arbitrary upstream release deadline.

(This is of course out of my hands as it's not my tree, not my call,
etc. Just my opinion. I've thrown your latest series into my testbed
regardless.)

> Hence we have to ensure all prior iclogs have completed to stable
> storage before writing the commit record, regardless of whether the
> previous iclogs are part of the current checkpoint or not.
> 
> > [1] Once we work out the functional behavior one way or another, how do
> > we actually test/confirm that the flush behavior works as expected? I
> > feel as though the discussions going all the way to the initial review
> > feedback on this change have provided constant reminders that things
> > like the I/O flush / ordering dependencies are very easy to get wrong.
> > If that is ever the case, it's not clear to me we have any way to detect
> > it. Not that I'd expect to have some explicit 5 second fstest for that
> > problem, but have we analyzed whether the few decent crash recovery
> > tests we do have are sensitive enough to fail? I wonder if we could come
> > up with something clever and more explicit via error injection..
> 
> Given that we've found these problems soon after the code was
> integrated and made available for wider testing, I think it shows
> that, at the integration level, we have sufficiently sensitive crash
> and crash recovery tests to discover issues like this. It's not
> surprising that we may not find things like this at the
> individual developer level because they only have so much time and
> resources at their disposal.
> 

What do the bugs that were found have to do with test coverage for the
updated flush semantics? I called that out separately precisely because
recent patches demonstrate how the logic is subtle and easy to break,
but I don't see any real coverage for it at all. So it would be nice to
think about how we might improve on that.

> IOWs, I think the process is working exactly the way it should be
> working. It is expected that complex, subtle problems like these
> will be found (if they exist) once more people are actively using
> and testing the code in a variety of different environments. This is
> exactly the function that integration testing is supposed to
> provide...
> 
> So while it's possible that we could have more reliable tests to
> exercise this stuff (no real idea how, though), the way we are
> developing and merging and finding much-harder-to-trigger problems
> once the code is available for wider testing shows that our
> engineering process is working as it should be....
> 

Yeah, that's wishful thinking. I made a point to run some tests I've
recently been hammering on (for that bli patch I still can't seem to
progress) because 1) I know we have a bunch of logging changes in the
pipeline that I otherwise haven't been able to keep up with, and 2) I'm
about to disappear for a few months on leave at any moment and so wanted
to prioritize some testing to at least help verify we maintain
correctness or otherwise get any issues reported to the list. And TBH,
if it weren't for 2), this is probably not something I would have
prioritized over some other longer term development work in the first
place.

I'd like to hope that we would have caught these issues before release
regardless, but let's be realistic: if developer testing doesn't catch
it and upstream review/merge cycle testing doesn't catch it, then for
crash and recovery issues (no regular users are experimenting with
failure and recovery of their filesystems to see if we've found new ways
to lose their data) the odds that somebody catches it before it's
released to the wild go way down.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
