Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing

Brian Foster <bfoster@xxxxxxxxxx> · Thu, 25 Feb 2021 08:12:12 -0500

On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote:
> On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > The AIL pushing is stalling on log forces when it comes across
> > pinned items. This is happening on removal workloads where the AIL
> > is dominated by stale items that are removed from AIL when the
> > checkpoint that marks the items stale is committed to the journal.
> > This results is relatively few items in the AIL, but those that are
> > are often pinned as directories items are being removed from are
> > still being logged.
> > 
> > As a result, many push cycles through the CIL will first issue a
> > blocking log force to unpin the items. This can take some time to
> > complete, with tracing regularly showing push delays of half a
> > second and sometimes up into the range of several seconds. Sequences
> > like this aren't uncommon:
> > 
> > ....
> >  399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
> > <wanted 20ms, got 270ms delay>
> >  400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
> >  400.099623:  xfsaild: first lsn 0x11002f3600
> >  400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
> > <wanted 50ms, got 500ms delay>
> >  400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
> >  400.589349:  xfsaild: first lsn 0x1100305000
> >  400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
> > <wanted 50ms, got 460ms delay>
> >  400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
> >  400.950343:  xfsaild: first lsn 0x1100317c00
> >  400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
> > <wanted 20ms, got 200ms delay>
> >  401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
> >  401.142334:  xfsaild: first lsn 0x110032e600
> >  401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
> > <wanted 10ms, got 10ms delay>
> >  401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
> >  401.154328:  xfsaild: first lsn 0x1100353000
> >  401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
> > <wanted 20ms, got 300ms delay>
> >  401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> >  401.451526:  xfsaild: first lsn 0x1100353000
> >  401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
> > <wanted 50ms, got 500ms delay>
> >  401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> > ....
> > 
> > In each of these cases, every AIL pass saw 101 log items stuck on
> > the AIL (pinned) with very few other items being found. Each pass, a
> > log force was issued, and delay between last/first is the sleep time
> > + the sync log force time.
> > 
> > Some of these 101 items pinned the tail of the log. The tail of the
> > log does slowly creep forward (first lsn), but the problem is that
> > the log is actually out of reservation space because it's been
> > running so many transactions that stale items that never reach the
> > AIL but consume log space. Hence we have a largely empty AIL, with
> > long term pins on items that pin the tail of the log that don't get
> > pushed frequently enough to keep log space available.
> > 
> > The problem is the hundreds of milliseconds that we block in the log
> > force pushing the CIL out to disk. The AIL should not be stalled
> > like this - it needs to run and flush items that are at the tail of
> > the log with minimal latency. What we really need to do is trigger a
> > log flush, but then not wait for it at all - we've already done our
> > waiting for stuff to complete when we backed off prior to the log
> > force being issued.
> > 
> > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> > still do a blocking flush of the CIL and that is what is causing the
> > issue. Hence we need a new interface for the CIL to trigger an
> > immediate background push of the CIL to get it moving faster but not
> > to wait on that to occur. While the CIL is pushing, the AIL can also
> > be pushing.
> > 
> > We already have an internal interface to do this -
> > xlog_cil_push_now() - but we need a wrapper for it to be used
> > externally. xlog_cil_force_seq() can easily be extended to do what
> > we need as it already implements the synchronous CIL push via
> > xlog_cil_push_now(). Add the necessary flags and "push current
> > sequence" semantics to xlog_cil_force_seq() and convert the AIL
> > pushing to use it.
> 
> Overnight testing indicates generic/530 hangs on small logs with
> this patch. It essentially runs out of log space because there are
> inode cluster buffers permanently pinned by this workload.
> 
> That is, as inodes are unlinked, they repeatedly relog the inode
> cluster buffer, and so while the CIL is flushing to unpin the
> buffer, a new unlink relogs it and adds a new pin to the buffer.
> Hence the log force that the AIL does to unpin pinned items doesn't
> actually unpin buffers that are being relogged.
> 
> These buffers only get unpinned when the log actually runs out of
> space because they pin the tail of the log. Then the modifications
> that are relogging the buffer stop because they fail to get a log
> reservation, and the tail of the log does not move forward until the
> AIL issues a log force that finally unpins the buffer and writes it
> back.
> 
> Essentially, relogged buffers cannot be flushed by the AIL until the
> log completely stalls.
> 
> The problem being tripped over here is that we no longer force the
> final iclogs in a CIL push to disk - we leave the iclog with the
> commit record in it in ACTIVE state, and by not waiting and forcing
> all the iclogs to disk, the buffer never gets unpinned because there
> isn't any more log pressure to force it out because everything is
> blocked on reservation space.
> 
> The solution to this is to have the CIL push change the state of the
> commit iclog to WANT_SYNC before it is released. This means the CIL
> push will always flush the iclog to disk and the checkpoint will
> complete and unpin the buffers.
> 
> Right now, we really only want to do this state switch for these
> async pushes - for small sync transactions and fsync we really want
> the iclog aggregation that we have now to optimise iclogbuf usage,
> so I'll have to pass a new flag through the push code and back into
> xlog_write(). That will make this patch behave the same as we
> currently do.
> 

Unfortunately I've not yet caught up to reviewing your most recently
posted set of log patches so I can easily be missing some context, but
when passing through some of the feedback/updates so far this has me
rather confused. We discussed this pre-existing CIL behavior in the
previous version, I suggested some similar potential behavior change
where we would opportunistically send off checkpoint iclogs for I/O a
bit earlier than normal and you argued [1] against it. Now it sounds
like not only are we implementing that, but it's actually necessary to
fix a log hang problem..? What am I missing?

The updated iclog behavior does sound more friendly to me than what we
currently do (obviously, based on my previous comments), but I am
slightly skeptical of how such a change fixes the root cause of a hang.
Is this a stall/perf issue or an actual log deadlock? If the latter,
what prevents this deadlock on current upstream?

Brian

[1] https://lore.kernel.org/linux-xfs/20210129222559.GT4662@xxxxxxxxxxxxxxxxxxx/

> In the longer term, we need to change how the AIL deals with pinned
> buffers, as the current method is definitely sub-optimal. It also
> explains the "everything stops for a second" performance anomalies
> I've been seeing for a while in testing. But fixing that is outside
> the scope of this patchset, so in the mean time I'll fix this one up
> and repost it in a little while.
> 
> FWIW, this is also the likely cause of the "CIL workqueue too deep"
> hangs I was seeing with the next patch in the series, too.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
>