Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 25 Feb 2021 08:10:58 +1100

On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> The AIL pushing is stalling on log forces when it comes across
> pinned items. This is happening on removal workloads where the AIL
> is dominated by stale items that are removed from AIL when the
> checkpoint that marks the items stale is committed to the journal.
> This results is relatively few items in the AIL, but those that are
> are often pinned as directories items are being removed from are
> still being logged.
> 
> As a result, many push cycles through the CIL will first issue a
> blocking log force to unpin the items. This can take some time to
> complete, with tracing regularly showing push delays of half a
> second and sometimes up into the range of several seconds. Sequences
> like this aren't uncommon:
> 
> ....
>  399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 270ms delay>
>  400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
>  400.099623:  xfsaild: first lsn 0x11002f3600
>  400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
> <wanted 50ms, got 500ms delay>
>  400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
>  400.589349:  xfsaild: first lsn 0x1100305000
>  400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
> <wanted 50ms, got 460ms delay>
>  400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
>  400.950343:  xfsaild: first lsn 0x1100317c00
>  400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 200ms delay>
>  401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
>  401.142334:  xfsaild: first lsn 0x110032e600
>  401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
> <wanted 10ms, got 10ms delay>
>  401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
>  401.154328:  xfsaild: first lsn 0x1100353000
>  401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 300ms delay>
>  401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
>  401.451526:  xfsaild: first lsn 0x1100353000
>  401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
> <wanted 50ms, got 500ms delay>
>  401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> ....
> 
> In each of these cases, every AIL pass saw 101 log items stuck on
> the AIL (pinned) with very few other items being found. Each pass, a
> log force was issued, and delay between last/first is the sleep time
> + the sync log force time.
> 
> Some of these 101 items pinned the tail of the log. The tail of the
> log does slowly creep forward (first lsn), but the problem is that
> the log is actually out of reservation space because it's been
> running so many transactions that stale items that never reach the
> AIL but consume log space. Hence we have a largely empty AIL, with
> long term pins on items that pin the tail of the log that don't get
> pushed frequently enough to keep log space available.
> 
> The problem is the hundreds of milliseconds that we block in the log
> force pushing the CIL out to disk. The AIL should not be stalled
> like this - it needs to run and flush items that are at the tail of
> the log with minimal latency. What we really need to do is trigger a
> log flush, but then not wait for it at all - we've already done our
> waiting for stuff to complete when we backed off prior to the log
> force being issued.
> 
> Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> still do a blocking flush of the CIL and that is what is causing the
> issue. Hence we need a new interface for the CIL to trigger an
> immediate background push of the CIL to get it moving faster but not
> to wait on that to occur. While the CIL is pushing, the AIL can also
> be pushing.
> 
> We already have an internal interface to do this -
> xlog_cil_push_now() - but we need a wrapper for it to be used
> externally. xlog_cil_force_seq() can easily be extended to do what
> we need as it already implements the synchronous CIL push via
> xlog_cil_push_now(). Add the necessary flags and "push current
> sequence" semantics to xlog_cil_force_seq() and convert the AIL
> pushing to use it.

Overnight testing indicates generic/530 hangs on small logs with
this patch. It essentially runs out of log space because there are
inode cluster buffers permanently pinned by this workload.

That is, as inodes are unlinked, they repeatedly relog the inode
cluster buffer, and so while the CIL is flushing to unpin the
buffer, a new unlink relogs it and adds a new pin to the buffer.
Hence the log force that the AIL does to unpin pinned items doesn't
actually unpin buffers that are being relogged.

These buffers only get unpinned when the log actually runs out of
space because they pin the tail of the log. Then the modifications
that are relogging the buffer stop because they fail to get a log
reservation, and the tail of the log does not move forward until the
AIL issues a log force that finally unpins the buffer and writes it
back.

Essentially, relogged buffers cannot be flushed by the AIL until the
log completely stalls.

The problem being tripped over here is that we no longer force the
final iclogs in a CIL push to disk - we leave the iclog with the
commit record in it in ACTIVE state, and by not waiting and forcing
all the iclogs to disk, the buffer never gets unpinned because there
isn't any more log pressure to force it out because everything is
blocked on reservation space.

The solution to this is to have the CIL push change the state of the
commit iclog to WANT_SYNC before it is released. This means the CIL
push will always flush the iclog to disk and the checkpoint will
complete and unpin the buffers.

Right now, we really only want to do this state switch for these
async pushes - for small sync transactions and fsync we really want
the iclog aggregation that we have now to optimise iclogbuf usage,
so I'll have to pass a new flag through the push code and back into
xlog_write(). That will make this patch behave the same as we
currently do.

In the longer term, we need to change how the AIL deals with pinned
buffers, as the current method is definitely sub-optimal. It also
explains the "everything stops for a second" performance anomalies
I've been seeing for a while in testing. But fixing that is outside
the scope of this patchset, so in the mean time I'll fix this one up
and repost it in a little while.

FWIW, this is also the likely cause of the "CIL workqueue too deep"
hangs I was seeing with the next patch in the series, too.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx