On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote: > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > The AIL pushing is stalling on log forces when it comes across > > pinned items. This is happening on removal workloads where the AIL > > is dominated by stale items that are removed from AIL when the > > checkpoint that marks the items stale is committed to the journal. > > This results is relatively few items in the AIL, but those that are > > are often pinned as directories items are being removed from are > > still being logged. > > > > As a result, many push cycles through the CIL will first issue a > > blocking log force to unpin the items. This can take some time to > > complete, with tracing regularly showing push delays of half a > > second and sometimes up into the range of several seconds. Sequences > > like this aren't uncommon: > > > > .... > > 399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20 > > <wanted 20ms, got 270ms delay> > > 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0 > > 400.099623: xfsaild: first lsn 0x11002f3600 > > 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50 > > <wanted 50ms, got 500ms delay> > > 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0 > > 400.589349: xfsaild: first lsn 0x1100305000 > > 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50 > > <wanted 50ms, got 460ms delay> > > 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0 > > 400.950343: xfsaild: first lsn 0x1100317c00 > > 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20 > > <wanted 20ms, got 200ms delay> > > 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0 > > 401.142334: xfsaild: first lsn 0x110032e600 > > 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10 > > <wanted 10ms, got 10ms delay> > > 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000 > > 401.154328: xfsaild: first lsn 0x1100353000 > > 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20 > > <wanted 20ms, got 300ms delay> > > 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 > > 401.451526: xfsaild: first lsn 0x1100353000 > > 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50 > > <wanted 50ms, got 500ms delay> > > 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 > > .... > > > > In each of these cases, every AIL pass saw 101 log items stuck on > > the AIL (pinned) with very few other items being found. Each pass, a > > log force was issued, and delay between last/first is the sleep time > > + the sync log force time. > > > > Some of these 101 items pinned the tail of the log. The tail of the > > log does slowly creep forward (first lsn), but the problem is that > > the log is actually out of reservation space because it's been > > running so many transactions that stale items that never reach the > > AIL but consume log space. Hence we have a largely empty AIL, with > > long term pins on items that pin the tail of the log that don't get > > pushed frequently enough to keep log space available. > > > > The problem is the hundreds of milliseconds that we block in the log > > force pushing the CIL out to disk. The AIL should not be stalled > > like this - it needs to run and flush items that are at the tail of > > the log with minimal latency. What we really need to do is trigger a > > log flush, but then not wait for it at all - we've already done our > > waiting for stuff to complete when we backed off prior to the log > > force being issued. > > > > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we > > still do a blocking flush of the CIL and that is what is causing the > > issue. Hence we need a new interface for the CIL to trigger an > > immediate background push of the CIL to get it moving faster but not > > to wait on that to occur. While the CIL is pushing, the AIL can also > > be pushing. > > > > We already have an internal interface to do this - > > xlog_cil_push_now() - but we need a wrapper for it to be used > > externally. xlog_cil_force_seq() can easily be extended to do what > > we need as it already implements the synchronous CIL push via > > xlog_cil_push_now(). Add the necessary flags and "push current > > sequence" semantics to xlog_cil_force_seq() and convert the AIL > > pushing to use it. > > Overnight testing indicates generic/530 hangs on small logs with > this patch. It essentially runs out of log space because there are > inode cluster buffers permanently pinned by this workload. > > That is, as inodes are unlinked, they repeatedly relog the inode > cluster buffer, and so while the CIL is flushing to unpin the > buffer, a new unlink relogs it and adds a new pin to the buffer. > Hence the log force that the AIL does to unpin pinned items doesn't > actually unpin buffers that are being relogged. > > These buffers only get unpinned when the log actually runs out of > space because they pin the tail of the log. Then the modifications > that are relogging the buffer stop because they fail to get a log > reservation, and the tail of the log does not move forward until the > AIL issues a log force that finally unpins the buffer and writes it > back. > > Essentially, relogged buffers cannot be flushed by the AIL until the > log completely stalls. > > The problem being tripped over here is that we no longer force the > final iclogs in a CIL push to disk - we leave the iclog with the > commit record in it in ACTIVE state, and by not waiting and forcing > all the iclogs to disk, the buffer never gets unpinned because there > isn't any more log pressure to force it out because everything is > blocked on reservation space. > > The solution to this is to have the CIL push change the state of the > commit iclog to WANT_SYNC before it is released. This means the CIL > push will always flush the iclog to disk and the checkpoint will > complete and unpin the buffers. > > Right now, we really only want to do this state switch for these > async pushes - for small sync transactions and fsync we really want > the iclog aggregation that we have now to optimise iclogbuf usage, > so I'll have to pass a new flag through the push code and back into > xlog_write(). That will make this patch behave the same as we > currently do. > Unfortunately I've not yet caught up to reviewing your most recently posted set of log patches so I can easily be missing some context, but when passing through some of the feedback/updates so far this has me rather confused. We discussed this pre-existing CIL behavior in the previous version, I suggested some similar potential behavior change where we would opportunistically send off checkpoint iclogs for I/O a bit earlier than normal and you argued [1] against it. Now it sounds like not only are we implementing that, but it's actually necessary to fix a log hang problem..? What am I missing? The updated iclog behavior does sound more friendly to me than what we currently do (obviously, based on my previous comments), but I am slightly skeptical of how such a change fixes the root cause of a hang. Is this a stall/perf issue or an actual log deadlock? If the latter, what prevents this deadlock on current upstream? Brian [1] https://lore.kernel.org/linux-xfs/20210129222559.GT4662@xxxxxxxxxxxxxxxxxxx/ > In the longer term, we need to change how the AIL deals with pinned > buffers, as the current method is definitely sub-optimal. It also > explains the "everything stops for a second" performance anomalies > I've been seeing for a while in testing. But fixing that is outside > the scope of this patchset, so in the mean time I'll fix this one up > and repost it in a little while. > > FWIW, this is also the likely cause of the "CIL workqueue too deep" > hangs I was seeing with the next patch in the series, too. > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx >