On Tue, Mar 15, 2022 at 12:13:20PM -0700, Darrick J. Wong wrote: > On Tue, Mar 15, 2022 at 05:42:36PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > AIL flushing can get stuck here: > > > > [316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds. > > [316649.007807] Not tainted 5.17.0-rc6-dgc+ #975 > > [316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > [316649.011720] task:xfsaild/pmem1 state:D stack:14544 pid:324525 ppid: 2 flags:0x00004000 > > [316649.014112] Call Trace: > > [316649.014841] <TASK> > > [316649.015492] __schedule+0x30d/0x9e0 > > [316649.017745] schedule+0x55/0xd0 > > [316649.018681] io_schedule+0x4b/0x80 > > [316649.019683] xfs_buf_wait_unpin+0x9e/0xf0 > > [316649.021850] __xfs_buf_submit+0x14a/0x230 > > [316649.023033] xfs_buf_delwri_submit_buffers+0x107/0x280 > > [316649.024511] xfs_buf_delwri_submit_nowait+0x10/0x20 > > [316649.025931] xfsaild+0x27e/0x9d0 > > [316649.028283] kthread+0xf6/0x120 > > [316649.030602] ret_from_fork+0x1f/0x30 > > > > in the situation where flushing gets preempted between the unpin > > check and the buffer trylock under nowait conditions: > > > > blk_start_plug(&plug); > > list_for_each_entry_safe(bp, n, buffer_list, b_list) { > > if (!wait_list) { > > if (xfs_buf_ispinned(bp)) { > > pinned++; > > continue; > > } > > Here >>>>>> > > if (!xfs_buf_trylock(bp)) > > continue; > > > > This means submission is stuck until something else triggers a log > > force to unpin the buffer. > > > > To get onto the delwri list to begin with, the buffer pin state has > > already been checked, and hence it's relatively rare we get a race > > between flushing and encountering a pinned buffer in delwri > > submission to begin with. Further, to increase the pin count the > > buffer has to be locked, so the only way we can hit this race > > without failing the trylock is to be preempted between the pincount > > check seeing zero and the trylock being run. > > > > Hence to avoid this problem, just invert the order of trylock vs > > pin check. We shouldn't hit that many pinned buffers here, so > > optimising away the trylock for pinned buffers should not matter for > > performance at all. > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > > --- > > fs/xfs/xfs_buf.c | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > index b45e0d50a405..8867f143598e 100644 > > --- a/fs/xfs/xfs_buf.c > > +++ b/fs/xfs/xfs_buf.c > > @@ -2094,12 +2094,13 @@ xfs_buf_delwri_submit_buffers( > > blk_start_plug(&plug); > > list_for_each_entry_safe(bp, n, buffer_list, b_list) { > > if (!wait_list) { > > + if (!xfs_buf_trylock(bp)) > + continue; > > if (xfs_buf_ispinned(bp)) { > > + xfs_buf_unlock(bp); > > pinned++; > > continue; > > Hmm. So I think this means that this function willl skip buffers that > are locked or pinned. The only way that the AIL would encounter this > situation is when a buffer on its list is now locked by a reader thread > or is participating in a transaction. In the reader case this is (one > hopes) ok because the reader won't block on the AIL. > > The tx case is trickier -- transaction allocation can result in an AIL > push if the head is too close to the tail, right? Ordinarily, the AIL > won't find itself unable to write a buffer that's pinning the log > because a transaction holds that buffer -- eventually that tx should > commit, which will unlock the buffer and allow the AIL to make some > progress. > > But -- what if the frontend is running a chained transaction, and it > bjoin'd the buffer to the transaction, tried to roll the transaction, > and the chain runs out of permanent log reservation (because we've > rolled more than logcount times) and we have to wait for more log grant > space? The regrant for the successor tx happens before the commit of > the old tx, so can we livelock the log in this way? > > And doesn't this potential exist regardless of this patch? > > I suspect the answers are 'yes' and 'yes', The answer is yes and yes. The transaction case you talk about is the same as an inode we are running a long tx chain on. Say extent removal on an inode with a few million extents - thinking about this case is somewhat easier to reason about(*) - the inode stays locked across tx commit and is re-joined to the next transaction so the extent removal is atomic from the perspective of the user (i.e. ftruncate() completes before any concurrent IO can make progress). This works from a tx and log perspective because the inode is logged in *every* transaction of the chain, which has the effect of continually moving it forward in the log and AIL as the CIL commits in the background and updates the LSN of the latest modification of the item in the AIL. Hence the item never needs writeback to unpin the tail of the log - the act of committing the latest transaction in the chain will always move it to the head of the log. IOWs, relogging items that remain locked across transaction commits is a requirement of permanent transactions to prevent the deadlock you mention. It's also one of the reasons why we must be able to fit at least two whole checkpoints in the log - so that items that have been relogged can unpin the tail of the log when the second checkpoint completes without requiring writeback of the metadata. There's some more detail in "Introduction to Re-logging in XFS" in Documentation/filesystems/xfs-delayed-logging-design.rst", but the gist of it is above... (*) Buffers have a couple of extra cases where we do have to be *really* careful about rolling transactions. The primary one is INODE_ALLOC buffers, which have to remain pinned in the AIL to their original LSN even when they are relogged (e.g. for unlinked list updates) because we cannot move the tail of the log past the LSN where the inode chunk is initialised on disk without actually initialising the inode chunk on disk. Hence INODE_ALLOC buffers cannot be used as the basis of long running atomic TX chains because they require writeback instead of relogging to unpin the tail of the log. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx