Re: [PATCH 2/7] xfs: check buffer pin state after locking in delwri_submit

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 15 Mar 2022 12:13:20 -0700

On Tue, Mar 15, 2022 at 05:42:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> AIL flushing can get stuck here:
> 
> [316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds.
> [316649.007807]       Not tainted 5.17.0-rc6-dgc+ #975
> [316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [316649.011720] task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
> [316649.014112] Call Trace:
> [316649.014841]  <TASK>
> [316649.015492]  __schedule+0x30d/0x9e0
> [316649.017745]  schedule+0x55/0xd0
> [316649.018681]  io_schedule+0x4b/0x80
> [316649.019683]  xfs_buf_wait_unpin+0x9e/0xf0
> [316649.021850]  __xfs_buf_submit+0x14a/0x230
> [316649.023033]  xfs_buf_delwri_submit_buffers+0x107/0x280
> [316649.024511]  xfs_buf_delwri_submit_nowait+0x10/0x20
> [316649.025931]  xfsaild+0x27e/0x9d0
> [316649.028283]  kthread+0xf6/0x120
> [316649.030602]  ret_from_fork+0x1f/0x30
> 
> in the situation where flushing gets preempted between the unpin
> check and the buffer trylock under nowait conditions:
> 
> 	blk_start_plug(&plug);
> 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
> 		if (!wait_list) {
> 			if (xfs_buf_ispinned(bp)) {
> 				pinned++;
> 				continue;
> 			}
> Here >>>>>>
> 			if (!xfs_buf_trylock(bp))
> 				continue;
> 
> This means submission is stuck until something else triggers a log
> force to unpin the buffer.
> 
> To get onto the delwri list to begin with, the buffer pin state has
> already been checked, and hence it's relatively rare we get a race
> between flushing and encountering a pinned buffer in delwri
> submission to begin with. Further, to increase the pin count the
> buffer has to be locked, so the only way we can hit this race
> without failing the trylock is to be preempted between the pincount
> check seeing zero and the trylock being run.
> 
> Hence to avoid this problem, just invert the order of trylock vs
> pin check. We shouldn't hit that many pinned buffers here, so
> optimising away the trylock for pinned buffers should not matter for
> performance at all.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_buf.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index b45e0d50a405..8867f143598e 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -2094,12 +2094,13 @@ xfs_buf_delwri_submit_buffers(
>  	blk_start_plug(&plug);
>  	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
>  		if (!wait_list) {
> +			if (!xfs_buf_trylock(bp))
 +				continue;
>  			if (xfs_buf_ispinned(bp)) {
> +				xfs_buf_unlock(bp);
>  				pinned++;
>  				continue;

Hmm.  So I think this means that this function willl skip buffers that
are locked or pinned.  The only way that the AIL would encounter this
situation is when a buffer on its list is now locked by a reader thread
or is participating in a transaction.  In the reader case this is (one
hopes) ok because the reader won't block on the AIL.

The tx case is trickier -- transaction allocation can result in an AIL
push if the head is too close to the tail, right?  Ordinarily, the AIL
won't find itself unable to write a buffer that's pinning the log
because a transaction holds that buffer -- eventually that tx should
commit, which will unlock the buffer and allow the AIL to make some
progress.

But -- what if the frontend is running a chained transaction, and it
bjoin'd the buffer to the transaction, tried to roll the transaction,
and the chain runs out of permanent log reservation (because we've
rolled more than logcount times) and we have to wait for more log grant
space?  The regrant for the successor tx happens before the commit of
the old tx, so can we livelock the log in this way?

And doesn't this potential exist regardless of this patch?

I suspect the answers are 'yes' and 'yes', which means this patch is ok
to move forward, but this has been bugging me since the V1 of this
patch, which is where I got stuck. :/

--D

>  			}
> -			if (!xfs_buf_trylock(bp))
> -				continue;
>  		} else {
>  			xfs_buf_lock(bp);
>  		}
> -- 
> 2.35.1
>