On Thu, Jun 14, 2018 at 08:08:07AM +1000, Dave Chinner wrote:
> On Wed, Jun 13, 2018 at 07:05:16AM -0400, Brian Foster wrote:
> > If a delwri queue occurs for a buffer that already sits on a delwri
> > queue wait list, the queue sets _XBF_DELWRI_Q without changing the
> > state of ->b_list. This occurs, for example, if another thread
> > beats the current delwri waiter thread to the buffer lock after I/O
> > completion. Once the waiter acquires the lock, it removes the
> > buffer from the wait list and leaves a buffer with _XBF_DELWRI_Q
> > set but not populated on a list. This results in a lost buffer
> > submission and in turn can result in assert failures due to
> > _XBF_DELWRI_Q being set on buffer reclaim, or filesystem lockups if
> > the buffer happens to cover an item in the AIL.
> >
> > This problem has been reproduced by repeated iterations of xfs/305
> > on high CPU count (28xcpu) systems with limited memory (~1GB).
> > Dirty dquot reclaim races with an xfsaild push of a separate dquot
> > backed by the same buffer such that the buffer sits on the reclaim
> > wait list at the time xfsaild attempts to queue it. Since the
> > latter dquot has been flush locked but the underlying buffer not
> > submitted for I/O, the dquot pins the AIL and causes the filesystem
> > to livelock.
> >
> > This race is essentially made possible by the buffer lock cycle
> > involved with waiting on a synchronous delwri queue submission.
> > Close the race by using synchronous buffer I/O for the respective
> > delwri queue submission. This means the buffer remains locked
> > across the I/O and so is inaccessible from other contexts while in
> > the intermediate wait list state. The sync buffer I/O wait
> > mechanism is factored into a helper such that sync delwri buffer
> > submission and serialization are batched operations.
> >
> > Designed-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > ---
>
> Just something I noticed on an initial brief scan:
>
> > @@ -2013,21 +2017,22 @@ xfs_buf_delwri_submit_buffers(
> >  		trace_xfs_buf_delwri_split(bp, _RET_IP_);
> >
> >  		/*
> > -		 * We do all IO submission async. This means if we need
> > -		 * to wait for IO completion we need to take an extra
> > -		 * reference so the buffer is still valid on the other
> > -		 * side. We need to move the buffer onto the io_list
> > -		 * at this point so the caller can still access it.
> > +		 * If we have a wait list, each buffer (and associated delwri
> > +		 * queue reference) transfers to it and is submitted
> > +		 * synchronously. Otherwise, drop the buffer from the delwri
> > +		 * queue and submit async.
> >  		 */
> >  		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
> > -		bp->b_flags |= XBF_WRITE | XBF_ASYNC;
> > +		bp->b_flags |= XBF_WRITE;
> >  		if (wait_list) {
> > -			xfs_buf_hold(bp);
> > +			bp->b_flags &= ~XBF_ASYNC;
> >  			list_move_tail(&bp->b_list, wait_list);
> > -		} else
> > +			__xfs_buf_submit(bp);
>
> We lose a buffer submission tracepoint here.
>

Yeah, good catch. What do you think about just killing
trace_xfs_buf_submit_wait() and pushing trace_xfs_buf_submit() down
into the new helper? It looks like we can already distinguish the I/O
type based on ->b_flags.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
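
To make the suggested tracepoint consolidation concrete, here is a
minimal sketch (not the actual patch; the helper body below and the
xfs_buf_iowait() name are assumptions based on the "sync buffer I/O
wait mechanism ... factored into a helper" described in the commit
message). The idea is that a single trace_xfs_buf_submit() call in the
common helper replaces the separate sync/async tracepoints, since
XBF_ASYNC in ->b_flags already records which kind of submission this
is:

	/*
	 * Hypothetical sketch: common submission helper with a single
	 * tracepoint. XBF_ASYNC distinguishes sync from async I/O in
	 * the trace data, so trace_xfs_buf_submit_wait() can go away.
	 */
	static int
	__xfs_buf_submit(
		struct xfs_buf	*bp)
	{
		int		error = 0;

		/* one tracepoint covers both I/O types */
		trace_xfs_buf_submit(bp, _RET_IP_);

		ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));

		/* ... kick off the I/O (elided) ... */

		/* sync submission: hold the lock and wait here */
		if (!(bp->b_flags & XBF_ASYNC))
			error = xfs_buf_iowait(bp);

		return error;
	}

With the tracepoint in the helper, delwri submission and the regular
sync/async submission paths would all emit the same trace event, so
the trace stream stays consistent regardless of which caller submits
the buffer.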