On Wed, Jul 13, 2016 at 07:32:26AM -0400, Brian Foster wrote:
> On Wed, Jul 13, 2016 at 09:57:52AM +1000, Dave Chinner wrote:
> > On Tue, Jul 12, 2016 at 01:22:59PM -0400, Brian Foster wrote:
> > > On Tue, Jul 12, 2016 at 08:03:15AM -0400, Brian Foster wrote:
> > > > On Tue, Jul 12, 2016 at 08:44:51AM +1000, Dave Chinner wrote:
> > > > > On Mon, Jul 11, 2016 at 11:29:22AM -0400, Brian Foster wrote:
> > > > > > On Mon, Jul 11, 2016 at 09:52:52AM -0400, Brian Foster wrote:
> > > > > > ...
> > > > > > > So what is your preference out of the possible approaches here? AFAICS,
> > > > > > > we have the following options:
> > > > > > >
> > > > > > > 1.) The original "add readahead to LRU early" approach.
> > > > > > > 	Pros: simple one-liner
> > > > > > > 	Cons: bit of a hack, only covers readahead scenario
> > > > > > > 2.) Defer I/O count decrement to buffer release (this patch).
> > > > > > > 	Pros: should cover all cases (reads/writes)
> > > > > > > 	Cons: more complex (requires per-buffer accounting, etc.)
> > > > > > > 3.) Raw (buffer or bio?) I/O count (no defer to buffer release)
> > > > > > > 	Pros: eliminates some complexity from #2
> > > > > > > 	Cons: still more complex than #1, racy in that decrement does
> > > > > > > 	not serialize against LRU addition (requires drain_workqueue(),
> > > > > > > 	which still doesn't cover error conditions)
> > > > > > >
> > > > > > > As noted above, option #3 also allows for either a buffer based count or
> > > > > > > bio based count, the latter of which might simplify things a bit further
> > > > > > > (TBD). Thoughts?
> > > > >
> > > > > Pretty good summary :P
> > > > >
> > > > > > FWIW, the following is a slightly cleaned up version of my initial
> > > > > > approach (option #3 above). Note that the flag is used to help deal with
> > > > > > varying ioend behavior. E.g., xfs_buf_ioend() is called once for some
> > > > > > buffers, multiple times for others with an iodone callback, that
> > > > > > behavior changes in some cases when an error is set, etc. (I'll add
> > > > > > comments before an official post.)
> > > > >
> > > > > The approach looks good - I think there's a couple of things we can
> > > > > do to clean it up and make it robust. Comments inline.
> > > > >
> > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > > index 4665ff6..45d3ddd 100644
> > > > > > --- a/fs/xfs/xfs_buf.c
> > > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > > @@ -1018,7 +1018,10 @@ xfs_buf_ioend(
> > > > > >  
> > > > > >  	trace_xfs_buf_iodone(bp, _RET_IP_);
> > > > > >  
> > > > > > -	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
> > > > > > +	if (bp->b_flags & XBF_IN_FLIGHT)
> > > > > > +		percpu_counter_dec(&bp->b_target->bt_io_count);
> > > > > > +
> > > > > > +	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD | XBF_IN_FLIGHT);
> > > > > >  
> > > > > >  	/*
> > > > > >  	 * Pull in IO completion errors now. We are guaranteed to be running
> > > > >
> > > > > I think the XBF_IN_FLIGHT can be moved to the final xfs_buf_rele()
> > > > > processing if:
> > > > >
> > > > > > @@ -1341,6 +1344,11 @@ xfs_buf_submit(
> > > > > >  	 * xfs_buf_ioend too early.
> > > > > >  	 */
> > > > > >  	atomic_set(&bp->b_io_remaining, 1);
> > > > > > +	if (bp->b_flags & XBF_ASYNC) {
> > > > > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > > > +		bp->b_flags |= XBF_IN_FLIGHT;
> > > > > > +	}
> > > > >
> > > > > You change this to:
> > > > >
> > > > > 	if (!(bp->b_flags & XBF_IN_FLIGHT)) {
> > > > > 		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > > > 		bp->b_flags |= XBF_IN_FLIGHT;
> > > > > 	}
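For context on where bt_io_count gets consumed: the idea is that the
unmount path can wait on it so buffer I/O completion can't race with
buftarg teardown. Roughly the following in xfs_wait_buftarg() - a sketch
only, the counter init/destroy and exact placement are TBD:

	/*
	 * Sketch: wait for all in-flight buffer I/O to drain, then flush
	 * the completion workqueue so ioend processing is finished before
	 * we start walking the LRU.
	 */
	while (percpu_counter_sum(&btp->bt_io_count))
		delay(100);
	flush_workqueue(btp->bt_mount->m_buf_workqueue);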
> > > > Ok, so use the flag to cap the I/O count and defer the decrement to
> > > > release. I think that should work and addresses the raciness issue. I'll
> > > > give it a try.
> > > >
> > >
> > > This appears to be doable, but it reintroduces some ugliness from the
> > > previous approach.
> >
> > Ah, so it does. Bugger.
> >
> > > For example, we have to start filtering out uncached
> > > buffers again (if we defer the decrement to release, we must handle
> > > never-released buffers one way or another).
> >
> > So the problem is limited to the superblock buffer and the iclog
> > buffers, right? How about making that special case explicit via a
> > flag set on the buffer? e.g. XBF_NO_IOCOUNT. That way the exceptions
> > are clearly spelt out, rather than avoiding all uncached buffers?
>
> I think so. I considered a similar approach earlier, but I didn't want
> to spend time tracking down the associated users until the broader
> approach was nailed down. More specifically, I think we could set
> b_lru_ref to 0 on those buffers and use that to bypass the accounting.
> That makes it clear that these buffers are not destined for the LRU and
> alternative synchronization is required (which already exists in the
> form of lock cycles).
>

It occurs to me that this probably won't work in all cases because
b_lru_ref is decremented naturally as part of the LRU shrinker
mechanism. Disregard this, I'll look into the flag..

Brian
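FWIW, if the flag pans out I'd expect the special cases to end up
looking something like the following. Sketch only - the flag value is
hypothetical and the exact call sites for the superblock/iclog buffers
are still to be tracked down:

	#define XBF_NO_IOCOUNT	(1 << 3)	/* hypothetical: bypass in-flight accounting */

	/* e.g., when setting up a never-released uncached buffer */
	bp = xfs_buf_get_uncached(mp->m_ddev_targp, numblks, 0);
	if (bp)
		bp->b_flags |= XBF_NO_IOCOUNT;	/* no LRU; lock cycles serialize vs. unmount */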
> The rest of the feedback makes sense, so I'll incorporate that and give
> the above a try... thanks.
>
> Brian
>
> > > Also, given the feedback on
> > > the previous patch with regard to filtering out non-new buffers from the
> > > I/O count, I've dropped that and replaced it with updates to
> > > xfs_buf_rele() to decrement when the buffer is returned to the LRU (we
> > > either have to filter out buffers already on the LRU at submit time or
> > > make sure that they are decremented when released back to the LRU).
> > >
> > > Code follows...
> > >
> > > Brian
> > >
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 4665ff6..b7afbac 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -80,6 +80,25 @@ xfs_buf_vmap_len(
> > >  }
> > >  
> > >  /*
> > > + * Clear the in-flight state on a buffer about to be released to the LRU or
> > > + * freed and unaccount from the buftarg. The buftarg I/O count maintains a count
> > > + * of held buffers that have undergone at least one I/O in the current hold
> > > + * cycle (e.g., not a total I/O count). This provides protection against unmount
> > > + * for buffer I/O completion (see xfs_wait_buftarg()) processing.
> > > + */
> > > +static inline void
> > > +xfs_buf_rele_in_flight(
> > > +	struct xfs_buf	*bp)
> >
> > Not sure about the name: xfs_buf_ioacct_dec()?
> >
> > > +{
> > > +	if (!(bp->b_flags & _XBF_IN_FLIGHT))
> > > +		return;
> > > +
> > > +	ASSERT(bp->b_flags & XBF_ASYNC);
> > > +	bp->b_flags &= ~_XBF_IN_FLIGHT;
> > > +	percpu_counter_dec(&bp->b_target->bt_io_count);
> > > +}
> > > +
> > > +/*
> > >  * When we mark a buffer stale, we remove the buffer from the LRU and clear the
> > >  * b_lru_ref count so that the buffer is freed immediately when the buffer
> > >  * reference count falls to zero. If the buffer is already on the LRU, we need
> > > @@ -866,30 +885,37 @@ xfs_buf_hold(
> > >  }
> > >  
> > >  /*
> > > - * Releases a hold on the specified buffer. If the
> > > - * the hold count is 1, calls xfs_buf_free.
> > > + * Release a hold on the specified buffer. If the hold count is 1, the buffer is
> > > + * placed on LRU or freed (depending on b_lru_ref).
> > >  */
> > >  void
> > >  xfs_buf_rele(
> > >  	xfs_buf_t	*bp)
> > >  {
> > >  	struct xfs_perag	*pag = bp->b_pag;
> > > +	bool			release;
> > > +	bool			freebuf = false;
> > >  
> > >  	trace_xfs_buf_rele(bp, _RET_IP_);
> > >  
> > >  	if (!pag) {
> > >  		ASSERT(list_empty(&bp->b_lru));
> > >  		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
> > > -		if (atomic_dec_and_test(&bp->b_hold))
> > > +		if (atomic_dec_and_test(&bp->b_hold)) {
> > > +			xfs_buf_rele_in_flight(bp);
> > >  			xfs_buf_free(bp);
> > > +		}
> > >  		return;
> > >  	}
> > >  
> > >  	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
> > >  
> > >  	ASSERT(atomic_read(&bp->b_hold) > 0);
> > > -	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
> > > -		spin_lock(&bp->b_lock);
> > > +
> > > +	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> > > +	spin_lock(&bp->b_lock);
> > > +	if (release) {
> > > +		xfs_buf_rele_in_flight(bp);
> > >  		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
> > >  			/*
> > >  			 * If the buffer is added to the LRU take a new
> > > @@ -900,7 +926,6 @@ xfs_buf_rele(
> > >  			bp->b_state &= ~XFS_BSTATE_DISPOSE;
> > >  			atomic_inc(&bp->b_hold);
> > >  		}
> > > -		spin_unlock(&bp->b_lock);
> > >  		spin_unlock(&pag->pag_buf_lock);
> > >  	} else {
> > >  		/*
> > > @@ -914,15 +939,24 @@ xfs_buf_rele(
> > >  		} else {
> > >  			ASSERT(list_empty(&bp->b_lru));
> > >  		}
> > > -		spin_unlock(&bp->b_lock);
> > >  
> > >  		ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
> > >  		rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
> > >  		spin_unlock(&pag->pag_buf_lock);
> > >  		xfs_perag_put(pag);
> > > -		xfs_buf_free(bp);
> > > +		freebuf = true;
> > >  	}
> > > +	} else if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru)) {
> > > +		/*
> > > +		 * The buffer is already on the LRU and it holds the only
> > > +		 * reference. Drop the in flight state.
> > > +		 */
> > > +		xfs_buf_rele_in_flight(bp);
> > >  	}
> >
> > This b_hold check is racy - bp->b_lock is not enough to stabilise
> > the b_hold count. Because we don't hold the buffer semaphore any
> > more, another buffer reference holder can successfully run the above
> > atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock). New references
> > can be taken in xfs_buf_find() so the count could go up, but I
> > think that's fine given the eventual case we care about here is
> > draining references on unmount.
> >
> > I think this is still ok for draining references, too, because of
> > the flag check inside xfs_buf_rele_in_flight(). If we race on a
> > transition to a value of 1, then we end up running the branch in each
> > caller. If we race on the transition to zero, then the caller that is
> > releasing the buffer will execute xfs_buf_rele_in_flight() and all
> > will be well.
> >
> > Needs comments, and maybe restructuring the code to handle the
> > xfs_buf_rele_in_flight() call up front so it's clear that io
> > accounting is a separate case from the rest of release
> > handling. e.g.
> >
> > 	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
> > 	spin_lock(&bp->b_lock);
> > 	if (!release) {
> > 		if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
> > 			xfs_buf_ioacct_dec(bp);
> > 		goto out_unlock;
> > 	}
> > 	xfs_buf_ioacct_dec(bp);
> >
> > 	/* rest of release code, one level of indentation removed */
> >
> > out_unlock:
> > 	spin_unlock(&bp->b_lock);
> >
> > 	if (freebuf)
> > 		xfs_buf_free(bp);
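That reads better to me. To capture the race reasoning in the code, I'd
comment the up-front accounting roughly as follows - untested sketch,
helper renamed per your suggestion:

	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
	spin_lock(&bp->b_lock);
	if (!release) {
		/*
		 * Drop the in-flight state if the buffer is already on the
		 * LRU and it holds the only reference. This is racy because
		 * we haven't acquired the pag lock, but the use of
		 * _XBF_IN_FLIGHT ensures the decrement happens only once
		 * per buffer.
		 */
		if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
			xfs_buf_ioacct_dec(bp);
		goto out_unlock;
	}

	/* the last reference has been dropped */
	xfs_buf_ioacct_dec(bp);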
> > > @@ -1341,6 +1375,18 @@ xfs_buf_submit(
> > >  	 * xfs_buf_ioend too early.
> > >  	 */
> > >  	atomic_set(&bp->b_io_remaining, 1);
> > > +
> > > +	/*
> > > +	 * Bump the I/O in flight count on the buftarg if we haven't yet done
> > > +	 * so for this buffer. Skip uncached buffers because many of those
> > > +	 * (e.g., superblock, log buffers) are never released.
> > > +	 */
> > > +	if ((bp->b_bn != XFS_BUF_DADDR_NULL) &&
> > > +	    !(bp->b_flags & _XBF_IN_FLIGHT)) {
> > > +		bp->b_flags |= _XBF_IN_FLIGHT;
> > > +		percpu_counter_inc(&bp->b_target->bt_io_count);
> > > +	}
> >
> > xfs_buf_ioacct_inc()
> > {
> > 	if (bp->b_flags & (XBF_NO_IOACCT | _XBF_IN_FLIGHT))
> > 		return;
> > 	percpu_counter_inc(&bp->b_target->bt_io_count);
> > 	bp->b_flags |= _XBF_IN_FLIGHT;
> > }
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
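For symmetry, I'd expect the release side to simply be
xfs_buf_rele_in_flight() renamed to match (sketch):

	static inline void
	xfs_buf_ioacct_dec(
		struct xfs_buf	*bp)
	{
		if (!(bp->b_flags & _XBF_IN_FLIGHT))
			return;

		ASSERT(bp->b_flags & XBF_ASYNC);
		bp->b_flags &= ~_XBF_IN_FLIGHT;
		percpu_counter_dec(&bp->b_target->bt_io_count);
	}

Brian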