Re: [PATCH] xfs: close xfs_wait_buftarg() race with buffer lookups

On Tue, Aug 23, 2016 at 09:24:10AM +1000, Dave Chinner wrote:
> On Mon, Aug 22, 2016 at 12:31:01PM -0400, Brian Foster wrote:
> > xfs_wait_buftarg() is invoked on unmount and filesystem freeze to drain and
> > free the buffer LRU. We have reports of filesystem freeze hangs with the
> > following call chain:
> > 
> >   ->xfs_wait_buftarg()
> >     ->xfs_log_quiesce()
> >       ->xfs_quiesce_attr()
> >         ->xfs_fs_freeze()
> >           ...
> > 
> > This hang can be reproduced with a long enough running fsstress instance
> > running in parallel with a tight freeze/unfreeze loop. The cause of the
> > hang is racing b_hold updates between xfs_wait_buftarg() and
> > _xfs_buf_find(). Specifically, the buftarg wait path checks whether a
> > buffer has a >1 b_hold count to determine whether to skip the buffer.
> > If b_hold == 1, xfs_buftarg_wait_rele() proceeds to prepare the buffer
> > for the final removal and ultimately calls xfs_buf_rele() to drop the
> > LRU reference.
> 
> So, dumb question: why are we reclaiming the entire buffer cache
> when the filesystem is being frozen?
> 

Not sure. I noticed that briefly, but I guess I was too preoccupied with
figuring out what was happening to step back and think about it. It
looks like this changed in commit c75921a7 ("xfs: xfs_quiesce_attr()
should quiesce the log like unmount"), which may or may not have been
motivated by fs freeze. AFAICT, it just looks like a semi-lazy reuse of
the unmount mechanism. E.g., xfs_fs_freeze() was calling
xfs_quiesce_attr() at the time and that commit altered the latter to
reuse xfs_log_quiesce(), which happens to call xfs_wait_buftarg().
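
Roughly, the chain after that commit looks like the following (paraphrased
from memory rather than copied from the tree, so details may differ
slightly):

	xfs_fs_freeze()
	  xfs_quiesce_attr()
	    xfs_log_quiesce()
	      xfs_log_force(mp, XFS_LOG_SYNC);
	      xfs_wait_buftarg(mp->m_ddev_targp);	/* drains the buffer LRU */
	      ...
	      xfs_log_unmount_write(mp);		/* write unmount record */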

> > The problem is that _xfs_buf_find() can acquire a b_hold reference any
> > time after xfs_buftarg_wait_rele() has decided it has the only remaining
> > reference. If this occurs, xfs_wait_buftarg() drops the LRU reference,
> > but the xfs_buf_rele() instance doesn't actually remove the buffer from
> > the LRU due to the _xfs_buf_find() hold.
> 
> Right.
> 
> 
> > At this point b_hold == 1, yet
> > the buffer is held via the _xfs_buf_find() codepath and still remains on
> > the LRU.  Both call chains will ultimately call xfs_buf_rele() on a
> > buffer with b_hold == 1.
> 
> I don't follow your logic here? We've gone from a count of 2 (i.e.
> lru + racing lookup) to a count of 1 by dropping the LRU hold count,
> but now you're saying that both the find and lru still need to call
> xfs_buf_rele()? I'm clearly missing something here - how do we get
> to a hold count of 1 here without dropping a reference twice?
> 

Sorry if the explanation is unclear. I probably relied too much on the
downstream code to describe the problem as this is more difficult to
reproduce upstream. Downstream is also slightly different in that it
predates the list_lru-based mechanism.
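
For reference, the pre-list_lru drain loop in xfs_wait_buftarg() looked
roughly like this (paraphrased from memory of older mainline; field and
lock names are from that era and the downstream code may differ):

	spin_lock(&btp->bt_lru_lock);
	while (!list_empty(&btp->bt_lru)) {
		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
		if (atomic_read(&bp->b_hold) > 1) {
			/* buffer still held elsewhere, back off and retry */
			spin_unlock(&btp->bt_lru_lock);
			delay(100);
			spin_lock(&btp->bt_lru_lock);
			continue;
		}
		/*
		 * Clear b_lru_ref so the xfs_buf_rele() below drops the
		 * buffer off the LRU and frees it.
		 */
		atomic_set(&bp->b_lru_ref, 0);
		spin_unlock(&btp->bt_lru_lock);
		xfs_buf_rele(bp);
		spin_lock(&btp->bt_lru_lock);
	}
	spin_unlock(&btp->bt_lru_lock);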

> (I wrote this out to analyse it, so I'll leave it here for discussion
> purposes).
> 
> The code paths are:
> 
> Freeze process			lookup process
> 
> xfs_buftarg_wait
> <lru>
> xfs_buftarg_wait_rele()
> b_hold = 1
> lock(b_lock)
> b_lru_ref = 0
> XFS_BSTATE_DISPOSE
> move to dispose list
> unlock(b_lock)
> .....
> 				_xfs_buf_find()
> 				<finds buffer>
> 				atomic_inc(b_hold)	//hold count = 2
> 				.....
> walks dispose list
> xfs_buf_rele()
> atomic_dec(b_hold)					//hold count = 1
> release == false
> <does nothing>
> 
> 				b_lru_ref set to non-zero
> 				.....
> 				xfs_buf_rele()
> 				atomic_dec(b_hold)	// hold_count = 0
> 				!stale, b_lru_ref > 0
> 				add back to LRU,
> 				atomic_inc(b_hold)	// hold_count = 1
> 				~XFS_BSTATE_DISPOSE
> 
> 
> So where does the hold count become 1 while both still need to call
> xfs_buf_rele()?
> 

Hmm, it's possible the hang due to this race is a downstream-only
problem and the upstream hang is purely the workqueue issue. Looking at
the upstream code again, the part I missed the first time around is that
the buffer is dropped from the LRU by xfs_wait_buftarg() rather than
xfs_buf_rele(). The fact that this doesn't happen in the downstream code is
sort of the problem. E.g., in downstream, xfs_wait_buftarg() calls
xfs_buf_rele() which never removes the item from the LRU due to the
elevated hold count, so effectively it ends up calling xfs_buf_rele()
again and again until the buffer is removed from the LRU (and thus
freed). The upstream xfs_wait_buftarg() checks b_hold, migrates the
buffer to the dispose list, removes it from the dispose list, then
invokes xfs_buf_rele() and never sees the buffer again. The list_lru
management here should
ensure that xfs_buf_rele() is called only once for the LRU afaict...
(though I'm still not sure passing through xfs_wait_buftarg() with a
buffer being held by another context is quite correct, even if the race
is more of a landmine than anything).
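
To make that flow concrete, the upstream drain looks roughly like this
(condensed paraphrase from memory, not a verbatim copy of the source):

	/* xfs_buftarg_wait_rele(), the list_lru_walk() callback, roughly */
	if (atomic_read(&bp->b_hold) > 1)
		return LRU_SKIP;	/* still in use, retry on a later pass */
	if (!spin_trylock(&bp->b_lock))
		return LRU_SKIP;
	/* clear b_lru_ref so xfs_buf_rele() doesn't put it back on the LRU */
	atomic_set(&bp->b_lru_ref, 0);
	bp->b_state |= XFS_BSTATE_DISPOSE;
	list_lru_isolate_move(lru, item, dispose);
	spin_unlock(&bp->b_lock);
	return LRU_REMOVED;

	/* xfs_wait_buftarg(), roughly */
	while (list_lru_count(&btp->bt_lru)) {
		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
			      &dispose, LONG_MAX);
		while (!list_empty(&dispose)) {
			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
			list_del_init(&bp->b_lru);
			/* the only xfs_buf_rele() issued for the LRU reference */
			xfs_buf_rele(bp);
		}
	}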

I thought I had confirmed/reproduced this upstream, but it's possible
that occurred before I recognized the workqueue hang was independent.
Let me step back and try to reproduce this with the hack to work around
the workqueue issue and see what happens. If it reoccurs, I'll try to
put together more applicable details. If not, we can probably drop this
for now and incorporate it as a downstream-only fix...

Brian

> > Update xfs_buftarg_wait_rele() to properly synchronize the LRU drain
> > against buffer lookups. Atomically decrement and lock the perag
> > associated with the buffer to lock out buffer lookups before
> > xfs_wait_buftarg() determines whether the LRU holds the final reference.
> > Open code freeing of the buffer so we can remove it from the perag
> > rbtree before the perag lock is dropped and guarantee it cannot be
> > looked up once freeing is imminent. Also update xfs_buftarg_wait_rele()
> > to drop and reacquire the lru_lock in the correct order to avoid
> > deadlocks with LRU insertion.
> 
> While this is a solution, I don't yet understand the race being
> solved, so I can't comment on whether it's the best solution or not.
> It doesn't happen at unmount because there can't be racing lookups
> occurring at that point in unmount, so it comes back to this
> question: why are we invalidating the entire LRU on freeze?
> 
> We really only need to wait for IO to complete during freeze,
> right?  i.e. xfs_wait_buftarg() is doing 2 things: the first is
> waiting for outstanding IO to complete, and the other is reclaiming
> the LRU. And for freeze, we don't need to do the second?
> 
> So if we stop trying to reclaim the buffer cache on freeze, then the
> whole lookup vs forced reclaim problem will go away?
> 
> > Second, I believe we still have an independent freeze hang upstream due
> > to an issue with how we use drain_workqueue(). Specifically, I see the
> > following warning from kernel/workqueue.c:__queue_work():
> > 
> > 	if (unlikely(wq->flags & __WQ_DRAINING) &&
> > 	    WARN_ON_ONCE(!is_chained_work(wq)))
> > 		return;
> 
> Using flush_workqueue() would solve that problem. We've already
> guaranteed for unmount that there isn't any new incoming work. i.e.:
> 
>         while (percpu_counter_sum(&btp->bt_io_count)) {
> 		delay(100);
> 		flush_workqueue(btp->bt_mount->m_buf_workqueue);
> 	}
> 	flush_workqueue(btp->bt_mount->m_buf_workqueue);
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx



