On Fri, Jul 01, 2016 at 06:30:12PM -0400, Brian Foster wrote:
> On Fri, Jul 01, 2016 at 08:44:57AM +1000, Dave Chinner wrote:
> > On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> > > Newly allocated XFS metadata buffers are added to the LRU once the hold
> > > count is released, which typically occurs after I/O completion. There is
> > > no other mechanism at present that tracks the existence or I/O state of
> > > a new buffer. Further, readahead I/O tends to be submitted
> > > asynchronously by nature, which means the I/O can remain in flight and
> > > actually complete long after the calling context is gone. This means
> > > that file descriptors or any other holds on the filesystem can be
> > > released, allowing the filesystem to be unmounted while I/O is still in
> > > flight. When I/O completion occurs, core data structures may have been
> > > freed, causing completion to run into invalid memory accesses and likely
> > > to panic.
> > >
> > > This problem is reproduced on XFS via directory readahead. A filesystem
> > > is mounted, a directory is opened/closed and the filesystem immediately
> > > unmounted. The open/close cycle triggers a directory readahead that, if
> > > delayed long enough, runs buffer I/O completion after the unmount has
> > > completed.
> > >
> > > To work around this problem, add readahead buffers to the LRU earlier
> > > than other buffers (when the buffer is allocated, specifically). The
> > > buffer hold count will ultimately remain until I/O completion, which
> > > means any shrinker activity will skip the buffer until then. This makes
> > > the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> > > unmount or quiesce waits for I/O completion appropriately.
> > >
> > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > ---
> > >
> > > This addresses the problem reproduced by the recently posted xfstests
> > > test:
> > >
> > > http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> > >
> > > This could probably be made more involved, i.e., to create another list
> > > of buffers in flight or some such. This seems more simple/sane to me,
> > > however, and survives my testing so far...
> > >
> > > Brian
> > >
> > >  fs/xfs/xfs_buf.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 4665ff6..3f03df9 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -590,8 +590,20 @@ xfs_buf_get_map(
> > >  		return NULL;
> > >  	}
> > >  
> > > +	/*
> > > +	 * If the buffer found doesn't match the one allocated above, somebody
> > > +	 * else beat us to insertion and we can toss the new one.
> > > +	 *
> > > +	 * If we did add the buffer and it happens to be readahead, add to the
> > > +	 * LRU now rather than waiting until the hold is released. Otherwise,
> > > +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> > > +	 * nothing else prevents an unmount before I/O completion.
> > > +	 */
> > >  	if (bp != new_bp)
> > >  		xfs_buf_free(new_bp);
> > > +	else if (flags & XBF_READ_AHEAD &&
> > > +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> > > +		atomic_inc(&bp->b_hold);
> > 
> > This doesn't sit right with me. The LRU is for "unused" objects, and
> > readahead objects are not unused until IO completes and nobody is
> > waiting on them.
> > 
> > As it is, this points out another problem with readahead buffers -
> > they aren't actually being cached properly because b_lru_ref == 0,
> > which means they are immediately reclaimed on IO completion rather
> > than being added to the LRU....
> > 
> > I also think that it's not sufficient to cover the generic case of
> > async IO that has no waiter. i.e. we could do get_buf, submit async
> > write, drop submitter reference, and now we have the same problem
> > but on a write. i.e. this problem is an async IO issue, not a
> > readahead issue.
> > 
> > I think that it might be better to fix it by doing this:
> > 
> > 1. ensure async IO submission always has b_lru_ref set, and
> > if it isn't, set it to 1. This ensures the buffer will be
> > added to the LRU on completion if it isn't already there.
> > 
> > 2. keep a count of async buffer IO in progress. A per-cpu
> > counter in the buftarg will be fine for this. Increment in
> > xfs_buf_submit(), decrement in the xfs_buf_rele() call from
> > xfs_buf_iodone() once we've determined if the buffer needs
> > adding to the LRU or not.
> > 
> > 3. make xfs_wait_buftarg() wait until the async IO count
> > goes to zero before it gives up trying to release buffers on
> > the LRU.
> > 
> 
> After playing with this a bit this afternoon, I don't think it is so
> straightforward to maintain consistency between xfs_buf_submit() and
> xfs_buf_rele(). Some buffers are actually never released (superblock,
> log buffers). Other buffers can actually be submitted for I/O multiple
> times before they are ultimately released (e.g., log recovery buffer
> read -> delwri submission).
> 

I think I can get around these problems by skipping all uncached I/O
and maintaining a per-buffer I/O count that is sunk into the global
buftarg count once the buffer is released. E.g., something like the
following patch. Not fully tested, but works on some quick tests...

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4665ff6..8a04b66 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -177,6 +177,7 @@ _xfs_buf_alloc(
 	XB_SET_OWNER(bp);
 	bp->b_target = target;
 	bp->b_flags = flags;
+	atomic_set(&bp->b_io_count, 0);
 
 	/*
 	 * Set length and io_length to the same value initially.
@@ -865,6 +866,21 @@ xfs_buf_hold(
 	atomic_inc(&bp->b_hold);
 }
 
+static inline void
+xfs_buf_rele_iocount(
+	struct xfs_buf	*bp)
+{
+	int		val;
+
+	val = atomic_read(&bp->b_io_count);
+	if (!val)
+		return;
+
+	atomic_sub(val, &bp->b_io_count);
+	percpu_counter_add(&bp->b_target->bt_io_count, -val);
+	wake_up(&bp->b_target->bt_io_wait);
+}
+
 /*
  * Releases a hold on the specified buffer. If the
  * the hold count is 1, calls xfs_buf_free.
@@ -880,8 +896,10 @@ xfs_buf_rele(
 	if (!pag) {
 		ASSERT(list_empty(&bp->b_lru));
 		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
-		if (atomic_dec_and_test(&bp->b_hold))
+		if (atomic_dec_and_test(&bp->b_hold)) {
+			xfs_buf_rele_iocount(bp);
 			xfs_buf_free(bp);
+		}
 		return;
 	}
@@ -890,6 +908,9 @@ xfs_buf_rele(
 	ASSERT(atomic_read(&bp->b_hold) > 0);
 	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
 		spin_lock(&bp->b_lock);
+
+		xfs_buf_rele_iocount(bp);
+
 		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
 			/*
 			 * If the buffer is added to the LRU take a new
@@ -1277,6 +1298,18 @@ _xfs_buf_ioapply(
 		rw |= REQ_META;
 
 	/*
+	 * XXX: uncached check indirectly filters out the sb buffer and log
+	 * buffers (possibly others?), that are held and never released to the
+	 * LRU
+	 */
+	if (bp->b_flags & XBF_ASYNC &&
+	    bp->b_bn != XFS_BUF_DADDR_NULL &&	/* uncached */
+	    atomic_read(&bp->b_lru_ref) && list_empty(&bp->b_lru)) {
+		percpu_counter_inc(&bp->b_target->bt_io_count);
+		atomic_inc(&bp->b_io_count);
+	}
+
+	/*
 	 * Walk all the vectors issuing IO on them. Set up the initial offset
 	 * into the buffer and the desired IO size before we start -
 	 * _xfs_buf_ioapply_vec() will modify them appropriately for each
@@ -1533,6 +1566,8 @@ xfs_wait_buftarg(
 	 * ensure here that all reference counts have been dropped before we
 	 * start walking the LRU list.
 	 */
+	wait_event(btp->bt_io_wait,
+		   (percpu_counter_sum(&btp->bt_io_count) == 0));
 	drain_workqueue(btp->bt_mount->m_buf_workqueue);
 
 	/* loop until there is nothing left on the lru list. */
@@ -1629,6 +1664,8 @@ xfs_free_buftarg(
 	struct xfs_buftarg	*btp)
 {
 	unregister_shrinker(&btp->bt_shrinker);
+	ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
+	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
 
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
@@ -1693,6 +1730,10 @@ xfs_alloc_buftarg(
 	if (list_lru_init(&btp->bt_lru))
 		goto error;
 
+	if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
+		goto error;
+	init_waitqueue_head(&btp->bt_io_wait);
+
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
 	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8bfb974..2518d09 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -115,6 +115,10 @@ typedef struct xfs_buftarg {
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
 	struct list_lru		bt_lru;
+
+	/* XXX: would atomic_t suffice? */
+	struct percpu_counter	bt_io_count;
+	wait_queue_head_t	bt_io_wait;
 } xfs_buftarg_t;
 
 struct xfs_buf;
@@ -183,6 +187,7 @@ typedef struct xfs_buf {
 	unsigned int		b_page_count;	/* size of page array */
 	unsigned int		b_offset;	/* page offset in first page */
 	int			b_error;	/* error code on I/O */
+	atomic_t		b_io_count;
 
 	/*
 	 * async write failure retry count. Initialised to zero on the first

> I have a semi-functional patch that holds more of a pure I/O count,
> which means the count is decremented immediately in xfs_buf_ioend()
> rather than deferred to release. One downside is that while this
> technically still resolves the original problem, it's racy in that the
> count is dropped before the buffer is added to the LRU. This still works
> for the original problem because we also drain the ioend workqueue in
> xfs_wait_buftarg(), but it's not correct because we allow for
> non-deferred completion in the event of I/O errors (i.e.,
> xfs_buf_ioend() called directly from xfs_buf_submit()).
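
To make the race in that variant concrete, the problematic ordering
looks roughly like the following. This is only a sketch for discussion,
not the actual patch; the decrement site in xfs_buf_ioend() is inferred
from the description above, and unrelated code is elided:

void
xfs_buf_ioend(
	struct xfs_buf	*bp)
{
	...
	/*
	 * Sketch: drop the in-flight count at completion time.
	 * xfs_wait_buftarg() may now observe bt_io_count == 0 and start
	 * its final LRU walk...
	 */
	if (bp->b_flags & XBF_ASYNC) {
		percpu_counter_dec(&bp->b_target->bt_io_count);
		wake_up(&bp->b_target->bt_io_wait);
	}
	...
}

void
xfs_buf_rele(
	struct xfs_buf	*bp)
{
	...
	/*
	 * ...but the buffer only becomes visible on the LRU here, in the
	 * subsequent release, so the final walk can miss a buffer whose
	 * I/O has completed but which has not yet been released.
	 */
	if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
		atomic_inc(&bp->b_hold);
	...
}

Sinking the count at release time, as in the patch above, is meant to
avoid that window: the count only drops once the buffer is on its way
onto the LRU or into xfs_buf_free().
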
> 
> Brian
> 
> > That will ensure readahead buffers are cached, and we capture both
> > async read and async write buffers in xfs_wait_buftarg().
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@xxxxxxxxxxxxx