On Fri, Nov 01, 2019 at 10:46:14AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Replace the AG radix tree walking reclaim code with a list_lru
> walker, giving us both node-aware and memcg-aware inode reclaim
> at the XFS level. This requires adding an inode isolation function
> to determine if the inode can be reclaimed, and a list walker to
> dispose of the inodes that were isolated.
>
> We want the isolation function to be non-blocking. If we can't
> grab an inode then we either skip it or rotate it. If it's clean
> then we skip it, if it's dirty then we rotate to give it time to be
> cleaned before it is scanned again.
>
> This congregates the dirty inodes at the tail of the LRU, which
> means that if we start hitting a majority of dirty inodes either
> there are lots of unlinked inodes in the reclaim list or we've
> reclaimed all the clean inodes and have looped back on the dirty
> inodes. Either way, this is an indication we should tell kswapd to
> back off.
>
> The non-blocking isolation function introduces a complexity for the
> filesystem shutdown case. When the filesystem is shut down, we want
> to free the inode even if it is dirty, and this may require
> blocking. We already hold the locks needed to do this blocking, so
> what we do is leave inodes locked - both the ILOCK and the
> flush lock - while they are sitting on the dispose list to be freed
> after the LRU walk completes. This allows us to process the
> shutdown state outside the LRU walk where we can block safely.
>
> Because we are now reclaiming inodes from the context that needs
> the memory (memcg and/or node), direct reclaim throttling within
> the high level reclaim code is now much more effective. Hence we
> don't wait on IO for either kswapd or direct reclaim. However, we
> have to tell kswapd to back off if we start hitting too many dirty
> inodes. This implies we've wrapped around the LRU and don't have
> many clean inodes left to reclaim, so it needs to wait a while for
> the AIL pushing to clean some of the remaining reclaimable inodes.
>
> Keep in mind we don't have to care about inode lock order or
> blocking with inode locks held here because a) we are using
> trylocks, and b) once marked with XFS_IRECLAIM they can't be found
> via the LRU and inode cache lookups will abort and retry. Hence
> nobody will try to lock them in any other context that might also
> be holding other inode locks.
>
> Also convert xfs_reclaim_all_inodes() to use a LRU walk to free all
> the reclaimable inodes in the filesystem.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---

Looks fundamentally sane. Some logic quibbles..

>  fs/xfs/xfs_icache.c | 404 +++++++++++++-------------------------------
>  fs/xfs/xfs_icache.h |  18 +-
>  fs/xfs/xfs_inode.h  |  18 ++
>  fs/xfs/xfs_super.c  |  46 ++++-
>  4 files changed, 190 insertions(+), 296 deletions(-)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 350f42e7730b..05dd292bfdb6 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -968,160 +968,110 @@ xfs_inode_ag_iterator_tag(
>  	return last_error;
>  }
>
> -/*
> - * Grab the inode for reclaim.
> - *
> - * Return false if we aren't going to reclaim it, true if it is a reclaim
> - * candidate.
> - *
> - * If the inode is clean or unreclaimable, return 0 to tell the caller it does
> - * not require flushing. Otherwise return the log item lsn of the inode so the
> - * caller can determine it's inode flush target. If we get the clean/dirty
> - * state wrong then it will be sorted in xfs_reclaim_inode() once we have locks
> - * held.
> - */
> -STATIC bool
> -xfs_reclaim_inode_grab(
> -	struct xfs_inode	*ip,
> -	int			flags,
> -	xfs_lsn_t		*lsn)
> +enum lru_status
> +xfs_inode_reclaim_isolate(
> +	struct list_head	*item,
> +	struct list_lru_one	*lru,
> +	spinlock_t		*lru_lock,

Did we ever establish whether we should cycle the lru_lock during
long-running scans?
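If we do want that, my understanding of the list_lru contract is that
the callback may drop lru_lock as long as it reacquires it before
returning and returns LRU_RETRY (or LRU_REMOVED_RETRY) so the walker
restarts the traversal safely. Purely as an illustrative sketch at the
top of the callback, before any inode locks are taken (the ->scanned
counter and batch size below are made up, not in this patch):

	/* every 32 items, drop the lru lock and breathe */
	if (++ra->scanned % 32 == 0) {
		spin_unlock(lru_lock);
		cond_resched();
		spin_lock(lru_lock);
		return LRU_RETRY;
	}

The downside is that LRU_RETRY restarts the walk from the head of the
list, so this is only worth it if lock hold times are actually a
problem in practice.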
> +	void			*arg)
>  {
> -	ASSERT(rcu_read_lock_held());
> -	*lsn = 0;
> +	struct xfs_ireclaim_args *ra = arg;
> +	struct inode		*inode = container_of(item, struct inode,
> +							i_lru);
> +	struct xfs_inode	*ip = XFS_I(inode);

Whitespace damage on the above lines (space indentation vs tabs).

> +	enum lru_status		ret;
> +	xfs_lsn_t		lsn = 0;
> +
> +	/* Careful: inversion of iflags_lock and everything else here */
> +	if (!spin_trylock(&ip->i_flags_lock))
> +		return LRU_SKIP;
> +
> +	/* if we are in shutdown, we'll reclaim it even if dirty */
> +	ret = LRU_ROTATE;
> +	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE) &&
> +	    !XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;
> +		ra->dirty_skipped++;
> +		goto out_unlock_flags;
> +	}
>
> -	/* quick check for stale RCU freed inode */
> -	if (!ip->i_ino)
> -		return false;
> +	ret = LRU_SKIP;
> +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> +		goto out_unlock_flags;
>
> -	/*
> -	 * Do unlocked checks to see if the inode already is being flushed or in
> -	 * reclaim to avoid lock traffic. If the inode is not clean, return the
> -	 * position in the AIL for the caller to push to.
> -	 */
> -	if (!xfs_inode_clean(ip)) {
> -		*lsn = ip->i_itemp->ili_item.li_lsn;
> -		return false;
> +	if (!__xfs_iflock_nowait(ip)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;

This looks like a potential crash vector if we ever got here with a
clean inode.

> +		ra->dirty_skipped++;
> +		goto out_unlock_inode;
>  	}
>
> -	if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> -		return false;
> +	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> +		goto reclaim;
>
>  	/*
> -	 * The radix tree lock here protects a thread in xfs_iget from racing
> -	 * with us starting reclaim on the inode. Once we have the
> -	 * XFS_IRECLAIM flag set it will not touch us.
> -	 *
> -	 * Due to RCU lookup, we may find inodes that have been freed and only
> -	 * have XFS_IRECLAIM set. Indeed, we may see reallocated inodes that
> -	 * aren't candidates for reclaim at all, so we must check the
> -	 * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
> +	 * Now the inode is locked, we can actually determine if it is dirty
> +	 * without racing with anything.
>  	 */
> -	spin_lock(&ip->i_flags_lock);
> -	if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
> -	    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
> -		/* not a reclaim candidate. */
> -		spin_unlock(&ip->i_flags_lock);
> -		return false;
> +	ret = LRU_ROTATE;
> +	if (xfs_ipincount(ip)) {
> +		ra->dirty_skipped++;

Hmm.. didn't we have an LSN check here?

Altogether, I think the logic in this function would be a lot simpler
if we had something like the following:

	...
	/* ret == LRU_SKIP */
	if (!xfs_inode_clean(ip)) {
		ret = LRU_ROTATE;
		lsn = ip->i_itemp->ili_item.li_lsn;
		ra->dirty_skipped++;
	}
	if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0)
		ra->lowest_lsn = lsn;
	return ret;
	...

as the non-reclaim exit path. Then the earlier logic simply dictates
how we process the inode instead of conflating lru processing with
lsn/dirty checks. Otherwise, for example (based on the current logic),
it's not really clear to me whether ->dirty_skipped cares about dirty
inodes or just the fact that we skipped an inode.
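To flesh that out a bit (untested sketch only - the out_*/reclaim
label layout here is mine, everything else reuses helpers already in
this patch), I'm picturing the tail of the function as roughly:

	ret = LRU_SKIP;
	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		goto out_unlock_flags;
	if (!__xfs_iflock_nowait(ip))
		goto out_unlock_inode;
	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
		goto reclaim;
	if (xfs_ipincount(ip))
		goto out_ifunlock;
	if (xfs_inode_clean(ip) || __xfs_iflags_test(ip, XFS_ISTALE))
		goto reclaim;
	/* dirty: drop the locks and fall into the common exit path */
out_ifunlock:
	__xfs_ifunlock(ip);
out_unlock_inode:
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
out_unlock_flags:
	if (!xfs_inode_clean(ip)) {
		/* !clean implies the log item exists, so li_lsn is safe */
		ret = LRU_ROTATE;
		lsn = ip->i_itemp->ili_item.li_lsn;
		ra->dirty_skipped++;
	}
	if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0)
		ra->lowest_lsn = lsn;
	spin_unlock(&ip->i_flags_lock);
	return ret;

reclaim:
	/*
	 * As the patch does now: mark XFS_IRECLAIM, move the inode to
	 * ra->freeable and return LRU_REMOVED, leaving it ILOCKed and
	 * flush locked for xfs_dispose_inodes().
	 */

That way the rotate/skip decision and all of the lsn/dirty accounting
live in exactly one place, and ->dirty_skipped unambiguously counts
dirty inodes.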
> +		goto out_ifunlock;
> +	}
> +	if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) {
> +		lsn = ip->i_itemp->ili_item.li_lsn;
> +		ra->dirty_skipped++;
> +		goto out_ifunlock;
>  	}
> +
...
> @@ -1165,167 +1108,52 @@ xfs_reclaim_inode(
...
>  void
>  xfs_reclaim_all_inodes(
>  	struct xfs_mount	*mp)
>  {
...
> +	while (list_lru_count(&mp->m_inode_lru)) {

It seems unnecessary to call this twice per-iter:

	while ((to_free = list_lru_count(&mp->m_inode_lru))) {
		...
	}

Hm?

Brian

> +		struct xfs_ireclaim_args ra;
> +		long freed, to_free;
> +
> +		xfs_ireclaim_args_init(&ra);
> +
> +		to_free = list_lru_count(&mp->m_inode_lru);
> +		freed = list_lru_walk(&mp->m_inode_lru,
> +				      xfs_inode_reclaim_isolate, &ra, to_free);
> +		xfs_dispose_inodes(&ra.freeable);
> +
> +		if (freed == 0) {
> +			xfs_log_force(mp, XFS_LOG_SYNC);
> +			xfs_ail_push_all(mp->m_ail);
> +		} else if (ra.lowest_lsn != NULLCOMMITLSN) {
> +			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
> +		}
> +		cond_resched();
> +	}
>  }
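I.e., with to_free hoisted to the top of the function, the whole loop
would collapse to something like the below (untested, same calls as
the patch, just restructured):

	struct xfs_ireclaim_args ra;
	long freed, to_free;

	while ((to_free = list_lru_count(&mp->m_inode_lru))) {
		xfs_ireclaim_args_init(&ra);

		freed = list_lru_walk(&mp->m_inode_lru,
				      xfs_inode_reclaim_isolate, &ra, to_free);
		xfs_dispose_inodes(&ra.freeable);

		if (freed == 0) {
			/* nothing reclaimed: force the log, push the AIL */
			xfs_log_force(mp, XFS_LOG_SYNC);
			xfs_ail_push_all(mp->m_ail);
		} else if (ra.lowest_lsn != NULLCOMMITLSN) {
			xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
		}
		cond_resched();
	}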
"xfs_trans_priv.h" > #include "xfs_buf_item.h" > #include "xfs_log.h" > #include "xfs_log_priv.h" > @@ -1772,23 +1773,54 @@ xfs_fs_mount( > } > > static long > -xfs_fs_nr_cached_objects( > +xfs_fs_free_cached_objects( > struct super_block *sb, > struct shrink_control *sc) > { > - /* Paranoia: catch incorrect calls during mount setup or teardown */ > - if (WARN_ON_ONCE(!sb->s_fs_info)) > - return 0; > + struct xfs_mount *mp = XFS_M(sb); > + struct xfs_ireclaim_args ra; > + long freed; > > - return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc); > + xfs_ireclaim_args_init(&ra); > + > + freed = list_lru_shrink_walk(&mp->m_inode_lru, sc, > + xfs_inode_reclaim_isolate, &ra); > + xfs_dispose_inodes(&ra.freeable); > + > + /* > + * Deal with dirty inodes. We will have the LSN of > + * the oldest dirty inode in our reclaim args if we skipped any. > + * > + * For kswapd, if we skipped too many dirty inodes (i.e. more dirty than > + * we freed) then we need kswapd to back off once it's scan has been > + * completed. That way it will have some clean inodes once it comes back > + * and can make progress, but make sure we have inode cleaning in > + * progress. > + * > + * Direct reclaim will be throttled by the caller as it winds the > + * priority up. All we need to do is keep pushing on dirty inodes > + * in the background so when we come back progress will be made. > + */ > + if (current_is_kswapd() && ra.dirty_skipped >= freed) { > + if (current->reclaim_state) > + current->reclaim_state->need_backoff = true; > + } > + if (ra.lowest_lsn != NULLCOMMITLSN) > + xfs_ail_push(mp->m_ail, ra.lowest_lsn); > + > + return freed; > } > > static long > -xfs_fs_free_cached_objects( > +xfs_fs_nr_cached_objects( > struct super_block *sb, > struct shrink_control *sc) > { > - return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan); > + /* Paranoia: catch incorrect calls during mount setup or teardown */ > + if (WARN_ON_ONCE(!sb->s_fs_info)) > + return 0; > + > + return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc); > } > > static const struct super_operations xfs_super_operations = { > -- > 2.24.0.rc0 >